react-llm is a set of headless React hooks and components for running an LLM completely clientside in the browser with WebGPU, starting with useLLM.
There’s a live demo running at chat.matt-rickard.com. To demonstrate how to use the library (it’s designed for you to bring your own UI), I put together a quick retro UI that looks like an AIM instant-message window with a “SmartestChild”. It only works on recent versions of Chrome (113+) on desktop.
LLMs are both (1) expensive to run inference on and (2) hard to self-host. There’s been a lot of work to run them in the browser (“the new OS”), but those efforts are tough to set up and integrate into modern front-end frameworks. What if you could serve models entirely clientside? With WebGPU shipping, that’s beginning to become a reality.
react-llm sets everything up for you — an off-the-main-thread worker that fetches the model from a CDN (HuggingFace), cross-compiles the WebAssembly components (like the tokenizer and model bindings), and manages the model state (attention KV cache, and more). Everything runs clientside — the model is cached and inference happens in the browser. Conversations are stored in session storage.
- Everything about the model is customizable, from the system prompt to the user and assistant role names (see the sketch after this list).
- Completion options like max tokens and stop sequences are available in the API.
- Supports the LLaMA family of models (starting with Vicuna-13B).
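For example, renaming the chat roles and capping completions might look roughly like the sketch below, using the useLLM hook introduced in the next section; the role-name setters and the extra arguments to send are assumptions based on the feature list, not confirmed signatures (see the docs for the real API):

import { useEffect } from "react";
import { useLLM } from "react-llm";

// A minimal sketch (assumes it renders under the ModelProvider shown below).
// The role-name setters and send's extra arguments are assumptions, not confirmed API.
function useSmartestChild() {
  const { send, setUserRoleName, setAssistantRoleName } = useLLM();

  useEffect(() => {
    setUserRoleName("You");
    setAssistantRoleName("SmartestChild");
  }, [setUserRoleName, setAssistantRoleName]);

  // Assumed signature: send(text, maxTokens, stopSequences)
  return (text: string) => send(text, 250, ["\n\n"]);
}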
The API is simple — use it as a React hook or context provider:
<ModelProvider>
  <YourApp />
</ModelProvider>
Then, in your component:
const { send, conversation, init } = useLLM()
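Put together, a minimal chat component might look something like the sketch below; the field names on conversation (messages, role, text) are assumptions, so check the docs for the real types:

import { useState } from "react";
import { useLLM } from "react-llm";

// Rendered somewhere under <ModelProvider>. The shape of conversation
// (messages / role / text) is an assumed sketch, not the documented types.
function Chat() {
  const { send, conversation, init } = useLLM();
  const [draft, setDraft] = useState("");

  return (
    <div>
      {/* init() kicks off downloading and loading the model (a sketch of the flow). */}
      <button onClick={() => init()}>Load model</button>

      <ul>
        {conversation?.messages?.map((m, i) => (
          <li key={i}>
            {m.role}: {m.text}
          </li>
        ))}
      </ul>

      <input value={draft} onChange={(e) => setDraft(e.target.value)} />
      <button onClick={() => { send(draft); setDraft(""); }}>Send</button>
    </div>
  );
}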
See the docs for the entire API.
How does it work? There are many moving parts, and not surprisingly, it requires a lot of coordination between systems engineering, browser APIs, and frontend frameworks.
- SentencePiece (the tokenizer) and the Apache TVM runtime are compiled to WebAssembly with Emscripten. The folks working on Apache TVM and MLC have done a lot of the low-level work to get the runtime working in the browser. These libraries were originally written in Python and C++.
- Both of these are initialized in an off-the-main-thread Web Worker, so inference happens outside the main render thread and doesn’t slow down the UI (see the sketch after this list). This worker is packaged alongside the React hooks.
- The worker downloads the model from HuggingFace and initializes the runtime and tokenizer.
- Finally, there’s some tedious state management and glue work to make everything easily consumable from React: hooks, contexts, and providers that make it easy to use across your application.
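The worker piece follows the standard Web Worker pattern. As a rough illustration of the main-thread side (the message names and payloads here are made up to show the pattern, not react-llm's actual protocol):

// Illustrative only: message shapes are invented, not react-llm's internal protocol.
const worker = new Worker(new URL("./llm.worker.ts", import.meta.url), { type: "module" });

// Hypothetical UI hook-up: append streamed tokens to an output element.
function appendToken(token: string) {
  const out = document.querySelector("#output");
  if (out) out.textContent = (out.textContent ?? "") + token;
}

// Ask the worker to fetch the model and set up the WASM runtime + tokenizer.
worker.postMessage({ type: "init", model: "vicuna-13b" });

// Generated tokens stream back to the main thread without blocking rendering.
worker.onmessage = (event: MessageEvent) => {
  if (event.data.type === "token") appendToken(event.data.token);
};

// Kick off a generation request.
worker.postMessage({ type: "generate", prompt: "hey, what's up?", maxTokens: 250 });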