Recently, I’ve been experimenting with on-device AI — libraries to run an LLM inside a Chrome extension (LLaMaTab) and, more generally, to embed one in any static React app (react-llm). But will on-device AI be the prevailing paradigm, or will we always use hosted APIs?
The answer is both. But the space of interesting local LLM use cases is growing.
Some interesting properties of on-device AI:
- Decentralizes the cost of serving models, opening up a class of use cases that wouldn’t otherwise be economically feasible.
- Smaller models are quicker to iterate on. More developers can experiment with them.
- Better fit for specific modalities (e.g., speech-to-text).
- Hardware companies (e.g., Apple) have a clear incentive to ship on-device AI.
The benefits of server-side inference:
- Economies of scale in hosting models: parameters are loaded once and shared across larger batch sizes, amortizing the cost per request (see the back-of-the-envelope sketch after this list).
- Online training and augmentation: new data can be incorporated via a search index or another data source.
- Fundamental limits on chip and RAM capacity mean the largest models can’t be served locally; the cloud is elastic.
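A back-of-the-envelope sketch of the batching argument, to make the amortization concrete. Every number here is an assumed illustration, not a measurement:

```ts
// How batching amortizes the fixed cost of serving a model.
// All numbers are hypothetical assumptions for illustration.

const gpuCostPerHour = 2.0;           // assumed hourly cost of one GPU ($)
const tokensPerSecondPerRequest = 50; // assumed decode speed for a single request

// With batching, the weights are loaded once and many requests share the same
// forward pass, so throughput scales (roughly) with batch size until the GPU
// becomes compute- or memory-bandwidth-bound.
function costPerMillionTokens(batchSize: number): number {
  const tokensPerHour = tokensPerSecondPerRequest * batchSize * 3600;
  return (gpuCostPerHour / tokensPerHour) * 1_000_000;
}

for (const batch of [1, 8, 32, 128]) {
  console.log(`batch=${batch}: $${costPerMillionTokens(batch).toFixed(3)} per 1M tokens`);
}
```

An on-device model skips the hosting bill entirely, but it can never share a forward pass across users the way a batched server can.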
Possible futures?
- Hybrid architecture — LLMs are (currently) stateless APIs, so local and hosted models can be strung together behind a single interface (see the routing sketch after this list).
- Many small on-device models. Instead of a single general-purpose model, route tasks to multiple small, purpose-built models.
- An ASIC for transformers that meaningfully accelerates on-device inference.
- Software optimizations that drastically lower the resources required to run inference on a model (see my hacker’s guide to LLM optimization, and the quantization sketch below).
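Here is a minimal sketch of what the hybrid architecture could look like: try an on-device model first and fall back to a hosted API. The `LocalModel` interface, the endpoint URL, and the routing heuristic are all hypothetical placeholders, not real library APIs.

```ts
// Hypothetical interface for a local (in-browser / on-device) model.
interface LocalModel {
  loaded: boolean;
  complete(prompt: string): Promise<string>;
}

// Placeholder for a hosted LLM endpoint; the URL and response shape are assumed.
async function callHostedApi(prompt: string): Promise<string> {
  const res = await fetch("https://example.com/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const data = await res.json();
  return data.completion;
}

// Because each call is a stateless request/response pair, a thin router can
// pick a backend per request with a simple heuristic.
async function complete(prompt: string, local: LocalModel): Promise<string> {
  const isSmallTask = prompt.length < 2_000; // assumed heuristic: short prompts stay local
  if (local.loaded && isSmallTask) {
    return local.complete(prompt);
  }
  return callHostedApi(prompt);
}
```

The same router shape extends to the many-small-models future: the heuristic just becomes a classifier that picks among several purpose-built local models before falling back to the cloud.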
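And as one concrete example of such a software optimization, a minimal sketch of symmetric 8-bit weight quantization: storing weights as int8 plus a per-tensor scale cuts memory roughly 4x versus float32. This is a simplified illustration, not the scheme any particular runtime uses.

```ts
// Minimal symmetric, per-tensor 8-bit weight quantization.
// Simplified for illustration; real runtimes typically quantize per-channel or per-group.

function quantize(weights: Float32Array): { q: Int8Array; scale: number } {
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  const scale = maxAbs / 127 || 1; // map [-maxAbs, maxAbs] onto [-127, 127]
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scale;
  }
  return out;
}

// Usage: a float32 weight tensor shrinks to ~1/4 the size (int8 values plus one scale).
const { q, scale } = quantize(new Float32Array([0.12, -0.5, 0.03, 0.9]));
console.log(dequantize(q, scale)); // values close to the originals
```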