In a world where “scale is all you need,” sometimes the biggest models don’t win. Some reasons why smaller LLMs might pull ahead.
Many of these points follow from each other.
- Quicker to train. Obvious, but quicker feedback means faster iterations. Faster training, faster fine-tuning, faster results.
- Runs locally. The smaller the model, the more environments it can run in.
- Easier to debug. If you can run the model on your laptop, you can poke at it directly when something goes wrong.
- No specialized hardware. Small LLMs rarely require specialized hardware to train or serve. In a market where the biggest chips are in high demand and short supply, this matters.
- Cost-effective. Smaller models are cheaper to run, which opens up more applications where the economics are NPV-positive.
- Lower latency. Smaller models generate completions faster. Most of today's largest models can't run in low-latency environments.
- Runs on the edge. Low latency, smaller file size, and shorter startup times mean that small LLMs can run at the edge.
- Easier to deploy. Getting to production is sometimes the hardest part.
- Can be ensembled. It's rumored that GPT-4 is a mixture of eight smaller models. Ensembling smaller models is a strategy that has worked for decades of pragmatic machine learning (a minimal voting sketch follows this list).
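For classification-style prompts, the simplest version of ensembling is a majority vote across several small models. Here's a minimal sketch, assuming each model is wrapped in a hypothetical `generate(prompt) -> str` callable (the wrappers and model names are placeholders, not a specific library's API):

```python
# A minimal sketch of majority-vote ensembling over several small models.
from collections import Counter
from typing import Callable, List

def ensemble_vote(models: List[Callable[[str], str]], prompt: str) -> str:
    """Ask every small model for an answer and return the most common one."""
    answers = [model(prompt) for model in models]
    # Normalize lightly so trivial formatting differences don't split the vote.
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

# Usage: three hypothetical 7B-class models answering the same question.
# models = [model_a.generate, model_b.generate, model_c.generate]
# print(ensemble_vote(models, "Is 1013 prime? Answer yes or no."))
```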
A few more conjectures on why small models might be better:
- More interpretable? We don't yet have a definitive theory of LLM interpretability, but I imagine we'll understand what's going on inside a 7-billion-parameter model before we understand a 60-billion-parameter one.
- Enhanced reproducibility? Small LLMs can easily be retrained from scratch. Contrast this with the largest LLMs, which may go through multiple checkpoints and rounds of continued training. Reproducing a model that was trained in an hour is much easier than reproducing one trained over six months.