In a recent interview, George Hotz claimed that GPT-4 is just an eight-way mixture of models, each with about 220B parameters. If true, that would make GPT-4 a Mixture of Experts (MoE) model with roughly 1.76 trillion parameters in total (8 x 220 billion).
Dense models reuse the same parameters for every input. Mixture of Experts models instead use different parameters depending on the example. You end up with a sparsely activated ensemble.
Routing between multiple experts isn’t easy: there’s overhead both in communication between the experts and in deciding where each input goes. In Switch Transformers (by researchers at Google), a gating network (typically a small neural network) produces a sparse distribution over the available experts. Common schemes keep only the top-k highest-scoring experts, or use a softmax gate trained to concentrate its weight on just a few of them, as in the sketch below.
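To make this concrete, here’s a minimal sketch of a top-k gated MoE layer in PyTorch. The class name, dimensions, and the simple per-expert loop are illustrative choices of mine, not taken from Switch Transformers or GPT-4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k gated Mixture of Experts layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gating network scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.gate(x)                                  # (tokens, experts)
        weights = F.softmax(scores, dim=-1)                    # dense distribution over experts
        top_w, top_idx = weights.topk(self.top_k, dim=-1)      # keep only the top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)        # renormalize the kept weights

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                      # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only `top_k` of the `num_experts` expert networks run for any given token, so compute per token scales with `top_k` while the parameter count scales with `num_experts`; that’s the sense in which the ensemble is sparsely activated. Real implementations batch tokens per expert and shard experts across devices instead of looping like this, which is exactly where the communication and routing overhead comes from.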
Getting the balance right is still tricky: you have to ensure that specific experts aren’t chosen too often while others sit idle.
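Switch Transformers addresses this with an auxiliary load-balancing loss that is minimized when tokens are spread uniformly across experts. A rough sketch in the same spirit, assuming you have the gate’s softmax probabilities and each token’s chosen expert index (the function name is mine):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs: torch.Tensor,
                        expert_index: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Auxiliary loss in the spirit of Switch Transformers.

    router_probs: (num_tokens, num_experts) softmax output of the gate.
    expert_index: (num_tokens,) index of the expert each token was routed to.
    """
    # f_i: fraction of tokens actually dispatched to each expert.
    dispatch_fraction = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    # P_i: average router probability assigned to each expert.
    mean_router_prob = router_probs.mean(dim=0)
    # The product is minimized when both distributions are uniform over experts.
    return num_experts * torch.sum(dispatch_fraction * mean_router_prob)
```

In practice this term is scaled by a small coefficient and added to the main training loss, nudging the gate toward an even spread without forcing it.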
Some other interesting facts and implications:
- GPT-4 reportedly costs about 10x as much as GPT-3.
- GLaM is Google’s 1.2T-parameter MoE model with 64 experts per layer.
- There’s also published analysis of unified scaling laws for routed language models.
- Ensembles were some of the most powerful models back when neural networks were first being deployed. Maybe that’s still the case.