Will models perform better with more parameters (more compute) or with more data? What is the current constraint on model performance – data or compute?
For the last few years, AI research has centered on throwing more compute at the problem. In 2020, OpenAI published a paper, Scaling Laws for Neural Language Models, which showed that scaling up model size had better returns than adding more data.
Companies raced to increase the number of parameters in their models. GPT-3, released a few months after the paper, contains 175 billion parameters (model size). Microsoft released DeepSpeed, a deep learning optimization suite that could (theoretically) handle trillions of parameters.
More recently, in 2022, DeepMind showed that model size and the number of training tokens should be scaled in roughly equal proportion – Training Compute-Optimal Large Language Models (2022).
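To make that allocation concrete, here is a minimal sketch of the rule of thumb that falls out of the Chinchilla analysis: training compute is roughly C ≈ 6·N·D FLOPs (N = parameters, D = training tokens), and the compute-optimal recipe works out to roughly 20 tokens per parameter. Both constants are approximations, and the function below is illustrative only, not DeepMind's actual fitting procedure.

```python
# Rough compute-optimal allocation under the Chinchilla rule of thumb.
# Assumptions (approximate, not exact): training compute C ~ 6 * N * D FLOPs,
# and the optimal data-to-model ratio is ~20 tokens per parameter.

def compute_optimal_split(flops_budget, tokens_per_param=20, flops_per_param_token=6):
    """Return (params, tokens) that roughly exhaust a given FLOPs budget."""
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N^2  =>  N = sqrt(C / 120)
    n_params = (flops_budget / (flops_per_param_token * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly a GPT-3-scale budget: 6 * 175e9 params * 300e9 tokens ≈ 3.15e23 FLOPs.
    budget = 6 * 175e9 * 300e9
    params, tokens = compute_optimal_split(budget)
    print(f"budget: {budget:.2e} FLOPs")
    print(f"compute-optimal params: {params/1e9:.0f}B")   # ~51B
    print(f"compute-optimal tokens: {tokens/1e12:.2f}T")  # ~1.02T
```

Under these assumptions, a GPT-3-sized compute budget would be better spent on a model of roughly 50B parameters trained on about 1T tokens, rather than 175B parameters on 300B tokens – which is the direction DeepMind took with Chinchilla below.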
GPT-3 was trained on roughly 300B training tokens (data size). Other models at the time were also trained on roughly 300B tokens. Why? Probably because that's what GPT-3 did.
DeepMind tried to distribute the compute and data more evenly. It created a new LLM called Chinchilla, with only 70 billion parameters, trained on 1.4 trillion tokens.
It beat every other model trained on roughly 300B tokens, no matter how many parameters those models contained – including much larger ones such as Gopher (280B parameters) and Megatron-Turing NLG (530B parameters).
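A quick back-of-the-envelope comparison of the two recipes, using only the figures quoted above (the 6·N·D estimate is the same approximation as before):

```python
# Tokens-per-parameter and approximate training FLOPs (C ~ 6 * N * D)
# for GPT-3 and Chinchilla, using the figures quoted above.
models = {
    "GPT-3":      {"params": 175e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9,  "tokens": 1.4e12},
}

for name, m in models.items():
    ratio = m["tokens"] / m["params"]      # tokens per parameter
    flops = 6 * m["params"] * m["tokens"]  # rough training compute
    print(f"{name:11s} {ratio:5.1f} tokens/param  ~{flops:.2e} FLOPs")
```

The compute budgets land in the same ballpark (a few times 10^23 FLOPs); the difference is that Chinchilla spends its budget at about 20 tokens per parameter, versus roughly 2 for GPT-3.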
So it seems like data is the constraint – adding more data will give you more bang for your buck. At least for now.