GPT-3 ushered in a new era of large language models (LLMs) that could generate human-like text. But GPT-3 didn't come from a company with a large, proprietary dataset. Instead, its training data consisted of (a sketch of the sampling mixture follows the list):
- 410 billion tokens from the public Common Crawl (60% weight)
- 19 billion tokens from WebText2, text scraped from outbound links in Reddit submissions with a score of at least 3 (22%)
- 12 billion tokens from "Books1" and 55 billion from "Books2" (8% each), which are probably books downloaded from the Library Genesis archive, a pirated ebook dataset
- 3 billion tokens from Wikipedia (3%)
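These percentages are sampling weights, not raw proportions: Common Crawl is downsampled relative to its size, while higher-quality sources like WebText2 and Wikipedia are sampled more often. A minimal sketch of how such a weighted mixture might be expressed, using only the numbers from the list above (not OpenAI's actual pipeline):

```python
import random

rng = random.Random(0)

# Approximate GPT-3 training mixture. Weights are the fraction of training
# examples drawn from each source, not the raw token share (they only sum to
# ~1 because of rounding in the published figures).
MIXTURE = {
    "Common Crawl": {"tokens_billion": 410, "weight": 0.60},
    "WebText2":     {"tokens_billion": 19,  "weight": 0.22},
    "Books1":       {"tokens_billion": 12,  "weight": 0.08},
    "Books2":       {"tokens_billion": 55,  "weight": 0.08},
    "Wikipedia":    {"tokens_billion": 3,   "weight": 0.03},
}

names = list(MIXTURE)
weights = [MIXTURE[n]["weight"] for n in names]

def sample_source() -> str:
    """Pick which dataset the next training document is drawn from."""
    return rng.choices(names, weights=weights, k=1)[0]

# Roughly 60% of sampled documents should come from Common Crawl.
draws = [sample_source() for _ in range(100_000)]
print({n: round(draws.count(n) / len(draws), 3) for n in names})
```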
OpenAI, the company behind GPT-3, hasn't released the curated datasets or the model weights, but all of them can be recreated without much trouble.
You only need about $12 million to train the model from scratch. And you don't even need to do that anymore.
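That figure is back-of-envelope. Using the common rule of thumb of ~6 FLOPs per parameter per training token, a rough sketch of the arithmetic (every hardware number below is an assumption, not OpenAI's actual setup):

```python
# Back-of-envelope cost to train a GPT-3-scale model. The ~6 * params * tokens
# FLOPs rule of thumb is standard; the throughput and price below are
# assumptions (V100-class GPUs at on-demand cloud rates), not OpenAI's setup.
params = 175e9   # GPT-3 parameter count
tokens = 300e9   # tokens seen during training, per the GPT-3 paper
flops = 6 * params * tokens               # ~3.15e23 FLOPs

sustained_flops_per_gpu = 30e12           # assumed sustained FLOP/s per GPU
price_per_gpu_hour = 3.00                 # assumed on-demand price, USD

gpu_hours = flops / sustained_flops_per_gpu / 3600
cost_usd = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,.0f} GPU-hours, ~${cost_usd / 1e6:.0f}M")
# ~2.9M GPU-hours and ~$9M under these assumptions; published estimates land
# anywhere from single-digit to low-double-digit millions depending on
# hardware, utilization, and pricing.
```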
There are plenty of open-source models to pick from. There's GPT-J from EleutherAI, a collective of independent researchers. Meta is open-sourcing OPT-175B. Stanford researchers are open-sourcing their model, Diffusion-LM.
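As a concrete example, GPT-J's weights are publicly downloadable. A minimal sketch using the Hugging Face transformers library and the EleutherAI/gpt-j-6B checkpoint (the weights are ~24 GB, so you need a large GPU or a lot of RAM and patience):

```python
# Sketch: text generation with an open checkpoint via Hugging Face transformers.
# Assumes `pip install transformers torch` and enough memory for GPT-J-6B.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```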
LLMs are quickly becoming a complement to the best services. And the best companies commoditize their complements.
How will companies compete in applications of LLMs where the training data is undifferentiated?
API providers like OpenAI can partner with companies that have access to differentiated data, e.g., GitHub (but why not do it themselves?).
However, the value often isn't in the data itself but rather in what happens around it: content aggregators that surface the best generated output, or developer tools augmented with LLMs.
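As a sketch of that last idea, here's a toy docstring suggester that wraps a hosted completion API. The endpoint and model name follow OpenAI's public completions API; treat both as assumptions to swap for whatever provider (or open model) you actually use:

```python
# Toy "developer tool augmented with an LLM": suggest a docstring for a
# function by calling a hosted completions endpoint. The URL, model name, and
# OPENAI_API_KEY environment variable are assumptions for illustration.
import os
import requests

def suggest_docstring(source_code: str) -> str:
    resp = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "text-davinci-002",
            "prompt": f"Write a one-line docstring for this Python function:\n\n"
                      f"{source_code}\n\nDocstring:",
            "max_tokens": 60,
            "temperature": 0.2,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"].strip()

print(suggest_docstring("def add(a, b):\n    return a + b"))
```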
Commoditization could also flip the balance of software and hardware: if LLMs (the data and the software) become commoditized, the hardware to train and serve them becomes the differentiator (commoditize your complement).