Good data is the difference between Mistral’s LLMs and Llama: they share similar architectures but were trained on different datasets.
To train LLMs, you need data that is:
- Large — Training sufficiently large LLMs requires trillions of tokens.
- Clean — Noisy data reduces performance.
- Diverse — Data should come from different sources and different knowledge bases.
What does clean data look like?
You can de-duplicate data with simple heuristics. The most basic is removing exact duplicates at the document, paragraph, or line level. More advanced approaches look at the data semantically, identifying content that can be dropped because it’s better represented by higher-quality data elsewhere in the corpus. The exact-match case is sketched below.
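To make the document/paragraph/line distinction concrete, here is a minimal sketch of exact de-duplication using content hashes. The function names and the whitespace normalization are illustrative choices, not from any particular pipeline.

```python
import hashlib

def _fingerprint(text: str) -> str:
    # Normalize whitespace before hashing so trivially reformatted copies still collide.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def dedupe_exact(documents: list[str], level: str = "document") -> list[str]:
    """Remove exact duplicates at the document, paragraph, or line level."""
    seen: set[str] = set()
    deduped: list[str] = []
    for doc in documents:
        if level == "document":
            units = [doc]
        elif level == "paragraph":
            units = doc.split("\n\n")
        else:  # level == "line"
            units = doc.splitlines()

        kept = []
        for unit in units:
            fp = _fingerprint(unit)
            if fp in seen:
                continue  # exact duplicate: drop it
            seen.add(fp)
            kept.append(unit)

        if kept:
            deduped.append("\n\n".join(kept) if level == "paragraph" else "\n".join(kept))
    return deduped

# Example: the second document differs only in whitespace, so it is dropped.
corpus = ["The cat sat on the mat.", "The cat  sat on the mat.", "A different document."]
print(dedupe_exact(corpus))  # ['The cat sat on the mat.', 'A different document.']
```

Semantic or near-duplicate filtering would replace the exact hash with a fuzzier signature, but the bookkeeping looks much the same.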
The other dimension of clean data is converting various file types into something the LLM can easily consume, usually markdown. That’s why we’ve seen projects like nougat and donut convert PDFs, books, and LaTeX into formats better suited to LLMs. A lot of potential training data is still stuck in PDFs and other formats that are human-readable but not easily machine-readable.
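As a rough illustration of that conversion step, the sketch below shells out to the nougat CLI, assuming it is installed (e.g. via `pip install nougat-ocr`) and follows its documented `nougat <pdf> -o <output_dir>` usage; the directory names and helper function are hypothetical.

```python
import subprocess
from pathlib import Path

def pdfs_to_markdown(pdf_dir: str, out_dir: str) -> list[Path]:
    """Convert each PDF in pdf_dir to a markdown-like .mmd file via the nougat CLI."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumes the nougat CLI accepts `nougat <pdf> -o <dir>` as in its README.
        subprocess.run(["nougat", str(pdf), "-o", str(out)], check=True)
    return sorted(out.glob("*.mmd"))

# Hypothetical usage: the converted files can then be cleaned and de-duplicated like any other text.
# files = pdfs_to_markdown("raw_pdfs/", "converted_markdown/")
```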
Where does diverse data come from?
The surprising result of the GPTs’ success is that web text from the Internet is probably one of the most diverse datasets out there. It contains language usage and knowledge that aren’t found in many other corpora. That’s why models tend to perform so much better when they’re given more data from the web.