What's actually used to train these large models? A brief look at some of the datasets involved.
LAION-5B
Stable Diffusion was trained on a dataset called LAION-5B ("Large-scale Artificial Intelligence Open Network"), which comprises 5.85 billion image-text pairs crawled from the internet. The underlying crawled data comes from Common Crawl.
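The released dataset is essentially metadata: rows pairing an image URL with its alt-text caption, distributed as parquet files. A minimal sketch of inspecting one such shard, assuming a local parquet file with URL and TEXT columns (the file name and column names here are assumptions):
```python
import pandas as pd

# Hypothetical local shard of LAION-style metadata; the real dataset is split
# across many parquet files, and the URL/TEXT column names are an assumption.
df = pd.read_parquet("laion_metadata_shard.parquet")

# Each row is one image-text pair: a link to the image plus its caption.
for url, caption in df[["URL", "TEXT"]].head(5).itertuples(index=False):
    print(f"{caption[:60]!r} -> {url}")
```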
Common Crawl
Common Crawl contains roughly 3.15 billion pages, about 380 TB of data. OpenAI's GPT-3 was trained, in part, on data from Common Crawl. Common Crawl is a non-profit founded by Gil Elbaz in 2007 (Elbaz also founded Applied Semantics, which Google acquired in 2003 for $102 million and which later became AdSense).
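The crawl itself is published as WARC files with a CDX index for looking up captures by URL. A minimal sketch of querying that index with Python's requests library, assuming the CC-MAIN-2023-50 crawl (any published crawl ID works):
```python
import json
import requests

# Query the Common Crawl CDX index for captures of a URL.
# The crawl ID is an assumption; substitute any published crawl.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
resp = requests.get(INDEX, params={"url": "example.com", "output": "json"}, timeout=30)
resp.raise_for_status()

# The index returns one JSON object per line, each describing a capture:
# which WARC file it lives in, plus the byte offset and length of the record.
for line in resp.text.strip().splitlines():
    record = json.loads(line)
    print(record["filename"], record["offset"], record["length"])
```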
The Pile
The Pile is a set of 22 smaller datasets used to train GPT-J (a sketch of how the mixture is sampled follows the list):
- A filtered subset of Common Crawl
- PubMed Central
- "Books3", a collection of ebooks downloaded from Bibliotik
- OpenWebText2 – scraped URLs from Reddit with a score of 3 or higher
- ArXiv
- GitHub
- FreeLaw
- Stack Exchange
- USPTO Backgrounds
- PubMed Abstracts
- Gutenberg
- OpenSubtitles
- Wikipedia
- DM Mathematics
- Ubuntu IRC
- BookCorpus2 – a set of 18k books from "somewhere"
- EuroParl
- Hacker News
- YouTube Subtitles
- PhilPapers
- NIH ExPorter
- Enron Emails
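Each of these components is given its own sampling weight when the Pile is assembled, so training sees some sources more often than others. A minimal sketch of that kind of weighted mixing; the weights below are rough illustrative values, not the exact proportions from the Pile paper:
```python
import random

# Illustrative weights only; the actual proportions are defined in the Pile paper.
mixture = {
    "pile_cc": 0.18,
    "pubmed_central": 0.14,
    "books3": 0.12,
    "openwebtext2": 0.10,
    "other_subsets": 0.46,
}

def sample_source(rng: random.Random) -> str:
    """Pick which sub-dataset the next training document comes from."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```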
 
GPT-3 dataset
The book corpora used aren't specified in the GPT-3 paper, most likely because they come from gray-hat sources like Bibliotik.
- Common Crawl
- OpenWebText2
- Books1 (most likely Gutenberg)
- Books2 (most likely BookCorpus?)
- Wikipedia
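What the paper does report is the approximate sampling mix used during training (roughly 60% filtered Common Crawl, 22% WebText2, 8% Books1, 8% Books2, 3% Wikipedia) and a budget of about 300 billion training tokens. A back-of-the-envelope sketch of what that mix implies per source, treating those rounded figures as given:
```python
# Approximate sampling mix from the GPT-3 paper (the published values are
# rounded, so they don't sum to exactly 1); 300B tokens is the stated budget.
mix = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}
total_tokens = 300e9

# Normalize the rounded weights, then estimate tokens drawn from each source.
norm = sum(mix.values())
for name, weight in mix.items():
    print(f"{name:>22}: ~{weight / norm * total_tokens / 1e9:.0f}B tokens")
```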