Geiger counters, spacecraft instruments, and sensitive particle detectors are built with steel produced before 1945. Why? Because steel made since then is contaminated with trace amounts of radioactive fallout from nuclear testing. That older steel is hard to find: it's usually salvaged from ships, either shipwrecks or vessels sent to scrap.
Today's datasets are being contaminated in a different way. AI models are trained on the Internet, and more and more of that content is itself generated by AI models. Because AI output is hard to detect reliably, finding training data untouched by AI will get tougher and tougher.
Already, workers on Mechanical Turk (an AWS service often used to farm out data-labeling tasks to humans) are using LLMs to complete those tasks.
Data generated by AI, labeled by AI, and used to train future AI: a cycle that will only accelerate as the models become more useful.
Is AI-generated data bad for training?
The train-generate feedback loop might amplify particular model (or human) characteristics. Human-written data, for example, likely contains more outliers, and some of that variety could be lost with each pass through a model. That could degrade model performance over time, or it could provide a nearly endless source of training tokens for future models.
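For intuition, here is a toy sketch of that feedback loop (my own illustration, not a claim about any real training pipeline): fit a simple model to data, generate the next "dataset" from the model, and repeat. The extreme outliers in the original data vanish almost immediately.

```python
# Toy train-generate loop: fit a Gaussian to the data, then sample the next
# "dataset" from the fit. The Gaussian cannot represent the heavy tails of
# the original data, so extreme outliers disappear after a single generation.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data with heavy tails (Student's t, df=3).
data = rng.standard_t(df=3, size=10_000)

for gen in range(5):
    mu, sigma = data.mean(), data.std()            # "train" a simple model
    extreme = int((np.abs(data) > 8).sum())        # count extreme outliers
    print(f"gen {gen}: mean={mu:+.2f}  std={sigma:.2f}  |x|>8 count={extreme}")
    data = rng.normal(mu, sigma, size=10_000)      # "generate" the next dataset
```

The degradation scenario rests on the same intuition: whatever a model fails to capture is gone from every later generation.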
But the loop could also yield higher-quality data. Has the rise of automated spell-checkers and grammar assistants made the latest models worse or stymied progress? The effect might even cut the other way: human-only datasets (e.g., Reddit) could become less valuable, full of the mistakes, biases, and other things we don't want our future models to capture.