AI-Generated Slop Threatens Future AI Models
The Rising Tide of AI-Generated Slop
The swift rise of ChatGPT and similar generative AI tools has flooded the internet with low-quality content. This digital "slop" is now significantly hindering the development of new AI models, which rely on vast amounts of data for training. As AI-generated content mixes with human-created information, new models increasingly learn from and replicate second-hand AI output rather than original human writing.
The Danger of Model Collapse
If this cycle continues, AI development could suffer significantly. AI-generated content will decline in quality, poorly imitating the human-like output it is meant to produce, and the models themselves may become measurably less capable. The industry calls this phenomenon AI "model collapse."
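The feedback loop behind model collapse can be illustrated with a toy simulation. In this sketch (an illustration, not any real training pipeline), each "model" is simply a Gaussian fitted to its training set: generation 0 fits human data, and every later generation trains only on samples drawn from its predecessor. Statistical detail in the tails is lost at each step, and the distribution's spread tends to wither away.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

def train_generations(data, generations, sample_size):
    """Toy model collapse: each 'model' is a Gaussian fitted to its
    training set. Generation 0 fits the real data; every later
    generation fits only samples drawn from the previous model."""
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    history = [(mu, sigma)]
    for _ in range(generations):
        # Synthetic data produced by the previous generation's model
        synthetic = [random.gauss(mu, sigma) for _ in range(sample_size)]
        mu, sigma = statistics.mean(synthetic), statistics.stdev(synthetic)
        history.append((mu, sigma))
    return history

# "Human" data with genuine spread
real = [random.gauss(0.0, 1.0) for _ in range(500)]
history = train_generations(real, generations=500, sample_size=20)
print(f"generation   0 stdev: {history[0][1]:.4f}")
print(f"generation 500 stdev: {history[-1][1]:.2e}")
```

Running this shows the fitted standard deviation collapsing across generations: the later "models" confidently reproduce an ever-narrower slice of what the original data contained.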
The Quest for Clean Data: A Modern Dilemma
Consequently, data created before ChatGPT became widespread is now incredibly precious. A recent report by The Register draws a parallel to "low-background steel," steel manufactured before the first nuclear bomb detonations in 1945. Just as AI chatbots have contaminated the internet, nuclear explosions released radioactive particles that contaminated nearly all steel produced since, making modern steel unsuitable for certain sensitive scientific and medical uses. Notably, a primary source of this uncontaminated low-background steel today is World War I and World War II era battleships, such as the German fleet scuttled at Scapa Flow in 1919.
Maurice Chiodo, a researcher at the University of Cambridge, highlighted the significance of this old steel, calling the admiral's decision to scuttle the fleet "the greatest contribution to nuclear medicine in the world." Chiodo explained to The Register, "That enabled us to have this almost infinite supply of low background steel. If it weren't for that, we'd be kind of stuck. So the analogy works here because you need something that happened before a certain date." He further noted, "But if you're collecting data before 2022 you're fairly confident that it has minimal, if any, contamination from generative AI. Everything before the date is 'safe, fine, clean,' everything after that is 'dirty.'"
The Call for Clean Data and Fair Play
In a 2024 research paper, Chiodo argued for the necessity of "clean" data, crucial not only to prevent model collapse but also to maintain fair competition among AI developers. Without it, early AI companies that contributed to polluting the internet could unfairly benefit from exclusive access to purer training datasets. How imminent model collapse from data contamination actually is remains debated, but numerous researchers, Chiodo among them, have been raising concerns for years. Chiodo told The Register, "Now, it's not clear to what extent model collapse will be a problem, but if it is a problem, and we've contaminated this data environment, cleaning is going to be prohibitively expensive, probably impossible."
Current Impacts and Future Challenges
This problem is already evident in retrieval-augmented generation, or RAG, a technique that lets AI models pull in real-time internet data to supplement their training knowledge. However, that new data may itself be AI-generated, and studies indicate this can lead chatbots to produce more "unsafe" content. The issue also feeds into the larger debate about AI scaling, the practice of improving models by feeding them more data and compute. When OpenAI and others saw diminishing returns from new models in late 2024, some experts suggested scaling had hit a wall. If the available data is increasingly low quality, that wall will be even harder to overcome.
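The RAG pattern described above can be sketched in a few lines. This is a deliberately naive illustration, not any production system: retrieval here is simple word overlap (real systems use embeddings and a vector index), and the retrieved passages are stuffed into the prompt. The corpus, function names, and example documents are all invented for the sketch; the point is that whatever the retriever pulls in, contaminated or not, becomes the model's context.

```python
def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query.
    A real RAG system would use embeddings and a vector index."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, corpus):
    """Prepend retrieved passages to the question. If the corpus is
    itself AI-generated, the model's 'fresh' context inherits that
    contamination."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical mini-corpus standing in for live web data
corpus = [
    "Low background steel predates 1945 nuclear testing.",
    "Model collapse degrades AI trained on AI output.",
    "Unrelated note about cooking pasta.",
]
print(build_prompt("What is low background steel?", corpus))
```

The sketch makes the failure mode concrete: the model never checks where the retrieved text came from, so AI-generated pages rank and get injected exactly like human-written ones.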
Navigating Regulation and Industry Stance
Chiodo suggests that stronger regulation, such as mandatory labeling of AI-generated content, could help curb the pollution, though enforcement would be challenging. The AI industry's resistance to government oversight on issues like copyright could ultimately prove self-defeating. Rupprecht Podszun, co-author of the 2024 paper with Chiodo, told The Register, "Currently we are in a first phase of regulation where we are shying away a bit from regulation because we think we have to be innovative. And this is very typical for whatever innovation we come up with. So AI is the big thing, let it go and fine."