The AI Flood: Preserving the Internet's Original Human Data
The internet as we knew it is undergoing a seismic shift. Since the advent of sophisticated AI like ChatGPT in late 2022, a deluge of AI-generated content has swept across blogs, search engines, and social media. This digital transformation has prompted some researchers to embark on a crucial mission: preserving human-generated content from 2021 and earlier. They liken this effort to the historic salvaging of "low-background steel," a material prized for its purity in a world changed by new technology.
The Nuclear Age Echo: Understanding Low-Background Steel
In the aftermath of World War II, scientists faced an unusual challenge: steel produced after 1945 was subtly contaminated. Atmospheric nuclear testing had laced the air with radionuclides, and because steelmaking blows large volumes of atmospheric air through the molten metal, that radioactivity ended up in the steel itself. This contamination rendered most new steel unsuitable for highly sensitive equipment like Geiger counters. The ingenious solution was to recover pre-war steel from battleships sunk deep in the ocean, shielded from fallout. This "low-background steel" became invaluable for its uncontaminated nature.
Today, a similar scenario is playing out in the digital domain. The internet is increasingly saturated with content not crafted by human hands, but synthesized by artificial intelligence. Much like radiation, this AI-generated material is often difficult for the average person to detect, is widespread, and fundamentally alters the digital environment.
The AI Training Dilemma: Model Collapse
This proliferation of synthetic content presents a significant problem, particularly for AI researchers and developers. AI models learn from vast datasets scraped from the web. Historically, this meant learning from the rich tapestry of human expression—data that was messy, insightful, biased, poetic, and sometimes brilliant. However, if current AI models are trained on content generated by previous AI, which in turn learned from even earlier AI-generated text, a phenomenon known as "model collapse" can occur.
Imagine photocopying a photocopy repeatedly. Each new copy becomes fainter, losing detail and clarity. Similarly, AI models trained on their own outputs risk becoming less original, less nuanced, and less connected to genuine human thought. The unique, the quirky, and the truly novel can get lost in this recursive loop.
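The photocopy-of-a-photocopy effect can be made concrete with a minimal toy simulation (an illustration only, not any real training pipeline): each "model generation" re-estimates word frequencies purely from the previous generation's sampled output. Any word that happens to draw zero samples vanishes permanently, so diversity can only shrink. All names and parameters below are invented for the sketch.

```python
import random
from collections import Counter

random.seed(42)

# A toy "human" corpus: common words dominate, rare words carry the diversity.
vocab = [f"w{i}" for i in range(50)]
weights = [1.0 / (i + 1) for i in range(50)]  # Zipf-like frequencies

def sample_corpus(vocab, weights, n=500):
    """Draw n tokens from the current frequency estimate."""
    return random.choices(vocab, weights=weights, k=n)

corpus = sample_corpus(vocab, weights)

# Each generation "trains" only on the previous generation's output.
for generation in range(20):
    counts = Counter(corpus)
    vocab = list(counts)                   # words never sampled are gone for good
    weights = [counts[w] for w in vocab]   # frequencies re-estimated from AI output
    corpus = sample_corpus(vocab, weights)

print(len(set(corpus)))  # distinct words surviving out of the original 50
```

Run repeatedly with different seeds and the surviving vocabulary tends to shrink generation over generation: the rare, "quirky" words are the first casualties, which is the intuition behind model collapse.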
Will Allen, a vice president at Cloudflare, emphasizes the growing importance of human-generated content from before 2022. This data, he argues, grounds AI models—and society—in a shared, verifiable reality. This grounding is especially critical as AI models are deployed in technical fields such as medicine, law, and finance. "The data that has that connection to reality has always been critically important and will be even more crucial in the future," Allen stated. "If you don't have that foundational truth, it just becomes so much more complicated."
Real-World Examples: The Search for Authenticity
This isn't merely a theoretical concern; practical issues are already emerging. Venture capitalist Paul Graham shared an experience nearly a year after ChatGPT's launch. While searching online for instructions on setting a pizza oven's temperature, he found himself scrutinizing content dates to find information published before the flood of "AI-generated SEO-bait," as he described it on X.
Malte Ubl, CTO of AI startup Vercel and a former Google Search engineer, responded to Graham, noting that he was essentially filtering the internet for content that was "pre-AI-contamination." Ubl drew the same analogy: "The analogy I've been using is low background steel, which was made before the first nuclear tests." Matt Rickard, another ex-Google engineer, echoed this sentiment in a blog post, warning that modern datasets are becoming increasingly contaminated as AI models are trained on AI-generated internet content, making unmodified data harder to find.
Preserving Digital Purity: The Modern Low-Background Steel
Some technologists advocate for preserving digital equivalents of low-background steel: human-generated data from the pre-AI boom era. This data represents the internet's authentic bedrock, created by people with genuine intent and context.
John Graham-Cumming, Cloudflare's CTO and a board member, is one such preservationist. His project, LowBackgroundSteel.ai, is dedicated to cataloging datasets, websites, and media that existed before 2022. One example is GitHub's Arctic Code Vault, an archive of open-source software captured in February 2020. Graham-Cumming's initiative seeks to archive content reflecting the web in its raw, human-authored state, free from LLM-generated filler.
Another resource he highlights is "wordfreq," a project by linguist Robyn Speer to track online word frequencies. Speer ceased updates in 2021, stating in a 2024 GitHub update, "Generative AI has polluted the data." This pollution skews internet data, making it a less reliable reflection of human writing and thought. For instance, an analysis cited by Speer showed ChatGPT's unusual affinity for the word "delve," causing its online frequency to spike unnaturally, a pattern not seen in human usage. A more recent observation is ChatGPT's fondness for the em dash.
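The kind of skew Speer describes can be spotted with a simple rate comparison: count a word's occurrences per million tokens in a pre-2022 corpus versus a post-2022 one and look for an unnatural jump. The sketch below uses tiny invented token lists purely as stand-ins; it is not wordfreq's actual methodology or data.

```python
from collections import Counter

def per_million(tokens, word):
    """Occurrences of `word` per million tokens in the corpus."""
    counts = Counter(tokens)
    return counts[word] / len(tokens) * 1_000_000

# Hypothetical token lists standing in for pre- and post-2022 web text.
pre_2022 = ["we", "will", "explore", "the", "data"] * 200 + ["delve"]
post_2022 = ["we", "will", "delve", "into", "the", "data"] * 200

print(per_million(pre_2022, "delve"))   # low human baseline
print(per_million(post_2022, "delve"))  # sharp, AI-driven spike
```

A large ratio between the two rates for a word like "delve" is the kind of anomaly that signals the corpus no longer reflects ordinary human usage.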
The Enduring Value of Our Shared Reality
Cloudflare's Allen acknowledges that AI tools can boost productivity and reduce tedium; he is himself a regular user of various chatbots. The low-background steel analogy isn't perfect either, since scientists have since developed ways to produce sufficiently pure steel without salvaging wrecks.
Nevertheless, Allen maintains, "you always want to be grounded in some level of truth." The implications extend beyond AI model performance; they touch the very fabric of our shared reality. Just as scientists relied on low-background steel for accurate measurements, we might soon depend on meticulously preserved pre-AI content to understand the authentic state of human thought, reasoning, and communication before machines began to mimic us so proficiently.
The unadulterated internet of the past may be gone, but thankfully, dedicated individuals are working to save copies. Like the divers who salvaged steel from the ocean depths, they remind us that preserving the past is perhaps the surest way to build a trustworthy future.
You can reach the original author via email at abarr@businessinsider.com.