AI learning from AI: the photocopy problem

There's a concern that surfaces regularly in AI discussions, usually framed as a photocopy analogy. AI-generated content fills the internet, new models are trained on that content, the output becomes a photocopy of a photocopy, and over time the quality declines into repetitive noise. The concern is real. The analogy is useful. The specific shape of the problem is more nuanced than the headline suggests, and getting the nuance right matters for what we do about it.

The phenomenon has a specific name in the research literature: model collapse. It has been studied in controlled experiments, and the result is consistent. Models trained on samples from their own output tend to lose distributional diversity. Rare words, unusual constructions, cultural references, and minority perspectives disappear first. The output becomes statistically flatter, more homogeneous, more predictable. In the experimental setting, this happens over a small number of generations and is quite dramatic.
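The mechanism is easy to see in a toy version. The sketch below is not taken from any of the published experiments: it assumes a "model" that is nothing more than a table of token frequencies, a Zipf-like starting vocabulary, and made-up sizes (`vocab_size`, `sample_size`, `generations` are all illustrative). Each generation samples a corpus from the current table and refits the table to that corpus alone.

```python
import random
from collections import Counter

def resample_generations(vocab_size=1000, sample_size=5000, generations=10, seed=0):
    """Repeatedly refit a frequency-table 'model' to samples drawn from itself.

    Returns the number of distinct tokens that still have non-zero
    probability after each generation (index 0 is the starting vocabulary).
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Zipf-like start: token k has weight 1/(k+1), so most of the
    # vocabulary sits in a long tail of rare tokens. (Illustrative assumption.)
    weights = [1.0 / (k + 1) for k in tokens]
    surviving = [vocab_size]
    for _ in range(generations):
        # "Generate" a corpus from the current model...
        corpus = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(corpus)
        # ...then "train" the next model on that corpus alone.
        # A token that was never sampled gets weight 0 and can never come back.
        weights = [counts.get(t, 0) for t in tokens]
        surviving.append(sum(1 for w in weights if w > 0))
    return surviving

if __name__ == "__main__":
    print(resample_generations())
```

The count of surviving tokens can only go down, and the tokens lost are the ones that were rare to begin with. Real language models are far more complicated than a frequency table, but the direction of the effect is the same one the experiments report.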

Why production is different from the experiments

The reason the experimental result hasn't translated directly into observable collapse in production models is that the major labs are aggressively filtering AI-generated content out of training sets. This is not an optional additional step for frontier model training - it is load-bearing infrastructure. OpenAI, Anthropic, Google, and Meta all employ large teams dedicated to provenance assessment, synthetic content detection, and dataset curation. The filters are not perfect. They're good enough that, measured on perplexity against held-out human-written text, frontier models have continued to improve release over release rather than collapse.
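For readers who haven't met the metric: perplexity is the exponential of the average negative log-probability a model assigns to each token of the held-out text, so lower is better. A minimal sketch of the standard definition follows; the function name and the choice of natural logs are assumptions made here, and nothing about how any particular lab runs its evaluations is implied.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probability of each held-out token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that assigns every token a probability of 0.25 has a perplexity of 4: it is, on average, as uncertain as a four-way guess. A collapsing model would show this number rising on human-written text, because it stops assigning probability to the rarer things humans actually write.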

The question isn't whether model collapse is a real phenomenon - the research establishes that cleanly. The question is whether the filtering can scale. Estimates suggest that within five years, the volume of AI-generated text on the web will exceed the volume of human-written text. At that point, filtering becomes genuinely adversarial - you're asking the filter to catch the output of generators that are themselves getting better at passing for human. The filters will keep improving. The generators will also keep improving. How this race resolves is one of the more interesting open questions in AI training infrastructure, and almost nobody outside the labs is watching it closely.

Who loses first

Even if large-scale collapse is avoided at the frontier, the phenomenon is not uniformly distributed. It's worst in the places that had the smallest amount of high-quality human content to begin with. Languages with fewer native speakers and smaller digital corpora - Welsh, Yoruba, Malay, Sámi - are at meaningfully higher risk than English. The volume of AI-generated content in those languages, relative to human content, hits concerning ratios much faster. The diversity loss that shows up after several generations in the English-language experiments may show up after one or two generations in smaller-language training data.
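The arithmetic behind "much faster" is simple enough to sketch. The function below assumes a fixed stock of human text and a constant yearly inflow of synthetic text, and asks how long it takes for synthetic text to reach a given share of the total. Both the model and every number used with it are hypothetical, chosen only to show the shape of the effect.

```python
def years_until_share(human_tokens, synthetic_per_year, threshold=0.5):
    """Years until synthetic text makes up `threshold` of a language's corpus.

    Assumes (hypothetically) a fixed human corpus and a constant synthetic
    inflow. Solves  s*t / (s*t + h) = p  for t, giving  t = p*h / ((1-p)*s).
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    if human_tokens <= 0 or synthetic_per_year <= 0:
        raise ValueError("token counts must be positive")
    return threshold * human_tokens / ((1 - threshold) * synthetic_per_year)

# Illustrative, made-up figures: a large-language corpus of 1e12 tokens
# gaining 1e11 synthetic tokens a year, versus a small-language corpus
# of 1e9 tokens gaining 1e9 synthetic tokens a year.
large = years_until_share(1e12, 1e11)
small = years_until_share(1e9, 1e9)
```

In this made-up case the smaller language receives a hundred times less synthetic text per year and still reaches parity ten times sooner, because its human corpus is a thousand times smaller.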

Cultural minorities inside larger language communities face a similar dynamic. Rare dialects, specific professional jargon, regional vocabularies, and long-tail knowledge areas all have limited human-generated training data, and that limited data is being diluted faster than the general case. The collapse story reads differently for a speaker of a dominant language in a dominant culture than for everyone else, and the uniform 'AI is losing quality' framing flattens out a distribution that's actually quite uneven.

The commercial response

The AI labs have noticed the problem, and the commercial response is visible in the market. Data partnerships are now the hottest category in AI deal-making. Licensing arrangements with news publishers, academic databases, professional content libraries, and government archives have been the subject of quiet multi-hundred-million-pound deals over the last two years. The logic is explicit: if filtering can't be relied on to catch every piece of AI-generated content, what you want is verifiably human content sources that are unlikely to be contaminated in the first place. Those sources command premium prices because the labs need them.

This creates a structural market pressure that wasn't previously there. Organisations that hold large, high-quality, verifiably human content archives are suddenly valuable in a way they weren't before. The list includes some unusual candidates: long-running discussion forums, academic publishers, specialist trade press, and even personal email archives have all been the subject of serious acquisition conversations. The incentive landscape has shifted in ways that the general AI discussion hasn't caught up with, and it's worth paying attention to as an indicator of which content sources the labs still consider safe.

The audience side of the problem

There's a separate but related effect worth naming: users are increasingly distrustful of AI-generated content even when they can't identify it reliably. The trust decline is steeper than the capability increase, which is an unusual pattern. In most historical technology transitions, user trust tracks capability. In AI content, user trust is declining even as the content quality improves, because the volume of low-effort AI output has trained users to discount anything that feels generic.

If this continues, the model-collapse problem becomes secondary to an audience-collapse problem. Content, regardless of source, is being consumed with less attention than a decade ago. The long-run effect of this on journalism, education, and technical documentation is hard to predict. The short-run effect is that the economic return to producing good content is declining for reasons that are only partly related to AI generation quality, and the people producing that content are correspondingly disincentivised to continue. That's a dynamic that feeds the generated-content problem in a second-order way.

What actually helps

Regulatory responses are starting to converge on mandatory provenance metadata for AI-generated content. If implemented broadly, this makes filtering substantially easier. Provenance metadata becomes the canonical 'is this AI' signal, and the adversarial-detection problem reduces to an enforcement problem. This is being discussed specifically because the collapse scenario is real enough to warrant policy response, even though it's slower-moving than the headline version suggests.

For individual practitioners, the useful actions are smaller and more immediate. If you write, keep writing - and keep writing in ways that carry personal and contextual signatures machines can't easily reproduce. If you're a publisher, pay attention to the provenance and integrity of your archives, because they're now strategic assets. If you're deploying AI in a product, treat content integrity as a first-class engineering concern rather than a future problem. The slow-moving collapse dynamic is patient, and the organisations that treat it as future-tense will find themselves reacting late.

The photocopy metaphor is apt. The quality of the photocopy chain depends on the quality of the original and the fidelity of the copying process. The labs are working hard on the fidelity. The originals - the supply of genuinely human, unique, high-quality content - are the part that's getting quieter, and the part most worth defending.
