BRAVE NEW WTF: When AI Is Trained on AI-Generated Data, Strange Things Start to Happen.
Whether it sticks in the long term remains to be seen, but for now generative AI seems to be cementing its place in our digital and real lives. And as it becomes increasingly ubiquitous, so does the synthetic content it produces. But in an ironic twist, those same synthetic outputs might also be generative AI’s biggest threat.
That’s because the growing generative AI economy is underpinned by human-made data. Generative AI models don’t just cough up human-like content out of thin air; they’ve been trained to do so using troves of material that was actually made by humans, usually scraped from the web. But as it turns out, when you feed synthetic content back to a generative AI model, strange things start to happen. Think of it like data inbreeding, leading to increasingly mangled, bland, and all-around bad outputs. (Back in February, Monash University data researcher Jathan Sadowski described it as “Habsburg AI,” or “a system that is so heavily trained on the outputs of other generative AI’s that it becomes an inbred mutant, likely with exaggerated, grotesque features.”)
It’s a problem that looms large. AI builders are continuously hungry to feed their models more data, which is generally being scraped from an internet that’s increasingly laden with synthetic content. If there’s too much destructive inbreeding, could everything just… fall apart?
Yes.
On the flip side, though, is this: AI’s real problem is that it’s boring.
Indeed, anyone who has spent any time playing with ChatGPT can tell you as much.