AI Models Can't Hide: Finetuning Exposes the Copyright Cracks
Finetuning exposes an industry-wide vulnerability: large language models can be made to recreate copyrighted texts nearly verbatim. The flaw challenges claims of data protection and reshapes the legal landscape for AI developers.
In the tug-of-war between AI technology and copyright law, a new contender has entered the ring: finetuning. Recent research shows that finetuning can force large language models (LLMs) to regurgitate copyrighted texts, challenging the industry's assurances of data protection. This isn't just an oversight. It's a fundamental flaw that calls into question the promises developers have made to courts and regulators.
The Vulnerability of Finetuning
Let's break it down. Companies like OpenAI, Anthropic, and DeepMind have long claimed their models don't store training data. They've also touted safeguards such as Reinforcement Learning from Human Feedback (RLHF) and output filters to prevent verbatim reproduction of copyrighted material. Yet when models like GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 were finetuned, they recreated 85-90% of withheld copyrighted books from semantic cues alone, with verbatim sequences exceeding 460 words. This isn't a minor slip-up. It's a gaping hole in the industry's safeguards.
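To make those numbers concrete, here is a minimal sketch of how verbatim reproduction might be quantified, assuming you have a model's output and the reference book as plain text. The function names and the 50-word threshold are illustrative choices, not the researchers' actual pipeline:

```python
# A hedged sketch: word-level verbatim overlap between model output and a book.
# Everything here is illustrative, not the paper's methodology.
from difflib import SequenceMatcher

def longest_verbatim_run(generated: str, reference: str) -> int:
    """Length, in words, of the longest word-for-word match."""
    gen_words = generated.split()
    ref_words = reference.split()
    matcher = SequenceMatcher(None, gen_words, ref_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(ref_words))
    return match.size

def coverage(generated: str, reference: str, min_run: int = 50) -> float:
    """Fraction of the reference covered by verbatim runs of >= min_run words."""
    gen_words = generated.split()
    ref_words = reference.split()
    matcher = SequenceMatcher(None, gen_words, ref_words, autojunk=False)
    covered = sum(b.size for b in matcher.get_matching_blocks() if b.size >= min_run)
    return covered / max(len(ref_words), 1)

# A 460+ word run, as reported for the finetuned models, would surface as
# longest_verbatim_run(output, book_text) >= 460.
```

Word-level matching like this is a blunt instrument, but it is enough to show why a 460-word verbatim run is not a statistical accident: it is stored text resurfacing.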
Industry-Wide Implications
The research goes further, demonstrating that when finetuned on the works of a single author, such as Haruki Murakami, these models could recall verbatim text from more than 30 unrelated authors. This isn't an isolated incident. It's a pattern. Finetuning on random author pairs or public-domain texts produced similar extraction levels, while finetuning on synthetic texts did not. This indicates that such finetuning reactivates pre-existing memorization rather than creating it.
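For intuition, the experimental grid implied above might be sketched like this. The corpus filenames, and the `finetune` and `generate` callables, are hypothetical placeholders for whatever provider API is in play, not real endpoints; `longest_verbatim_run` is the helper from the earlier sketch:

```python
# A hedged sketch of the probe design: finetune on one condition's corpus,
# then ask the finetuned model to continue passages from authors it was
# never finetuned on. All names and files here are invented placeholders.
CONDITIONS = {
    "single_author": ["murakami_corpus.txt"],
    "random_pairs": ["author_a.txt", "author_b.txt"],
    "public_domain": ["gutenberg_sample.txt"],
    "synthetic": ["llm_generated_prose.txt"],  # the control that did NOT trigger extraction
}

def run_probe(finetune, generate, probes):
    """probes: list of (author, opening_passage, reference_text) for unrelated authors."""
    results = {}
    for name, corpus in CONDITIONS.items():
        model = finetune(corpus)  # placeholder for a provider finetuning call
        results[name] = [
            longest_verbatim_run(generate(model, opening), reference)
            for _author, opening, reference in probes
        ]
    return results
```

The telling comparison is between the first three conditions and the last: if extraction spikes on real human prose but stays flat on synthetic prose, the memorization was already in the weights.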
It turns out that three major models memorize the same books in the same regions, with a correlation coefficient greater than 0.90. This suggests a shared vulnerability across the industry. Are we witnessing a convergence of security failures in the AI landscape that the industry can't easily dismiss?
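That figure is a correlation across per-region memorization scores. A minimal, self-contained sketch of the comparison, with scores invented purely for illustration:

```python
# A hedged sketch: Pearson correlation of per-region memorization profiles
# (e.g. verbatim coverage per chapter) between two models. Data is made up.
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

model_a = [0.91, 0.12, 0.88, 0.05, 0.79]  # illustrative per-region scores
model_b = [0.89, 0.15, 0.85, 0.09, 0.81]
print(pearson(model_a, model_b))  # > 0.90 would indicate shared memorized regions
```

A correlation that high across independently trained models points to a common cause, most plausibly overlapping training corpora, rather than coincidence.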
Legal and Ethical Crossroads
These findings pose significant questions. If models are storing and reproducing copyrighted works, does this undermine recent fair use rulings? Courts have previously sided with AI developers, assuming their measures against reproduction were sufficient. Yet, finetuning reveals these measures may not be as foolproof as advertised.
For AI developers, this is more than a technical glitch. It's an ethical and legal quagmire. The overlap between AI capability and copyright law keeps growing, and at that intersection lies the need for solutions that protect both creators' rights and technological progress. Will the industry respond with genuine fixes or mere lip service?
The stakes are too high for complacency. The growing autonomy of AI models demands accountability and innovation, not only in how they learn but also in how they remember.