The Dataset You Used Might Be Someone's Lawsuit
Adobe's legal troubles over AI training data reveal a troubling pattern: the open-source datasets developers rely on may carry hidden copyright liabilities that could surface years later.
The lawsuit against Adobe filed in December looks, at first glance, like another predictable chapter in tech's ongoing copyright wars. Author Elizabeth Lyon claims Adobe trained its SlimLM language model on pirated versions of her work. Adobe, like dozens of companies before it, will likely argue fair use, transformative purpose, the usual defenses. What makes this case worth your attention isn't the lawsuit itself—it's what the lawsuit reveals about the infrastructure we've built our AI ecosystem on.
Adobe didn't go hunting for pirated books. They used SlimPajama-627B, described by its creators at Cerebras as a "deduplicated, multi-corpora, open-source dataset." SlimPajama was derived from RedPajama. RedPajama contained Books3. Books3, as it turns out, is a collection of 191,000 books—many of them pirated—that has become something like toxic waste buried in the supply chain of modern AI development.
The Provenance Problem
Here's where it gets uncomfortable for developers. According to the lawsuit, Lyon's work appears in a "processed subset of a manipulated dataset." Adobe used a dataset that was created from another dataset that contained a third dataset—and somewhere in that chain of derivation, someone's intellectual property ended up as training data without permission or compensation.
This isn't a problem unique to Adobe. In September, according to NPR, Anthropic agreed to pay authors $1.5 billion to settle claims that it used pirated materials to train Claude—the largest copyright settlement in U.S. history. Apple faces a similar lawsuit filed in September over Apple Intelligence. Salesforce was sued in October. The pattern is clear: companies used what appeared to be legitimate open-source datasets, and those datasets contained Books3 or similar collections of questionable provenance.
The technical elegance of these datasets—deduplicated, processed, optimized—obscured a fundamental question that nobody wanted to ask too loudly: where did this data actually come from?
Fair Use Isn't a Magic Shield
Many developers I talk to assume fair use will protect AI training. The reasoning feels intuitive: the model learns patterns, it doesn't copy books wholesale, surely that's transformative enough. But the legal reality is more nuanced than that.
The U.S. Copyright Office released guidance in May 2025 suggesting that whether AI training constitutes fair use is "a matter of degree." According to Skadden's analysis of the report, when a model produces content that "shares the purpose of appealing to a particular audience" with the original work, that use is "at best, modestly transformative." The Copyright Office recommended developing licensing markets rather than relying on fair use as a blanket justification.
Meanwhile, in the Anthropic case, Judge Alsup found that while training Claude on copyrighted works might qualify as fair use, storing those books in a central library only qualifies if the copies were legally obtained in the first place. This distinction matters. It means the provenance of your training data—not just how you use it—could determine your legal exposure.
What This Means for Your Work
If you're building with AI or shipping products that incorporate models, the implications are direct. Adobe's SlimLM is described as "a small language model series optimized for document assistance tasks on mobile devices." It's exactly the kind of practical, narrowly scoped implementation that seemed like a safe bet—using established open-source datasets for legitimate business purposes. Yet here we are.
The lawsuit's chain of derivation—SlimPajama from RedPajama from Books3—suggests that dataset lineage is now something you need to audit. When you pull a model from Hugging Face or use a pre-training dataset, you're inheriting not just the technical artifacts but potentially the legal liabilities embedded in that data's history.
This isn't theoretical. According to TechCrunch, "such lawsuits have, by now, become somewhat commonplace." The Salesforce lawsuit alleges the company used "thousands of allegedly pirated books in training datasets" including RedPajama and The Pile. These are datasets that have been widely used across the industry, treated as standard resources.
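One way to make that kind of audit concrete: treat dataset derivation as a graph and walk it. The sketch below assumes a hand-maintained lineage map and denylist — both illustrative, not authoritative records of what these datasets actually contain.

```python
# A minimal sketch of a lineage audit: walk a hand-maintained derivation
# graph for each dataset and flag any ancestor on a denylist.
# LINEAGE and DENYLIST here are illustrative examples, not authoritative.

LINEAGE = {
    "SlimPajama-627B": ["RedPajama-1T"],
    "RedPajama-1T": ["Books3", "C4", "GitHub"],
    "The Pile": ["Books3"],
}

DENYLIST = {"Books3"}  # collections of contested provenance

def flagged_ancestors(dataset: str) -> set[str]:
    """Return every ancestor of `dataset` (or the dataset itself)
    that appears on the denylist."""
    seen, stack, flagged = set(), [dataset], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in DENYLIST:
            flagged.add(current)
        # Datasets with no recorded parents are treated as roots.
        stack.extend(LINEAGE.get(current, []))
    return flagged
```

Running `flagged_ancestors("SlimPajama-627B")` surfaces the Books3 ancestry two derivations up — exactly the kind of inherited risk that never appears on the dataset card you actually downloaded.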
The question isn't whether your company will face scrutiny over training data provenance. It's when.
The Hidden Costs
What strikes me about this moment is how it mirrors other infrastructure crises we've seen in software. Remember when everyone discovered that critical open-source libraries were maintained by one person in their spare time? Or when supply chain attacks revealed how little we knew about our dependencies?
This is that, but for data. We've built an enormous amount of AI capability on datasets whose provenance we didn't adequately verify. Some developers assumed that "open source" meant "legally safe." Others knew better but decided the risk was acceptable given industry norms. Now those norms are shifting, retroactively.
For developers, this creates several immediate concerns. Model selection now requires legal due diligence, not just technical evaluation. Data pipeline design must include provenance tracking and documentation. Compliance requirements are emerging that many engineering teams aren't equipped to handle. And perhaps most significantly, the models you deployed last year might carry legal risks you didn't account for in your threat modeling.
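Provenance tracking in a data pipeline doesn't have to be elaborate to be useful. A minimal version is a structured record, written at ingestion time, for every dataset a training run touches. The field names and values below are illustrative assumptions, not a standard schema.

```python
# A first-class provenance record for each dataset a training run uses.
# Field names and the example values are illustrative assumptions.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class DatasetProvenance:
    name: str
    source_url: str                      # where the artifact was obtained
    declared_license: str                # license claimed by the distributor
    derived_from: list[str] = field(default_factory=list)
    known_risks: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for storage alongside training artifacts."""
        return json.dumps(asdict(self), indent=2)

# Example record; the license and risk notes are placeholders, not
# a legal characterization of the actual dataset.
record = DatasetProvenance(
    name="SlimPajama-627B",
    source_url="https://huggingface.co/datasets/cerebras/SlimPajama-627B",
    declared_license="see dataset card",
    derived_from=["RedPajama-1T"],
    known_risks=["upstream Books3 content alleged in litigation"],
)
```

Writing this file next to your model checkpoints costs almost nothing; reconstructing the same information two years later, under subpoena, costs considerably more.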
What Responsible Practice Looks Like
I don't have clean answers here, and I'm suspicious of anyone who claims they do. This is genuinely uncertain legal territory, and the lawsuits still working through courts will shape what becomes standard practice. But some principles are emerging.
First, dataset provenance matters as much as dataset quality. Before using training data, understand its lineage. If it contains Books3, The Pile, or similar collections of ambiguous provenance, document that risk. Second, consider licensing. According to the Copyright Office's May report, developing licensing markets is their recommended path forward. Some companies are already negotiating directly with publishers and rights holders. Third, stay informed about your dependencies. If you're using pre-trained models, track which datasets they were trained on. When lawsuits identify problematic datasets, assess your exposure.
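The third principle — assessing exposure when a lawsuit names a dataset — presupposes an inventory mapping your deployed models to their training data. A sketch, assuming such an inventory exists (the model names and mappings here are hypothetical):

```python
# When a lawsuit names a dataset, the question becomes: which of our
# deployed models touched it? Assumes a maintained inventory mapping
# each deployed model to its recorded training datasets.
# Model names and mappings below are hypothetical.

MODEL_INVENTORY = {
    "doc-assistant-v2": ["SlimPajama-627B"],
    "search-ranker": ["C4"],
    "support-summarizer": ["The Pile", "internal-tickets"],
}

def models_exposed_to(dataset: str) -> list[str]:
    """List deployed models whose recorded training data includes `dataset`."""
    return sorted(
        model for model, datasets in MODEL_INVENTORY.items()
        if dataset in datasets
    )
```

Note that a lookup like this is only as good as the inventory behind it: if the inventory records "SlimPajama-627B" but not its Books3 ancestry, a query for the upstream dataset comes back empty. Combining the inventory with lineage data closes that gap.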
And finally—and this is harder to quantify—build institutional knowledge about these issues. Your legal team needs to understand how training data works. Your engineering team needs to understand copyright implications. The gap between those domains is where risk accumulates.
The Larger Pattern
The Adobe lawsuit is one data point, but the pattern it's part of is larger. Anthropic's $1.5 billion settlement wasn't just expensive—it was, as the Authors Guild noted when the settlement received preliminary approval in September, a signal that these claims have merit. The fact that similar cases are proceeding against Apple, Salesforce, and now Adobe suggests that courts take these arguments seriously.
For developers, this is the moment to get ahead of the curve rather than waiting for your company to become the next headline. Review your AI implementations. Document your training data sources. Have honest conversations with your legal and compliance teams about exposure. The dataset you used yesterday might be the liability that surfaces tomorrow.
Elizabeth Lyon's lawsuit against Adobe isn't just about one author's books or one company's language model. It's about an entire ecosystem built on assumptions about data that are now being tested in court. The open-source ethos that made rapid AI development possible collided with intellectual property law, and we're still figuring out what emerges from that collision.
What we do know: the era of treating training data provenance as someone else's problem is over. If you're building with AI, it's your problem now.