The copyright conundrum in the age of artificial intelligence
Generative AI pits innovation against intellectual property, but practical solutions remain elusive.
The rise of generative AI tools has reignited longstanding debates about copyright law, ownership, and innovation. In a recent podcast, Pamela Samuelson, Richard M. Sherman Distinguished Professor of Law at UC Berkeley, delved into the intricate challenges posed by AI systems to existing intellectual property regimes. Samuelson, a pioneer in digital copyright and co-founder of the Authors Alliance, laid bare the practical difficulties facing regulators, creators, and AI developers alike.
At the heart of the issue lies the question of data provenance and transparency. Generative AI models are typically trained on vast datasets, often comprising billions of works scraped from the internet. Many policymakers, particularly in Europe under the proposed AI Act, are pushing for mandatory disclosure of copyrighted works used in training datasets. Yet, as Samuelson argues, such measures assume an overly simplistic view of the AI landscape.
The problem of scale and feasibility
AI training datasets are colossal, often incorporating publicly available internet data. Major corporations like Google and Meta may be able to comply with stringent transparency rules, but Samuelson highlights that AI development extends far beyond Silicon Valley giants. Small startups, non-profits, and even independent researchers depend on open-source datasets, such as Common Crawl, to build their models. Requiring them to retain and disclose precise records of every data source would be impractical, and would stifle competition and innovation.
Furthermore, the training process itself complicates matters. AI models do not reproduce copyrighted works; they tokenize data into abstract numerical representations—akin to disassembling a LEGO battleship and using the bricks to build an Eiffel Tower. The input data ceases to exist in a recognizable form, rendering claims of direct copyright infringement tenuous at best. As Samuelson explains, generative AI “learns” patterns from datasets rather than replicating the underlying content, drawing parallels to Renaissance artists studying hands to improve their craft.
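The tokenization Samuelson describes can be illustrated with a toy sketch. The code below is a deliberately simplified, hypothetical example (no real tokenizer works this crudely): it maps words to arbitrary integer IDs, showing how the training pipeline operates on abstract numbers rather than on the work in its original form.

```python
# Toy illustration of tokenization: text becomes a sequence of
# integer IDs. A simplified sketch, not any production tokenizer.

def build_vocab(corpus):
    """Assign an integer ID to each unique word seen in the corpus."""
    vocab = {}
    for text in corpus:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text into a list of numeric token IDs."""
    return [vocab[w] for w in text.lower().split()]

corpus = ["the quick brown fox", "the lazy dog"]
vocab = build_vocab(corpus)
print(tokenize("the quick dog", vocab))  # prints [0, 1, 5]
```

Real systems use subword tokenizers and far larger vocabularies, but the point stands: what the model consumes is a stream of numbers, not a facsimile of the source work.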

Licensing: a difficult sell
Collective licensing has been touted as a potential solution to compensate authors whose works are used in AI training. Europe, with its robust history of collective licensing for music and publishing, views this mechanism as viable. However, Samuelson outlines why this approach falters in the AI context.
The sheer volume of data—billions of works, many with negligible commercial value—makes calibrating payments nearly impossible. Imagine a collecting society attempting to distribute fractions of cents to millions of authors; the administrative costs would likely outweigh the actual payouts. Additionally, licensing regimes presuppose a clear distinction between inputs and outputs, but AI models often discard the training datasets after learning, further complicating compensation claims.
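A back-of-envelope calculation makes the scale problem concrete. The figures below are entirely hypothetical (the article cites no numbers); they merely show that once a fixed pool is divided across billions of works, the per-work payout can fall below any plausible cost of processing a payment.

```python
# Hypothetical figures only -- chosen to illustrate the scale argument,
# not drawn from the article or any real licensing scheme.
licensing_pool = 100_000_000      # $100M hypothetical annual pool
works_in_dataset = 3_000_000_000  # "billions of works", per the article
admin_cost_per_payment = 0.50     # hypothetical cost to process one payout

payout_per_work = licensing_pool / works_in_dataset
print(f"Per-work payout: ${payout_per_work:.4f}")
print(f"Admin cost exceeds payout: {admin_cost_per_payment > payout_per_work}")
```

Under these assumptions each work earns about three cents a year, an order of magnitude less than the assumed cost of sending the payment, which is the administrative dead-weight Samuelson points to.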
More fundamentally, mandating licenses for internet-crawled data risks setting a dangerous precedent. For years, web crawling has operated within legal boundaries, underpinning innovations like search engines. A sudden shift to mandatory licensing could render commonplace practices retroactively unlawful, creating uncertainty for developers and chilling technological progress.
The question of authorship
On the output side, AI-generated works raise questions about authorship. Can AI be recognized as the author of a creative work? Samuelson unequivocally dismisses this notion. U.S. copyright law, she explains, requires human creativity as a prerequisite for protection—a principle reaffirmed by the Supreme Court. However, she acknowledges edge cases: if a human provides detailed prompts and iteratively refines an AI-generated work, the resulting output might meet the threshold for authorship.
This distinction is particularly salient for industries like film and music, where computer-generated content has long coexisted with human creativity. Hollywood studios, for instance, leverage CGI to enhance visual storytelling, but still claim copyright over the final product. As Samuelson notes, rigid policies that disqualify AI-assisted works risk undermining industries that have seamlessly integrated technology into the creative process.

Towards a balanced framework
Samuelson’s insights underscore the need for nuanced, practical regulations that reflect the realities of AI development. While transparency and compensation are valid concerns, solutions must balance the interests of creators with the imperative to foster innovation. Excessive regulation risks entrenching incumbents and marginalizing new entrants, stifling the very competition that drives technological advancement.
Europe’s AI Act may offer a glimpse of what’s to come: a baseline transparency requirement that stops short of crippling compliance burdens. Yet, as Samuelson cautions, policymakers must resist the temptation to anthropomorphize AI or impose solutions better suited to traditional industries.
Generative AI represents a transformative leap in technology—a tool that, much like the printing press or photography, will reshape creative industries. Rather than seeing AI as a threat, Samuelson advocates for recognizing its potential to empower human creators. The task for regulators, then, is to craft policies that encourage innovation while ensuring creators are fairly valued in this new digital era.
Catch the whole podcast episode or check out the transcript.
