May 15, 2024

The news Monday that Google is planning to launch an update to its search engine that includes answers generated by artificial intelligence likely has news publishers, content creators and all kinds of destination websites panicking about a traffic collapse. News publishers may find they have plenty of allies in urging Congress to act and joining lawsuits against the Big Tech companies.

Still, The Washington Post predicts “carnage” as the old model — where human eyeballs look at content monetized by advertising — will end as people no longer go to particular websites for information.

“The days when it mattered whether a company was third or fourth on the Google search page are over because AI agents will scrape all of the web to get results,” said Toshit Panigrahi, co-founder of a licensing firm called TollBit, which is working with publishers to monetize their data.

Publishers have been bracing for this news for some time.

As they negotiate licenses with companies that have built large language models, they are starting to think about how to assign a dollar value to their news. There are three parts to this problem:

  1. Understanding what can be licensed
  2. Setting a price
  3. Getting the companies to agree to pay (possibly the hardest)

Publishers are trying to figure out which content is useful and how to tailor it to the new prompts, in a world that demands grounded data (i.e., answers anchored in verified source information) that is kept up to date. For example, the Axel Springer agreement with OpenAI isn’t just about selling the rights to use the Axel Springer archive; it requires Axel Springer to provide summaries, based on content in its publications, in response to ChatGPT prompts.

Pricing remains in question. Deals between news companies and OpenAI have been reported (including Le Monde, Financial Times and Axel Springer), but the amounts are generally not disclosed, so it’s hard for publishers to know exactly how they should value their offerings.

Publishers hope that data is becoming more valuable

In this new world, a number of publishers, their lawyers and startups are trying to figure out how to retain value. Reliable large language models need quality content, and it’s customary (or should be) for reputable companies to pay for training data, they argue. Indeed, The New York Times recently reported that OpenAI was so desperate for content that it scooped up YouTube data in violation of copyright rules.

“Data is getting more valuable. The large language models are running out of good data to scrape. This should put pricing in the hands of publishers,” said one founder whose licensing startup is still in stealth mode.

A handful of companies are emerging that hope to help publishers identify when their content has been used and charge for it. TollBit, which former CNN anchor and Facebook News lead Campbell Brown joined as a senior adviser in April, provides analytics and visibility to websites about which pages are being accessed by AI bots, and how often, to help inform their pricing and negotiations. Patronus, meanwhile, has launched its CopyrightCatcher tool, targeted at companies that want to know if they are using copyrighted text.

“There is a small window of opportunity. An industry that was at the mercy of social media algorithms and an oligopoly of search and social media is now seeing the balance shifting slightly,” said Panigrahi, who founded TollBit with Olivia Joslin, a fellow alumnus of restaurant software company Toast.

Thinking of generative AI as a powerful web scraper demystifies what it does, Panigrahi said. “They are really good at taking text and structuring it into answers.”

“How are you prepping for a world when the visitors are not human eyeballs but AI agents? We are building new infrastructure, a new class of product, to get an AI agent to start paying for your content,” Panigrahi said he tells publishers.

Hopes that publishers can make lots of money from licensing content are viewed with skepticism by those who say that words are just tokens, each carrying a small value, since billions of them are needed. “It’s like book royalties. The royalty paid to an author often changes depending on the volume sold. So the rate the author receives goes down as the volume goes up,” said media management consultant Ava Seave.
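Seave’s royalty analogy can be made concrete as a back-of-the-envelope sketch. The tier sizes and per-token rates below are purely hypothetical and not drawn from any reported deal; the point is only how quickly the effective rate falls as volume rises:

```python
# Hypothetical tiered per-token licensing schedule, analogous to
# book royalties where the rate falls as volume rises.
# All figures are illustrative, not from any reported deal.

TIERS = [
    (1_000_000_000, 0.000010),   # first 1B tokens at $0.00001 each
    (4_000_000_000, 0.000005),   # next 4B tokens at $0.000005 each
    (float("inf"), 0.000002),    # everything beyond at $0.000002 each
]

def license_fee(tokens: int) -> float:
    """Total fee for a given number of tokens under the tiered schedule."""
    fee, remaining = 0.0, tokens
    for tier_size, rate in TIERS:
        used = min(remaining, tier_size)
        fee += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return fee

# 10B tokens: 1B at $0.00001 + 4B at $0.000005 + 5B at $0.000002
print(f"${license_fee(10_000_000_000):,.0f}")  # $40,000
```

Under this illustrative schedule, 10 billion ingested tokens would earn only $40,000, a reminder of how small per-word values become at training scale.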

While AI companies might claim their use lies well within the fair use doctrine, that claim seems questionable. The doctrine was never intended to deal with AI gobbling up masses of data (all of the data from, say, a newspaper) and then regurgitating something that embeds that data without being transparent about what has been hoovered up. There have been instances in which AI models reproduced whole passages, with or without attribution, or with incorrect attributions.

In analyzing valuations, the old expression common in the early days of computer development comes to mind: garbage in, garbage out, meaning that the quality of any computer analysis depends on the quality of what is fed to it. If the facts that are analyzed are wrong, even if the analysis is correct, the conclusions will be wrong.

There is no way that something meaningful can emerge from data that is garbage; the most that can emerge are statements about how that data seems inconsistent with data that the AI training has swept up earlier. Thus, AI depends not just on data, but quality data, posing a further challenge to valuation: well-researched and fact-checked data, say from The New York Times, should, in any good system of valuation, have more value than the made-up data of Truth Social.

Some argue that it’s absurd for companies to build large language models without paying for training data. Others reply that once you start paying for data, where do you stop? It’s just not how the internet works.

And who is to determine whether the payments should go to publishers or to the journalists themselves? This question has already surfaced in Brazil, where musicians asked for a share of residuals, and in Belgium, where European copyright directives call for creators to be paid but publishers’ negotiations with Google make no provision for them. Publishers feel strongly that they incur the costs and risks, and that payments should therefore go to them.

To a large extent, these arguments are distractions that benefit those who are resisting paying for quality information. Such issues already arise, say, in the payment for music, where both the artists and their publishers have worked out arrangements.

How publishers could value content

There are many ways to calculate the dollar amounts that should be paid, drawing on well-defined and widely accepted principles.

Haaris Mateen, assistant professor at the University of Houston, said that the share price movements of tech companies with stakes in generative AI may give some indication of how the market values large language models. But that just gives a broad sweep of potential revenues and profit. It doesn’t provide a formula to establish the value of news articles or other words used as inputs.

Jeremy Gilbert, the Knight Chair in Digital Strategy at Medill, noted that newsrooms don’t track the real cost of producing news. “This leads to undercounting the reporting and research time involved in stories and overly focusing on writing time,” he said. “Centralized functions like strategy work around subscriptions, marketing, comments and newsletters are rarely counted when thinking about the production costs. And high-profile stories can involve travel, permits, fixers, translators, specialized equipment and much more.”

One idea is that publishers calculate the cost of producing news and could be paid based on that figure. This approach might be criticized as valuing the output by the cost of the input.
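A minimal sketch of that cost-based arithmetic, with entirely hypothetical line items, shows how the pieces Gilbert lists might be rolled into an asking price:

```python
# Hypothetical cost-of-production pricing for a single story.
# Line items and amounts are illustrative; the point (per Gilbert) is that
# reporting time and centralized functions are usually undercounted.

story_costs = {
    "reporting_and_research": 40 * 75.0,  # 40 hours at an assumed $75/hr loaded rate
    "writing_and_editing": 10 * 75.0,
    "travel_and_fixers": 1_200.0,
    "allocated_overhead": 800.0,  # share of subscriptions, marketing, newsletters
}

total_cost = sum(story_costs.values())
margin = 0.20  # assumed markup over cost

asking_price = total_cost * (1 + margin)
print(f"cost ${total_cost:,.0f}, cost-plus price ${asking_price:,.0f}")
```

Even this toy version makes the critique visible: the price reflects what the story cost to make, not what it is worth to the model.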

Publishers and finance people also said they are looking at their CPM rates (cost per mille, the price of 1,000 ad impressions) as a basis of comparison, or at how much they stand to lose by being disintermediated by large language models. This approach can be criticized for valuing the input to the model by the costs imposed on legacy media, rather than by how important the news is to the model.
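The CPM approach reduces to simple arithmetic: lost ad impressions times the per-thousand rate. A sketch with assumed inputs (none of these figures come from any publisher):

```python
# Hypothetical CPM-based estimate of revenue at risk from disintermediation.
# All inputs are illustrative assumptions, not reported figures.

monthly_pageviews = 20_000_000   # assumed site traffic
share_lost_to_ai = 0.25          # assumed fraction of visits AI answers replace
ads_per_page = 2                 # assumed ad impressions per pageview
cpm = 12.0                       # assumed price per 1,000 ad impressions, in dollars

lost_impressions = monthly_pageviews * share_lost_to_ai * ads_per_page
revenue_at_risk = lost_impressions / 1_000 * cpm
print(f"${revenue_at_risk:,.0f} per month")  # $120,000 per month
```

The hardest number to defend in any such calculation is the share of traffic actually lost to AI answers, which is exactly the figure bot-analytics firms like TollBit are trying to pin down.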

Another publisher said he’s trying to figure out whether there is a way to measure how much news content users of large language models are actually surfacing, which is similar to the methodology of the FehrAdvice & Partners AG study that found that the presence of news contributes to the overall attractiveness and audience of Google and Facebook. During the negotiations around the C-18 bill in Canada in 2023, others looked at the revenue Google generated in Canada and settled on 4% of that being given to publishers.

Such approaches rest on sounder theoretical footing: They attempt to estimate what the language model would be worth with and without quality news. For uses of large language models related to current events, one might argue that all the value is associated with quality news input; for older uses related to past events, much of the archives’ value could be supplied by subsequent historical analyses.
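The counterfactual logic behind these approaches can be written down directly: value the news input as the difference between the model’s worth with and without it. Both valuations below are hypothetical placeholders, not estimates of any real company:

```python
# Counterfactual valuation sketch: price quality news input as the gap
# between what the model would be worth with and without it.
# All dollar figures are hypothetical assumptions.

value_with_news = 10_000_000_000     # assumed model valuation, news included
value_without_news = 8_500_000_000   # assumed valuation, news excluded

news_premium = value_with_news - value_without_news
premium_share = news_premium / value_with_news

print(f"news premium: ${news_premium:,.0f} ({premium_share:.0%} of value)")
```

The practical obstacle, as the negotiations over Google and Facebook showed, is that only the platforms hold the data needed to estimate the without-news counterfactual.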

In February, Bloomberg reported that Reddit had reached a deal with OpenAI, valuing Reddit’s content at $60 million and allowing it to be used as training data. Some say that Reddit content is low quality and that news publishers should get more than that. Others maintain that in this new world, the quality of the information will not determine the price, and the volume provided by Reddit is significant.

Economists say it’s possible to come up with a framework for pricing. But based on the lessons learned from trying to get Google and Facebook to pay publishers, Mateen noted the sticking points are Big Tech’s refusal to provide the necessary data, as well as its refusal to pay. If just a few companies dominate the market, then it becomes a question of monopoly power, not a question of how to do a valuation.

“The revenue model would require, as a starting point, that the LLMs acknowledge that use of the content is entitled to compensation in the first place. That’s before you get to the issue of price per view on a piece of content,” said a major digital U.S. publisher.

The question then is how you get the AI companies to pay, especially when they’ve already taken content for training purposes. Property rights are socially determined, and power matters. Australian publishers were only able to get Google and Meta to pay for news because the government stepped in to rectify the power imbalance between Big Tech and publishers, and because the 2021 News Media Bargaining Code drew support across the political spectrum: it was backed not just by Rupert Murdoch but also by Green Sen. Sarah Hanson-Young.

What impact could legal reforms and collective bargaining have?

Economists argue that collective bargaining would help publishers get a better deal, but the chances of it happening appear slight as the large publishers are already signing deals on their own. This would change if the legal framework moved either to require negotiations (as happened in Australia with the bargaining code) or to allow publishers to band together without being accused of anticompetitive behavior.

Earlier laws, written before generative AI became widespread, couldn’t have dealt with the range of questions it has posed. Legal conflicts naturally arise when trying to interpret current wording in a new context. IAC and Expedia Group chair Barry Diller has called for changes to the definition of fair use. A doctrine intended to allow information to be shared for educational purposes has morphed into a shield that lets large language models take whatever they want, he argued. There needs to be a clarification: Firms cannot simply scrape all the data that’s available without paying, and can’t regurgitate anything without paying more.

Similarly, Axel Springer CEO Mathias Döpfner has called for a new legal framework for copyright, saying, “If we don’t achieve it collectively, we will be lost. If we stick together we have a real opportunity.”

Copyright law is intended to ensure that the producer of written material gets full compensation for the economic value of what is produced. “Fair use” in academia can be viewed as a recognition that further research is always based on earlier research, and in that context, very, very limited use of prior “words” is permissible. Fair use promotes research and lowers what would otherwise be high transaction costs, with no significant impact on incentives for writing.

This rationale for fair use does not apply to AI, particularly in the case of journalism: there is the threat that it could take away all of the value of what is produced. Doing that will reduce what is produced. And then all that will be produced by the AI will reflect the old saying, “garbage in, garbage out.” Society will be worse off. The ability of media companies to produce quality journalism is tied to their receiving compensation for the costs that go into producing such journalism.

A 2023 paper noted that “there is no one-size-fits-all fix to this problem. Copyright licensing and collective bargaining schemes will likely need to be combined with other policy solutions (such as transparency and antitrust solutions) to be effective.” In the European Union, the transposition of copyright directives may provide some support for publishers.

In the U.S., lawsuits are how publishers protect themselves. “Disney’s images don’t get stolen because they sue and they have been suing for decades,” said Mateen.

Even when creators or publishers lose, lawsuits raise the costs for companies using content without paying for it. One music executive noted that there is plenty of legal precedent for crawling, indexing and training, but that companies sometimes pay for licenses because it’s easier than fighting. And if they pay for some part of the content, they can get even more creative when they use it.

“YouTube won the Viacom case but they still ended up licensing. It’s easier to pay for licensing than to be in constant conflict,” said one Silicon Valley-based executive.

Establishing a rough methodology for compensation, widely viewed as fair, may be better than the rough and tumble of endless litigation, which raises time and financial costs for everyone. An agreement regarding fair valuation would reward not only the words ingested but also the quality of the words (perhaps roughly measured by the relevant costs of producing them). It could also include penalties for misquoting or misrepresenting, as charged in the current New York Times lawsuit against OpenAI.

The age of AI threatens to further weaken the link between who benefits from the value produced by journalism (the large language models) and those who produce the content. Freeriding off the content produced by journalists and publishers will ultimately weaken both journalism and the quality of the large language models.

Mateen noted that this problem is not going to be solved by changing the business model or selling subscriptions.

“Why would someone subscribe to content when they can see a summary of it through an LLM at a fraction of the cost?”

Joseph E. Stiglitz contributed to this piece.

Support high-integrity, independent journalism that serves democracy. Make a gift to Poynter today. The Poynter Institute is a nonpartisan, nonprofit organization, and your gift helps us make good journalism better.
Anya Schiffrin is a senior lecturer at Columbia University’s School of International and Public Affairs and writes regularly on the bargaining codes.