Robots have a long history of replacing humans in the workplace. Automobile workers, elevator operators, bank tellers, gas station attendants, even grocery store checkout clerks have felt the squeeze.
With ground-breaking artificial intelligence advances, reporters at PolitiFact had to wonder … are the fact-checkers next?
Artificial intelligence, or AI, has exploded into relevance in the past few months after ChatGPT’s public rollout. As people experimented with the new tool, concern grew about AI’s potential to make us humans obsolete. PolitiFact wanted to conduct an experiment: Could ChatGPT fact-check better than the professionals?
We selected 40 PolitiFact claims from across a range of subjects, Truth-O-Meter ratings and claim types, and put the AI to the test. Using a free ChatGPT account, we asked two questions per claim and kept a detailed record of its answers.
Evidence suggests fact-checkers can breathe a sigh of relief. Our test results reveal that AI is not yet a reliable fact-checking tool. It lacks contemporary knowledge, loses perspective, and tells you what you want to hear, not always the truth. But some researchers hope to harness AI’s power to help fact-checkers identify claims and debunk the ever-growing pool of misinformation.
Sometimes ChatGPT got things right
ChatGPT is a type of AI called a “large language model” that uses huge amounts of data to understand language in context and reproduce it in novel ways. “They have basically gulped all of the information that they can and then are trying to put that information together in cohesive ways that they think you want,” said Bill Adair, journalism professor at Duke University and founder of PolitiFact, who has been researching how AI can be used in fact-checking work.
It does this seemingly miraculous task by using a series of probabilities to predict the next word in a sentence that would be most helpful to you, said Terrence Neumann, a doctoral student researching information systems at the University of Texas. These models can then be fine-tuned by people to provide the ideal response.
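To make the idea concrete, here is a deliberately tiny sketch of next-word prediction in Python. The hand-written probability table and function name are our own hypothetical stand-ins; a real model such as ChatGPT learns its probabilities from vast text corpora with a neural network, not a lookup table.

```python
# Hypothetical, hand-written probabilities standing in for the billions of
# learned parameters in a real large language model.
NEXT_WORD_PROBS = {
    ("the", "claim", "is"): {"false": 0.45, "true": 0.30, "misleading": 0.25},
}

def predict_next_word(context):
    """Return the most probable next word for a context (greedy decoding)."""
    probs = NEXT_WORD_PROBS[context]
    return max(probs, key=probs.get)

print(predict_next_word(("the", "claim", "is")))  # prints "false"
```

Repeating that prediction step one word at a time is how a chatbot builds up a whole fluent answer.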
For a few of the claims we tested, it worked seamlessly. When asked about a claim by Sen. Thom Tillis, R-N.C., regarding an amnesty bill by President Joe Biden, it assigned the same Half True rating that PolitiFact did and explored the nuances we shared in our article. It also thoroughly debunked several voter fraud claims from the 2020 election.
It corrected a quote misattributed to President Abraham Lincoln, confirmed that Listerine is not a bug repellent and asserted that the sun does not prevent skin cancer.
But in half of the 40 tests, the AI either made a mistake, refused to answer or came to a different conclusion than the fact-checkers. It was rarely completely wrong, but subtle differences led to inaccuracies and inconsistencies, making it an unreliable resource.
Frozen in time
ChatGPT’s free version is limited by what is called a “knowledge cutoff.” It does not have access to any data after September 2021, meaning it is blissfully unaware of big global events such as Queen Elizabeth II’s death and Russia’s invasion of Ukraine.
This cutoff impedes ChatGPT’s usefulness. Rarely are people fact-checking events that happened two years ago, and in the digital age there is constantly new data, political events, and groundbreaking research that could change the accuracy rating of a statement.
ChatGPT is mostly aware of its frozen state. In almost every response, it offered some variation of this caveat: “As an AI language model, I don’t have real-time information or access to news updates beyond my September 2021 knowledge cutoff.”
It occasionally used this as an excuse to refuse to rate a claim, even when the event happened before September 2021. Sometimes the cutoff led to more consequential errors: ChatGPT confused a new Ron DeSantis bill with an old one, and it incorrectly rated a claim about the RESTRICT Act because it had no idea the legislation existed.
With no citations or links included in its responses, it was hard to know where ChatGPT was getting its information.
Newer chatbots such as Google’s Bard and Microsoft’s Bing can surf the web and respond to current events, which Neumann said is the direction most are headed.
All over the place
Another challenge? “It’s wildly inconsistent,” said Adair. “Sometimes you get answers that are accurate and helpful, and other times you don’t.”
It surfaced different answers depending on how a question was phrased and the order in which questions were asked. Sometimes asking the same question twice resulted in two distinct ratings.
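One well-documented reason for that kind of inconsistency, offered here as context rather than something PolitiFact measured, is sampling: chat models typically draw the next word at random, weighted by probability, instead of always taking the single likeliest option. A minimal sketch of why two identical prompts can diverge:

```python
import random

# The same toy distribution as before; the numbers are hypothetical.
probs = {"false": 0.45, "true": 0.30, "misleading": 0.25}

def sample_next_word(probs):
    """Draw one next word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# "Asking the same question twice" can legitimately yield different words,
# and over a whole answer, different conclusions.
print(sample_next_word(probs))
print(sample_next_word(probs))
```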
Crucial to understand: ChatGPT is not worried about checking for accuracy. It is focused on giving users the answers they are looking for, said David Corney, senior data scientist at Full Fact, a U.K. fact-checking site. For that reason, the prompt itself can lead to different responses.
For example, we tested two different, but similar claims:
Says Vice President Kamala Harris said, “American churches are PROPAGANDA CENTERS for intolerant homophobic, xenophobic vitriol.”
Says Rep. Marjorie Taylor Greene, R-Ga., said, “Jesus loves the U.S. most and that is why the Bible is written in English.”
PolitiFact rated both claims False, as there was no evidence that either woman said such a thing. ChatGPT also found no evidence or record of these statements, but it rated the claim about Harris “categorically” false while refusing to rate the claim about Greene because of uncertainty.
ChatGPT would also get random bursts of confidence, switching between finding mixed evidence and making decisive statements. Other times, it would refuse to rate the claim with little to no explanation.
“As they try to produce useful content, they produce content that is not accurate,” Adair said.
Missing the subtlety
Other times, ChatGPT’s Achilles heel was its literalness.
For example, back in July 2021, PolitiFact rated a claim that George Washington mandated smallpox vaccinations among his Continental Army troops Mostly True. While vaccines did not yet exist, he did order his troops to be inoculated using the method of the era, called “variolation.” But ChatGPT rated the same claim False simply because smallpox vaccines did not literally exist at the time.
In other instances, it would assign the same rating PolitiFact did but fail to capture the context of a wider conspiracy theory, such as why a claim about a NASA movie studio would be relevant, or how theories about mRNA codes on streetlights connect to COVID-19 vaccine fears. It also failed to explain why someone might believe the claim, as a journalist would.
Although ChatGPT can sound authoritative, its lack of humanity is clear. It has “no concept of truth or accuracy,” Corney said. “This means that although the output is always fluent, grammatically correct and relevant,” it can make mistakes, sometimes those that a human would never make.
Just plain wrong
ChatGPT would occasionally get it completely wrong. In one test about oil reserves in the Permian Basin, it pulled all the right data but did the math wrong, leading it to the opposite conclusion. In two other instances, it was completely unaware of a decade-old law banning whole milk in schools, and it couldn’t find evidence of a statistic about overdose deaths despite citing the exact study the statistic came from.
Several experts warned of the chatbot’s tendency to “hallucinate,” meaning it cites events, books and articles that never existed.
“ChatGPT has one crucial flaw, which is that it doesn’t know when it doesn’t know something,” said Mike Caulfield, a research scientist at the University of Washington’s Center for an Informed Public. “And so it just will make things up.”
All the experts agreed that ChatGPT is not yet reliable or accurate enough to be used as a fact-checker. The technology is improving, but Corney said it is “extremely challenging” to get an AI to reliably determine or recognize the truth. Current research is focused on improving fluency and relevance, but accuracy remains a bigger mystery.
“They’re gonna get better and we’re gonna see fewer mistakes,” Adair said. “But I think we’re still a long way away from generative AI being a dependable fact checker.”
A tool for fact-checkers
Corney said Full Fact’s developers are working on a variety of AI tools that transcribe and analyze news reports, social media posts and government transcripts to help fact-checkers detect claims that might be ripe for fact-checking. In that scenario, the tool might look for matches of claims that have already been fact-checked. It could also try to verify statistical claims against official figures.
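Full Fact has not published its implementation here, but the claim-matching idea can be sketched with off-the-shelf tools. The example below uses TF-IDF cosine similarity from scikit-learn with made-up claims; production systems typically rely on much richer neural sentence embeddings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# An archive of previously fact-checked claims (made-up examples).
checked_claims = [
    "The sun prevents skin cancer.",
    "Listerine repels mosquitoes.",
    "George Washington mandated smallpox inoculation for his troops.",
]

# A new statement spotted in a transcript or social media post.
new_claim = "Sunlight protects you from skin cancer."

# Represent every claim as a TF-IDF vector, then score the new claim
# against the archive; a high cosine similarity suggests a likely repeat.
vectorizer = TfidfVectorizer().fit(checked_claims + [new_claim])
scores = cosine_similarity(
    vectorizer.transform([new_claim]),
    vectorizer.transform(checked_claims),
)[0]

best = scores.argmax()
print(f"Closest match ({scores[best]:.2f}): {checked_claims[best]}")
```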
Other groups, such as the Duke Reporters’ Lab, and researchers such as Neumann are studying how fact-checkers can incorporate AI as a fact-checking support tool. “I think that it has a lot of potential, like maybe prioritizing certain misinformation for fact checking by fact checkers,” Neumann said.
Caulfield said this technology works best when used by someone who can evaluate the output’s accuracy.
“It’s hard to know when ChatGPT can be relied on, unless you already know the answer to the question!” Corney said. But he worries that the general public, which may not have background information on a given topic, could be easily misled by ChatGPT’s confident response.
And Caulfield said that even as AI gets better, fact-checkers won’t be totally out of a job. “To the extent that ChatGPT knows anything, is because someone found it and reported on it,” he said. “So, you can’t really replace fact-checkers.”
Phew!
To see all of ChatGPT’s responses to our queries, click here.
This fact check was originally published by PolitiFact, which is part of the Poynter Institute. See the sources for this fact check here.