Why AI Chatbots Lie and How We Can Fix It
The Problem of Confident AI Hallucinations
A recent research paper from OpenAI examines a persistent issue with large language models (LLMs) like GPT-4: hallucinations, which OpenAI defines as “plausible but false statements generated by language models.” In a blog post summarizing the findings, the company admits that despite significant progress, these fabrications remain a “fundamental challenge” that may never be fully eradicated.
To highlight the problem, the researchers shared a telling experiment. They asked a widely used chatbot for the title of the Ph.D. dissertation written by Adam Tauman Kalai, one of the paper's authors. The chatbot confidently gave three different answers, all of them incorrect. When asked for Kalai's birthday, the model again offered three distinct dates, none of which were right. This raises a critical question: how can an AI be so consistently wrong yet sound so sure of itself?
Tracing Hallucinations to Training Incentives
The researchers suggest that hallucinations stem partly from the model's pretraining process. LLMs are trained to predict the next word in a sequence based on vast amounts of text data. This data consists only of "positive examples of fluent language," without any labels to distinguish fact from fiction. The model's job is simply to approximate the overall distribution of language.
"Spelling and parentheses follow consistent patterns, so errors there disappear with scale," the paper notes. However, "arbitrary low-frequency facts, like a pet’s birthday, cannot be predicted from patterns alone and hence lead to hallucinations."
While pretraining plays a role, the paper's proposed solution focuses more on the evaluation process. The authors argue that current evaluation methods don't directly cause hallucinations but do "set the wrong incentives," effectively encouraging models to lie.
A New Approach to AI Evaluation
The researchers draw an analogy to multiple-choice tests where there's no penalty for wrong answers. On such a test, a student is incentivized to guess on every question because "you might get lucky and be right," whereas leaving an answer blank "guarantees a zero."
Current AI evaluations operate similarly. "When models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know,’” the paper explains.
The proposed fix is to adopt a scoring system similar to tests like the SAT, which may include "negative [scoring] for wrong answers or partial credit for leaving questions blank to discourage blind guessing." For AI, this means evaluations must "penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty."
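The incentive argument can be made concrete with a back-of-the-envelope sketch, using an illustrative scoring rule rather than anything specified in the paper. Suppose a model's guess on a question is right with probability p, and saying "I don't know" always scores zero; compare accuracy-only grading with a hypothetical scheme that charges one point for a wrong answer.

```python
def expected_guess_score(p: float, right: float = 1.0, wrong: float = 0.0) -> float:
    """Expected score of guessing when the guess is correct with probability p."""
    return p * right + (1 - p) * wrong

ABSTAIN = 0.0  # "I don't know" earns zero under both schemes in this sketch

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    acc_only = expected_guess_score(p, wrong=0.0)    # accuracy-only: errors cost nothing
    penalized = expected_guess_score(p, wrong=-1.0)  # illustrative penalty for a confident error
    print(f"p={p:.1f}  accuracy-only: guess {acc_only:+.2f} vs idk {ABSTAIN:+.2f} | "
          f"penalized: guess {penalized:+.2f} vs idk {ABSTAIN:+.2f}")
```

Under accuracy-only grading, any nonzero chance of being right makes guessing strictly better than abstaining, so the rational strategy is to always answer. With the illustrative minus-one penalty, guessing only pays off when p exceeds 0.5; the size of the penalty sets the confidence threshold below which admitting uncertainty becomes the better move.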
The researchers stress that this change must be fundamental. It isn't enough to add a few new uncertainty-aware tests on the side. Instead, the "widely used, accuracy-based evals need to be updated so that their scoring discourages guessing." The conclusion is clear: "If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess."