When OpenAI, a San Francisco startup, introduced its online chatbot ChatGPT late last year, millions were captivated by the human-like way it answered questions, wrote poetry, and discussed almost any topic. But it took most people a while to realize that this new type of chatbot often makes things up.
When Google introduced a similar chatbot several weeks later, it spewed nonsense about the James Webb telescope. The next day, Microsoft’s new Bing chatbot offered all kinds of false information about Gap, Mexican nightlife, and singer Billie Eilish. Then, in March, ChatGPT cited a half-dozen bogus court cases while writing a 10-page legal brief that a lawyer submitted to a federal judge in Manhattan.
Now, a new company called Vectara, founded by former Google employees, is trying to find out how often chatbots deviate from the truth. The company’s research estimates that, even in situations designed to prevent it, chatbots invent information at least 3 percent of the time, and as often as 27 percent.
Experts call this chatbot behavior “hallucination.” It may not be a problem for people who play with chatbots on their personal computers, but it is a serious problem for anyone who uses this technology with court documents, medical information, or sensitive business data.
Because these chatbots can respond to almost any request in an unlimited number of ways, there is no way to definitively determine how often they hallucinate. “You would have to look at all the information in the world,” said Simon Hughes, the Vectara researcher who led the project.
Dr. Hughes and his team asked these systems to perform a simple, straightforward task that could be easily verified: summarize news articles. Even then, the chatbots persistently invented information.
“We gave the system 10 to 20 pieces of data and asked for a summary of that data,” said Amr Awadallah, CEO of Vectara and a former Google executive. “That the system can still introduce errors is a fundamental problem.”
Researchers maintain that when these chatbots perform other tasks (beyond just summarizing), hallucination rates may be higher.
Their research also showed that hallucination rates vary widely among the major AI companies. OpenAI’s technologies had the lowest rate, around 3 percent. Systems from Meta, which owns Facebook and Instagram, hovered around 5 percent. The Claude 2 system offered by Anthropic, an OpenAI rival also based in San Francisco, topped 8 percent. A Google system, Palm Chat, had the highest rate, at 27 percent.
An Anthropic spokeswoman, Sally Aldous, said: “Making our systems useful, honest and harmless, which includes preventing hallucinations, is one of our main goals as a company.”
Google declined to comment, and OpenAI and Meta did not immediately respond to requests for comment.
With this research, Dr. Hughes and Mr. Awadallah want to show people that they should be wary of information that comes from chatbots, including from the service Vectara sells to companies. Many companies now offer this kind of technology for business use.
Headquartered in Palo Alto, California, Vectara is a 30-person startup backed by $28.5 million in seed funding. One of its founders, Amin Ahmad, a former Google artificial intelligence researcher, has been working with this type of technology since 2017, when it was incubated within Google and a handful of other companies.
Just as Microsoft’s Bing search chatbot can retrieve information from the open Internet, Vectara’s service can retrieve information from a company’s private collection of emails, documents and other files.
The researchers also hope that their methods, which they share publicly and will continue to update, will help spur industry-wide efforts to reduce hallucinations. OpenAI, Google and others are working to minimize the problem using a variety of techniques, although it is unclear if they will be able to eliminate it.
“A good analogy is a self-driving car,” said Philippe Laban, a Salesforce researcher who has long explored this type of technology. “You can’t prevent a self-driving car from crashing. But you can try to make sure it is safer than a human driver.”
Chatbots like ChatGPT run on a technology called a large language model, or LLM, which learns its skills by analyzing huge amounts of digital text, including books, Wikipedia articles, and online chat logs. By identifying patterns in all that data, an LLM learns to do one thing in particular: guess the next word in a sequence of words.
Because the Internet is full of false information, these systems can repeat those falsehoods. They also rely on probabilities: What is the mathematical likelihood that the next word is “playwright”? Occasionally they guess incorrectly.
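A minimal sketch of that guessing step, with made-up numbers: the candidate words and their probabilities below are invented for illustration, while a real model scores tens of thousands of tokens with a neural network at every step.

```python
import random

# Toy next-word prediction. The candidate words and probabilities are
# invented; a real LLM computes such a distribution at every step.
next_word_probs = {
    "playwright": 0.62,
    "poet": 0.30,
    "physicist": 0.05,
    "spacecraft": 0.03,   # nonsense, yet it still carries some probability
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Draw one word in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Most draws produce a sensible continuation, but occasionally the model
# "guesses incorrectly" -- the statistical root of a hallucination.
print(sample_next_word(next_word_probs))
```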
New research from Vectara shows how this can happen. When summarizing news articles, chatbots do not repeat falsehoods from other parts of the Internet. They simply get the summary wrong.
For example, the researchers asked Google’s large language model, Palm Chat, to summarize this short passage from a news article:
The plants were found during a search of a warehouse near Ashbourne on Saturday morning. Police said they were in “an elaborate grow house.” A man in his 40s was arrested at the scene.
It offered this summary, completely inventing a value for the plants the man was growing and assuming, perhaps incorrectly, that they were cannabis plants:
Police arrested a man in his 40s after cannabis plants worth an estimated £100,000 were found in a warehouse near Ashbourne.
This phenomenon also shows why a tool like Microsoft’s Bing chatbot can get things wrong when it retrieves information from the Internet. If you ask the chatbot a question, it can call on Microsoft’s Bing search engine and run an Internet search. But it has no way of pinpointing the correct answer. It takes the results of that search and summarizes them.
Sometimes that summary is deeply flawed. Some bots even cite Internet addresses that are entirely made up.
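The retrieve-then-summarize pattern described above can be sketched roughly as follows, using toy stand-ins rather than Bing’s or Vectara’s real components: `search` scans a small in-memory list of snippets, and `summarize_with_llm` only builds the prompt that a real system would hand to a language model.

```python
# Toy retrieve-then-summarize sketch; not any vendor's actual pipeline.
DOCUMENTS = [
    "The warehouse near Ashbourne was searched on Saturday morning.",
    "Police described the site as an elaborate grow house.",
    "A man in his 40s was arrested at the scene.",
]

def search(query: str) -> list[str]:
    """Toy retrieval: return snippets that share a word with the query."""
    words = set(query.lower().split())
    return [doc for doc in DOCUMENTS if words & set(doc.lower().split())]

def summarize_with_llm(question: str, snippets: list[str]) -> str:
    """Build the prompt a real system would send to a model. The model's
    summary can still include details found in none of the snippets."""
    return (
        "Answer using only the sources below; say so if they are silent.\n\n"
        + "\n\n".join(snippets)
        + f"\n\nQuestion: {question}"
    )

print(summarize_with_llm("What happened near Ashbourne?",
                         search("the Ashbourne warehouse")))
```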
Companies like OpenAI, Google, and Microsoft have developed ways to improve the accuracy of their technologies. OpenAI, for example, is trying to refine its technology with feedback from human evaluators, who rate the chatbot’s responses, separating useful and truthful responses from those that are not. Then, using a technique called reinforcement learning, the system spends weeks analyzing the ratings to better understand what is fact and what is fiction.
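A very rough sketch of that feedback loop, not OpenAI’s actual pipeline: human raters compare two answers to the same prompt, and their preferences train a reward model that later guides the reinforcement learning step. The scores below are invented reward-model outputs for illustration.

```python
import math

# Hypothetical rater comparisons: each pair holds the reward model's score
# for the answer the human preferred and for the answer the human rejected.
comparisons = [
    {"chosen": 2.1, "rejected": 0.4},   # reward model agrees with the rater
    {"chosen": 1.3, "rejected": 1.6},   # reward model disagrees with the rater
]

def pairwise_loss(chosen: float, rejected: float) -> float:
    """Bradley-Terry style loss: small when the preferred answer scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(chosen - rejected))))

# Training drives this total down, nudging the reward model toward the
# raters' judgments of what is useful and truthful.
total = sum(pairwise_loss(c["chosen"], c["rejected"]) for c in comparisons)
print(f"reward-model loss: {total:.3f}")
```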
But researchers warn that chatbot hallucination is not an easy problem to solve. Because chatbots learn from patterns in data and operate according to probabilities, they sometimes behave in unintended ways.
To determine how often chatbots stumbled when summarizing news articles, Vectara researchers used another large language model to verify the accuracy of each summary. Only in this way could such a large number of summaries be effectively checked.
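The shape of such an automated check can be sketched as follows. This is not Vectara’s exact method, which uses another language model as the judge; as a crude, runnable stand-in, the snippet flags a summary that contains a number absent from the source, the precise failure in the £100,000 example above.

```python
import re

def contains_unsupported_number(source: str, summary: str) -> bool:
    """Return True if the summary mentions a number the source never does."""
    source_numbers = set(re.findall(r"\d[\d,]*", source))
    summary_numbers = set(re.findall(r"\d[\d,]*", summary))
    return bool(summary_numbers - source_numbers)

source = ('The plants were found during a search of a warehouse near Ashbourne '
          'on Saturday morning. Police said they were in "an elaborate grow house." '
          'A man in his 40s was arrested at the scene.')
summary = ("Police arrested a man in his 40s after cannabis plants worth an "
           "estimated £100,000 were found in a warehouse near Ashbourne.")

print(contains_unsupported_number(source, summary))  # True: 100,000 is invented
```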
But James Zou, a Stanford computer science professor, said this method comes with a caveat. The language model that performs the verification can also make errors.
“The hallucination detector could be fooled or hallucinate itself,” he said.