Why Google’s AI Overviews gets things wrong (2024)

Unfortunately, AI systems are inherently unreliable. Within days of AI Overviews’ release in the US, users were sharing examples of responses that were strange at best. It suggested that users add glue to pizza or eat at least one small rock a day, and that former US president Andrew Johnson earned university degrees between 1947 and 2012, despite dying in 1875.

On Thursday, Liz Reid, head of Google Search, announced that the company has been making technical improvements to the system to make it less likely to generate incorrect answers, including better detection mechanisms for nonsensical queries. It is also limiting the inclusion of satirical, humorous, and user-generated content in responses, since such material could result in misleading advice.

But why is AI Overviews returning unreliable, potentially dangerous information? And what, if anything, can be done to fix it?

How does AI Overviews work?

In order to understand why AI-powered search engines get things wrong, we need to look at how they’ve been optimized to work. We know that AI Overviews uses a new generative AI model in Gemini, Google’s family of large language models (LLMs), that’s been customized for Google Search. That model has been integrated with Google’s core web ranking systems and designed to pull out relevant results from its index of websites.

Most LLMs simply predict the next word (or token) in a sequence, which makes them appear fluent but also leaves them prone to making things up. They have no ground truth to rely on, but instead choose each word purely on the basis of a statistical calculation. That leads to hallucinations. It’s likely that the Gemini model in AI Overviews gets around this by using an AI technique called retrieval-augmented generation (RAG), which allows an LLM to check specific sources outside of the data it’s been trained on, such as certain web pages, says Chirag Shah, a professor at the University of Washington who specializes in online search.

Once a user enters a query, it’s checked against the documents that make up the system’s information sources, and a response is generated. Because the system is able to match the original query to specific parts of web pages, it’s able to cite where it drew its answer from—something normal LLMs cannot do.

One major upside of RAG is that the responses it generates to a user’s queries should be more up to date, more factually accurate, and more relevant than those from a typical model that just generates an answer based on its training data. The technique is often used to try to prevent LLMs from hallucinating. (A Google spokesperson would not confirm whether AI Overviews uses RAG.)

So why does it return bad answers?

But RAG is far from foolproof. In order for an LLM using RAG to come up with a good answer, it has to both retrieve the information correctly and generate the response correctly. A bad answer results when one or both parts of the process fail.

It’s not just about bad training data

Although the pizza glue blunder is a good example of a case where AI Overviews pointed to an unreliable source, the system can also generate misinformation from factually correct sources. Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico, googled “How many Muslim presidents has the US had?’” AI Overviews responded: “The United States has had one Muslim president, Barack Hussein Obama.”

While Barack Obama is not Muslim, making AI Overviews’ response wrong, it drew its information from a chapter in an academic book titled Barack Hussein Obama: America’s First Muslim President? So not only did the AI system miss the entire point of the essay, it interpreted it in the exact opposite of the intended way, says Mitchell. “There’s a few problems here for the AI; one is finding a good source that’s not a joke, but another is interpreting what the source is saying correctly,” she adds. “This is something that AI systems have trouble doing, and it’s important to note that even when it does get a good source, it can still make errors.”

Can the problem be fixed?

Ultimately, we know that AI systems are unreliable, and so long as they are using probability to generate text word by word, hallucination is always going to be a risk. And while AI Overviews is likely to improve as Google tweaks it behind the scenes, we can never becertain it’ll be 100% accurate.

Google has said that it’s adding triggering restrictions for queries where AI Overviews were not proving to be especially helpful and has added additional “triggering refinements” for queries related to health. The company could add a step to the information retrieval process designed to flag a risky query and have the system refuse to generate an answer in these instances, says Verberne. Google doesn’t aim to show AI Overviews for explicit or dangerous topics, or for queries that indicate a vulnerable situation, the company spokesperson says.

Techniques like reinforcement learning from human feedback, which incorporates such feedback into an LLM’s training, can also help improve the quality of its answers.

Similarly, LLMs could be trained specifically for the task of identifying when a question cannot be answered, and it could also be useful to instruct them to carefully assess the quality of a retrieved document before generating an answer, Verbene says: “Proper instruction helps a lot!”

Although Google has added a label to AI Overviews answers reading “Generative AI is experimental,” it should consider making it much clearer that the feature is in beta and emphasizing that it is not ready to provide fully reliable answers, says Shah. “Until it’s no longer beta—which it currently definitely is, and will be for some time— it should be completely optional. It should not be forced on us as part of core search.”

Why Google’s AI Overviews gets things wrong (2024)

How does AI Overviews work?

So why does it return bad answers?

It’s not just about bad training data

Can the problem be fixed?

References