One of the strengths of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press conferences and demos, Google has repeatedly claimed that the models can perform tasks that were previously impossible thanks to their “long context,” such as summarizing multiple documents of hundreds of pages or searching through scenes from movie footage.
But new research suggests that, in fact, the models aren't very good at those things.
Two separate studies investigated how well Google's Gemini models and others make sense of enormous amounts of data — think works the length of “War and Peace.” Both find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large data sets correctly; in one series of document-based tests, the models gave the right answer only 40% and 50% of the time, respectively.
“While models like Gemini 1.5 Pro can technically process long contexts, we've seen many cases that indicate the models don't actually 'understand' the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told TechCrunch.
Gemini's missing context
A model's context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question like “Who won the 2020 US presidential election?” can serve as context, as can a movie script, a show, or an audio clip. And as context windows grow, so does the size of the documents that fit in them.
Newer versions of Gemini can accept more than 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That’s roughly equivalent to 1.4 million words, two hours of video, or 22 hours of audio — the largest context of any commercially available model.
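For a sense of scale, here is a back-of-the-envelope sketch in Python; the words-per-token and words-per-page ratios are rough assumptions implied by the figures above, not official conversion factors.

```python
# Back-of-the-envelope conversion between the token and word counts quoted above.
# The ~0.7 words-per-token ratio is an approximation implied by the quoted figures
# (2 million tokens ~ 1.4 million words); real tokenizers vary by text and language.

CONTEXT_TOKENS = 2_000_000
WORDS_PER_TOKEN = 0.7   # assumption derived from the article's numbers
WORDS_PER_PAGE = 500    # a common rough estimate, not a Google figure

approx_words = CONTEXT_TOKENS * WORDS_PER_TOKEN
approx_pages = approx_words / WORDS_PER_PAGE

print(f"~{approx_words:,.0f} words, roughly {approx_pages:,.0f} pages")
# -> ~1,400,000 words, roughly 2,800 pages
```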
In a briefing earlier this year, Google showed off several pre-recorded demos aimed at illustrating the potential of Gemini's long-context capabilities. In one of them, Gemini 1.5 Pro searched the transcript of the Apollo 11 moon landing broadcast (about 402 pages) for quotes containing jokes and then found a scene in the broadcast that looked similar to a pencil sketch.
Oriol Vinyals, vice president of research at Google DeepMind, who led the briefing, described the model as “magical.”
“[1.5 Pro] performs these kinds of reasoning tasks on every page, on every word,” he said.
It might have been an exaggeration.
In one of the aforementioned studies comparing these capabilities, Karpinska, along with researchers at the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn't “cheat” by relying on prior knowledge, and peppered the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.
Faced with a statement like “Using her abilities as Apoth, Nusis is able to reverse engineer the type of portal opened by the reagent key found in Rona's wooden chest,” Gemini 1.5 Pro and 1.5 Flash, after ingesting the relevant book, had to say whether the statement was true or false and explain their reasoning.
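A minimal sketch of that setup, for illustration only; `query_gemini` and the prompt wording are hypothetical stand-ins, not the researchers' actual harness.

```python
# Minimal sketch of the claim-verification test described above, not the
# researchers' actual code. `query_gemini` is a hypothetical stand-in for
# whatever API call feeds the book and claim to the model.

def claim_accuracy(book_text, claims, query_gemini):
    """Ask the model to label each claim true/false given the full book,
    then score it against the gold labels."""
    correct = 0
    for statement, gold_label in claims:  # gold_label: "true" or "false"
        prompt = (
            f"{book_text}\n\n"
            "Based only on the book above, is the following statement true "
            f"or false? Explain your reasoning.\n\nStatement: {statement}"
        )
        answer = query_gemini(prompt)
        if answer.strip().lower().startswith(gold_label):
            correct += 1
    return correct / len(claims)
```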
The researchers tested a book approximately 260,000 words (~520 pages) in length and found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip answers questions about the book significantly more accurately than Google's latest machine learning models. Averaging across all the benchmark results, neither model managed to achieve better-than-chance accuracy at answering the questions.
“We have observed that the models have a harder time verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models have difficulty verifying claims about implicit information that is clear to a human reader but is not explicitly stated in the text.”
The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos, that is, to search through and answer questions about their content.
The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in them (e.g., “What cartoon character is on this cake?”). To evaluate the models, they chose one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.
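A rough sketch of how such a test sequence might be assembled; the function and its parameters are illustrative assumptions, not the study's published code.

```python
# Sketch of the "slideshow" construction described above: one target image is
# placed at a random position among distractor images, and the model is later
# asked a question that only the target image can answer.
import random

def build_slideshow(target_image, distractor_pool, num_frames=25):
    frames = random.sample(distractor_pool, num_frames - 1)
    position = random.randrange(num_frames)
    frames.insert(position, target_image)  # bury the target among distractors
    return frames, position
```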
Flash did not perform well. In a test in which the model had to transcribe six handwritten digits from a 25-image “slideshow,” Flash got around 50% of the transcriptions right. Accuracy dropped to around 30% with eight digits.
“Real question-answering tasks over images seem to be particularly hard for all the models we tested,” Michael Saxon, a doctoral student at UC Santa Barbara and one of the study's co-authors, told TechCrunch. “That small amount of reasoning (recognizing that a number is in a frame and reading it) could be what's breaking the model.”
Google promises too much with Gemini
Neither study has been peer-reviewed, nor did they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts (both tested the 1-million-token versions). And Flash isn't meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.
However, both add fuel to the fire of concerns that Google has been overpromising (and underdelivering) with Gemini from the start. None of the models the researchers tested, including OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, performed well. But Google is the only model provider to have given the context window top billing in its advertisements.
“There's nothing wrong with simply stating, 'Our model can take x amount of tokens' based on objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”
Generative AI, more broadly, is coming under increased scrutiny as businesses (and investors) grow increasingly frustrated with the technology's limitations.
In a pair of recent Boston Consulting Group surveys, about half of respondents (all C-suite executives) said they don't expect generative AI to drive substantial productivity gains and that they're worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, early-stage generative AI dealmaking has declined, plummeting 76% from its peak in the third quarter of 2023.
With chatbots that summarize meetings yet conjure up fictitious details about people, and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, sometimes clumsily, to catch up with its generative AI rivals, was desperate to make Gemini's context one of those differentiators.
But it seems that the bet was premature.
“We haven't settled on a way to actually demonstrate that 'reasoning' or 'understanding' over long documents is taking place, and basically every group publishing these models is cobbling together their own ad hoc assessments to make these claims,” Karpinska said. “Without knowing how long-context processing is implemented (and companies do not share these details), it is difficult to say how realistic these claims are.”
Google did not respond to a request for comment.
Both Saxon and Karpinska believe the antidotes to exaggerated claims about generative AI are better benchmarks and, along the same lines, a greater emphasis on third-party critique. Saxon notes that one of the most common tests for long context, the “needle in the haystack” (cited liberally by Google in its marketing materials), only measures a model's ability to retrieve particular pieces of information, such as names and numbers, from data sets, not to answer complex questions about that information.
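For contrast, here is roughly what a needle-in-the-haystack check amounts to; `query_model`, the prompt wording, and the pass criterion are hypothetical simplifications rather than any vendor's actual benchmark code.

```python
# Minimal sketch of a "needle in a haystack" retrieval test: a single fact is
# buried at a random depth in long filler text and the model is asked to find
# it. `query_model` is a hypothetical API call; the string match is a crude
# stand-in for real scoring.
import random

def needle_in_haystack(filler_paragraphs, needle, question, expected, query_model):
    docs = list(filler_paragraphs)
    docs.insert(random.randrange(len(docs) + 1), needle)  # bury the needle
    prompt = "\n\n".join(docs) + f"\n\nQuestion: {question}\nAnswer concisely."
    answer = query_model(prompt)
    return expected.lower() in answer.lower()  # pass if the fact is recovered
```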
“All the scientists and most of the engineers who use these models basically agree that our current benchmark culture is broken,” Saxon said, “so it's important for the public to understand that they should take these gigantic reports touting numbers like 'general intelligence across all benchmarks' with a big grain of salt.”