Google Releases Gemini 1.5 with Insanely Long Context Window & Retrieval Accuracy

Google has released the next major update to its Gemini LLM, now at version 1.5. Among other improvements, it brings the longest context window of any major LLM and, quite surprisingly, near-perfect token recall even on very long inputs.

It’s only been two months since Google launched Gemini, and the company has now rolled out Gemini 1.5 to developers and enterprises, with a consumer release to follow later. Gemini 1.5 Pro, the version that is (or will be) used in Google’s chatbot (previously Bard), performs close to Gemini 1.0 Ultra, Google’s most advanced model, which was recently bundled with a Google One subscription. The new model is faster and more efficient thanks to a “Mixture of Experts” architecture: the model is divided into specialized sub-networks (“experts”), and only the relevant ones are activated for a given query rather than the entire network.
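To make the Mixture-of-Experts idea concrete, here is a minimal, purely illustrative sketch of top-1 expert routing. Gemini’s actual architecture is not public, and every name and dimension below is a made-up toy assumption:

```python
# Toy top-1 Mixture-of-Experts routing (illustrative only; Gemini's real
# architecture is not public). All names and sizes here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, DIM = 4, 8

# Each "expert" is a tiny feed-forward layer with its own weight matrix.
expert_weights = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_forward(x):
    """Route the input to a single expert chosen by the gating network."""
    scores = x @ gate_weights        # one gating score per expert
    chosen = int(np.argmax(scores))  # top-1 routing
    # Only the chosen expert's weights are multiplied -> less compute per token.
    return x @ expert_weights[chosen], chosen

x = rng.standard_normal(DIM)
out, expert_id = moe_forward(x)
print(expert_id, out.shape)
```

The efficiency win is that, per token, only one expert’s weights are touched instead of all four, which is the same reason a sparse MoE model can be cheaper to run than a dense model of the same total size.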

However, the most notable upgrade is the much longer context window.

The context window is the amount of input the model can process at a time. A good analogy is blog posts: a model with a small context window might only take in a 500-word article at once, while one with a longer window could take in a 1,000-word article. When you then ask about the article’s content, the model has to attend over everything in that window, and the longer the window, the more computation that takes. LLMs are therefore generally trained with an upper cap on their context window to keep inference fast; a very long window filled with a very long input will slow the model down.
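The cap described above can be sketched in a few lines. This is a deliberately crude illustration, assuming a rough 4-characters-per-token heuristic rather than any real tokenizer:

```python
# Illustrative sketch of how a fixed context window caps what a model can
# "see" at once. The 4-chars-per-token estimate is a rough rule of thumb,
# not Gemini's actual tokenizer.
def truncate_to_context(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Keep only as much text as fits in the model's context window."""
    return text[: max_tokens * chars_per_token]

article = "word " * 1000  # ~5,000 characters of input
short_ctx = truncate_to_context(article, max_tokens=500)   # small window
long_ctx = truncate_to_context(article, max_tokens=2000)   # larger window
print(len(short_ctx), len(long_ctx))  # the larger window keeps more text
```

Anything past the cutoff is simply invisible to the model, which is why a bigger window matters for long documents.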

Gemini 1.5 has a whopping context window of 1 million tokens.

OpenAI’s GPT-4 is capped at 128k tokens, Gemini 1.0 Pro at 32k, and Claude 2.1 offers 200k. Among coding-oriented LLMs, Code Llama was trained on 16k-token sequences but supports inference on inputs up to 100k tokens. Against these, 1 million is remarkable. Note that a token is not a single word; it’s a unit of text, often a word fragment. To put 1 million tokens in perspective, Gemini 1.5 can accept roughly 10-11 hours of audio, 1 hour of video, 30,000 lines of code, or 700,000 words in one go.
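A quick back-of-the-envelope conversion makes these sizes tangible. The ~0.7 words-per-token ratio below is simply what the article’s own numbers imply (1,000,000 tokens ≈ 700,000 words); real tokenizers vary by language and content:

```python
# Rough words<->tokens conversion using the ~0.7 words-per-token ratio
# implied by the figures above. Real tokenizers vary.
def tokens_to_words(tokens: int, words_per_token: float = 0.7) -> int:
    return round(tokens * words_per_token)

def words_to_tokens(words: int, words_per_token: float = 0.7) -> int:
    return round(words / words_per_token)

print(tokens_to_words(1_000_000))  # ~700,000 words fit in Gemini 1.5's window
print(words_to_tokens(100_000))    # a 100k-word novel is ~143k tokens
```

By this estimate, even a full-length novel fits comfortably, while GPT-4’s 128k window would only just hold one.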

But can the LLM actually work with such long inputs efficiently and swiftly? As it turns out, it’s scarily good at handling them. If the published charts are to be believed, it achieves near-perfect retrieval at up to 10 million tokens! That means the model can digest far more code, text, audio, and video and still give you reliable responses grounded in the input.

For comparison, the largest context window we’ve had so far was Claude 2.1’s 200k tokens, and even there, retrieval accuracy degrades badly over long contexts. Gemini 1.5 pairs a much longer window with 99% retrieval accuracy, which is why I called it insanely good.
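Retrieval numbers like these are typically measured with a “needle in a haystack” test: hide a fact at a random depth in a long filler context, then check whether the model can recall it. Here is a runnable sketch of that procedure; `toy_model` is a hypothetical stand-in that just searches its context, not a real LLM:

```python
# Sketch of a "needle in a haystack" retrieval evaluation. `toy_model` is a
# hypothetical stand-in for a real LLM call, so the sketch can run anywhere.
import random

random.seed(0)

def build_haystack(needle: str, filler_sentences: int, depth: float) -> str:
    """Insert the needle at a given relative depth in filler text."""
    filler = ["The sky was a pleasant shade of blue that day."] * filler_sentences
    filler.insert(int(depth * filler_sentences), needle)
    return " ".join(filler)

def retrieval_accuracy(ask_model, needle, answer, trials=10, filler=1000):
    hits = 0
    for _ in range(trials):
        context = build_haystack(needle, filler, depth=random.random())
        if answer in ask_model(context, "What is the magic number?"):
            hits += 1
    return hits / trials

def toy_model(context, question):
    # Trivial "model" that just scans its context string.
    return "42" if "magic number is 42" in context else "I don't know"

acc = retrieval_accuracy(toy_model, "The magic number is 42.", "42")
print(acc)
```

Real evaluations vary both the depth of the needle and the total context length, which is how charts like Google’s plot accuracy against token count.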

Now, the paper does say that retrieval accuracy was tested at a 10M-token length, while in practice you’re limited to 1M. Still, that is vastly superior to every other LLM we have so far, and it will surely prompt OpenAI to play catch-up with ChatGPT.

Google DeepMind’s VP of Research tweeted the following:

In any other LLM, retrieval accuracy deteriorates sharply as the token length or input size grows, whether the input is video, audio, text, or code. Seeing successful retrieval of this magnitude even at 2-10M tokens is outrageously good.

You can download Google’s report here. The announcement blog post also notes other improvements, such as better reasoning across all modalities, better problem-solving over longer code inputs, and the opening up of the Gemini infrastructure to developers and enterprises, who can run experiments with Gemini 1.5 at a 128k-token context length.

Notably, even once Gemini 1.5 replaces the standard model available to free users, they will be limited to a context window of 128k tokens (equal to today’s GPT-4).

By Abhimanyu

Unwrapping the fast-evolving AI popular culture.