Jeff Dean: Combining Google Search With LLM In-Context Learning

Feb 17, 2025 - 7:51 am


Dwarkesh Patel interviewed Jeff Dean and Noam Shazeer of Google, and one topic he asked about was what it would be like to merge or combine Google Search with in-context learning. It resulted in a fascinating answer from Jeff Dean.

Before you watch, here is a definition you might need:

In-context learning, often referred to as few-shot learning or few-shot prompting, is a technique where an LLM is given examples or instructions within the input prompt to guide its response. This method leverages the model's ability to understand and adapt to patterns presented in the immediate context of the query.
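
To make that definition concrete, here is a minimal sketch of a few-shot prompt in Python. The call_llm function is a hypothetical placeholder for whatever LLM client you actually use; the point is only that the examples live entirely inside the prompt, with no retraining of the model.

```python
# A minimal sketch of in-context (few-shot) learning: the "learning" happens
# entirely inside the prompt, with no weight updates. call_llm is a
# hypothetical stand-in for whatever LLM client you actually use.

FEW_SHOT_PROMPT = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and it just works."
Sentiment:"""


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client."""
    raise NotImplementedError("wire this up to a real LLM API")


if __name__ == "__main__":
    # The model infers the review -> label pattern from the two examples
    # given in context and completes the third one.
    print(FEW_SHOT_PROMPT)
    # print(call_llm(FEW_SHOT_PROMPT))  # uncomment once call_llm is wired up
```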

The context window (or “context length”) of a large language model (LLM) is the amount of text, in tokens, that the model can consider or “remember” at any one time. A larger context window enables an AI model to process longer inputs and incorporate a greater amount of information into each output.
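
As a rough illustration of what that budget means in practice (not how any production system is implemented), the sketch below treats the context window as a fixed token limit and keeps only the most recent tokens. The whitespace tokenizer and the 8-token window are simplifications; real models use subword tokenizers and windows of thousands to millions of tokens.

```python
# Illustrative only: a context window is a hard token budget. Real models use
# subword tokenizers (BPE/SentencePiece) and far larger windows; the
# whitespace split and 8-token limit here just make the truncation visible.

CONTEXT_WINDOW = 8  # illustrative; real windows are far larger


def tokenize(text: str) -> list[str]:
    return text.split()  # crude stand-in for a real subword tokenizer


def fit_to_window(text: str, window: int = CONTEXT_WINDOW) -> list[str]:
    tokens = tokenize(text)
    if len(tokens) <= window:
        return tokens
    # One common strategy: keep the most recent tokens, drop the oldest ones.
    return tokens[-window:]


doc = "The quick brown fox jumps over the lazy dog near the old river bank"
print(fit_to_window(doc))  # only the last 8 tokens can influence the output
```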

This question and answer starts at the 32-minute mark in this video:

Here is the transcript if you do not want to watch the video:

Question:

I know one thing you're working on right now is longer context. If you think of Google Search, it's got the entire index of the internet in its context, but it's a very shallow search. And then obviously language models have limited context right now, but they can really think. It's like dark magic, in-context learning. It can really think about what it’s seeing. How do you think about what it would be like to merge something like Google Search and something like in-context learning?

Jeff Dean's Answer:

Yeah, I'll take a first stab at it because I've thought about this for a bit. One of the things you see with these models is they're quite good, but they do hallucinate and have factuality issues sometimes. Part of that is you've trained on, say, tens of trillions of tokens, and you've stirred all that together in your tens or hundreds of billions of parameters. But it's all a bit squishy because you've churned all these tokens together. The model has a reasonably clear view of that data, but it sometimes gets confused and will give the wrong date for something. Whereas information in the context window, in the input of the model, is really sharp and clear because we have this really nice attention mechanism in transformers. The model can pay attention to things, and it knows the exact text or the exact frames of the video or audio or whatever that it's processing. Right now, we have models that can deal with millions of tokens of context, which is quite a lot. It's hundreds of pages of PDF, or 50 research papers, or hours of video, or tens of hours of audio, or some combination of those things, which is pretty cool. But it would be really nice if the model could attend to trillions of tokens.

Could it attend to the entire internet and find the right stuff for you? Could it attend to all your personal information for you? I would love a model that has access to all my emails, all my documents, and all my photos. When I ask it to do something, it can sort of make use of that, with my permission, to help solve what it is I'm wanting it to do.

But that's going to be a big computational challenge because the naive attention algorithm is quadratic. You can barely make it work on a fair bit of hardware for millions of tokens, but there's no hope of making that just naively go to trillions of tokens. So, we need a whole bunch of interesting algorithmic approximations to what you would really want: a way for the model to attend conceptually to lots and lots more tokens, trillions of tokens. Maybe we can put all of the Google code base in context for every Google developer, all the world's source code in context for any open-source developer. That would be amazing. It would be incredible.
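
To see why Dean calls the naive attention algorithm quadratic, here is a toy single-head attention in NumPy. The sizes are arbitrary and this is only an illustration of the scaling argument, not anything Google runs in production; the closing comment works out why the n-by-n score matrix becomes hopeless as the context grows toward trillions of tokens.

```python
# Toy single-head attention in NumPy, to make the quadratic cost explicit:
# every query attends to every key, so the score matrix is n x n.
import numpy as np


def naive_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # shape (n, n): this is the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # shape (n, d)


n, d = 1_000, 64  # 1,000 tokens already means 1,000,000 score entries
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out = naive_attention(Q, K, V)

# The score matrix alone needs n^2 floats: roughly 4 MB at n = 1,000,
# about 4 TB at n = 1,000,000, and around 4e24 bytes at n = 1 trillion,
# which is why attending to trillions of tokens needs algorithmic
# approximations rather than the naive algorithm.
print(out.shape, "score entries:", n * n)
```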

Here is where I found this:

Forum discussion at X.

 
