Google released the next Search Off the Record podcast episode, which was actually recorded at least two months ago. In it, Gary Illyes from Google broke down what the Google Caffeine index and system actually does.
If you remember, a problem with Caffeine was one of the reasons something broke in Google Search a little while ago.
Here is the recording; this part of the conversation starts about nine minutes in:
Here is what Gary said:
We have Caffeine. That's our indexing system. Only externally it's called Caffeine. Internally, it has some other name. But that doesn't really matter. And it does many things. And I think that's not very clear externally that it does many things. For people, it's just like we have the crawler, which is Googlebot, and then that goes to something something Google magic. Well, people know that it gets rendered, and then something something Google magic, and then we have an index. We can't actually break down that Google magic, and people in general know that Google magic, or could figure it out if they wanted to, but that Google magic is essentially what Caffeine is doing. Basically, ingesting, picking up whatever is produced by Googlebot, which is a protocol buffer-- you can look it up on your favorite search engine what a protocol buffer is. And then that protocol buffer is picked up by Caffeine, and then we collect signals, blah, blah, blah, and then we add the information that Caffeine produced into our index.
What's happening inside Caffeine? Well, the very first step is that protocol buffer ingestion. Basically, it picks up the protocol buffer and starts processing it. The very first step after the ingestion is conversion.
Martin then stops Gary to ask what the conversion part means. Gary explains that it does convert the protocol buffer into a different format, but it also has to normalize the HTML.
But we still try to make sense of it. If you have really broken HTML, then that's kind of hard. So we push all the HTML through an HTML lexer. Again, search for the name. You can figure out what that is. But, basically, we normalize the HTML. And then, it's much easier to process it. And then, there comes the hotstepper: h1, h2, h3, h4. I know. All these header tags are also normalized through rendering. We try to understand the styling that was applied on the h tags, so we can determine the relative importance of the h tags compared to each other. Let's see, what else we do there?
Do we also convert things, like PDFs or... Oh, yeah. Google Search can index many formats, not just text HTML, we can index PDFs, we can index spreadsheets, we can index Word document files, we can index... What else? Lotus files, for some reason.
Wait. Going back to PDF. PDF is a binary format. It's not that easy to process. So for that, as far as I remember, we license decoder from Adobe that we use to basically convert the PDF to HTML. And then from there on, we are just working with HTML. This happens with all the other binary formats that we can index in web search. Of course, those are also normalized. So the HTML, eventually, will be very well-formed.
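To make the "lexer, then normalize" idea concrete, here is a minimal sketch using Python's standard `html.parser` module. It is only an illustration of the general technique, not how Caffeine works: the `Normalizer` class and `normalize` function are hypothetical names, attributes are dropped, and script and style content is not handled.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Normalizer(HTMLParser):
    """Hypothetical helper: re-emits markup with lowercase tags
    and with every opened element explicitly closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # normalized output fragments
        self.stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases the tag name.
        self.out.append(f"<{tag}>")  # attributes dropped for brevity
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close anything left open inside the element being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def normalize(html: str) -> str:
    parser = Normalizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Close whatever is still open at the end of the input.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(normalize("<H1>Title<p>Text</DIV>"))
```

Broken input such as an uppercase tag, an unclosed paragraph, and a stray closing `div` comes out as consistently nested markup, which is the property that makes later processing steps simpler.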
We then start looking at meta tags because there are a few meta tags that we deeply care about. For example, the meta name="robots."
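As a rough illustration of what reading the robots meta tag involves, the sketch below pulls the directives out of a page with Python's standard `html.parser`. The `RobotsMetaParser` class and `robots_directives` function are made-up names for this example, and this says nothing about how Google actually implements it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: collects directives from <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(page)))
```

A page carrying `noindex` here would be a signal not to add it to the index, which is why this tag matters early in processing.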
when they happen, when they show up, in our processing pipelines. And that's what this error handling page thing does. Basically, we have very large corpus, well, actually, corpora, of error pages, and then we try to match text against those.
This can also lead to very funny bugs, I would say, where, for example, you are writing an article about error pages in general, and you can't, for your life, get it indexed. And that's sometimes because our error page handling systems misdetect your article, based on the keywords that you use, as a soft error page. And, basically, it prompts Caffeine to stop processing those pages.
And, of course, error page handling also works on other kinds of error pages, not just the 404s. For example, if the server sends "I'm overloaded" message HTML page but a 200 status code, then we might be able to understand that. We have redirects that are not so obvious, and we can detect those as well. What else?
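Here is a toy sketch of the matching idea Gary describes: compare page text against known error phrases when the server returned a 200. The phrase list, the threshold, and the `looks_like_soft_error` function are all invented for illustration; the real corpora and matching logic are not public.

```python
import re

# Hypothetical stand-in for a corpus of error-page phrases.
ERROR_PHRASES = [
    "page not found",
    "404 error",
    "server is overloaded",
    "no longer available",
]


def looks_like_soft_error(status_code: int, text: str, threshold: int = 2) -> bool:
    """Guess whether a page served with a 200 is really an error page."""
    if status_code != 200:
        return False  # a real error status needs no text-based guess
    # Lowercase and collapse whitespace so phrases match across line breaks.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(1 for phrase in ERROR_PHRASES if phrase in normalized)
    return hits >= threshold


error_page = "Sorry, page not found. This is a 404 error."
article = "How to design a helpful 404 error page: most sites just say page not found."

print(looks_like_soft_error(200, error_page))
print(looks_like_soft_error(200, article))
```

Both calls report a match, which mirrors the bug mentioned above: an article that is merely about error pages uses the same keywords as an actual error page, so a purely phrase-based detector cannot tell them apart.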
We also try to detect login pages here. I'm not sure why is that useful, but we know about login pages.
So as you can see, it does a lot, really, a lot.
It is definitely worth listening to. The whole section goes on for about 10 minutes.
Oh, and Gary might do some sort of recording of his Life of a Query talk, not just for internal use but for the public.
Forum discussion at Twitter.