In the latest episode of the Search Off The Record podcast, Googlers John Mueller, Gary Illyes, Martin Splitt and this week's special guest Mariya Moeva spoke about Google's Site Kit and then about the serving index. Gary gave a summary of how Google's serving index works.
In short, Gary said that in an earlier episode he covered Google's indexing tiers and storage, and now he wanted to explain how the results in the index are served to searchers, i.e. the serving index. Gary said "the serving index is actually what is in our data centers and from where people get their search results on their screens."
The serving index, he said, is "essentially a lot of index shards that were pushed by Caffeine into our serving data centers." Gary explained that "Each of these data centers would get between 10 and 15 somewhere. Each of these data centers would get the index shards. Each of these index shards would contain the documents that we have indexed."
What is in these documents? "These documents are not the things that we grabbed from a URL. They are broken down into tokens. Basically, we tokenize them because we don't need all the fluff that comes with the HTML," Gary explained. "For example, script tags. Why would we want to index those tokens, those keywords, or key phrases from pages? We just don't need them. Certain HTML elements we do need because of reasons I will not say."
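To make the tokenization idea concrete, here is a minimal Python sketch of what Gary describes: dropping script content and keeping only the words from a page. It also records each word's position, which Gary mentions in the transcript below. This is not Google's code; the class name VisibleTextTokenizer, the function tokenize_html, the choice to also skip style tags and the simple word-splitting rule are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class VisibleTextTokenizer(HTMLParser):
    """Hypothetical tokenizer: keeps words and positions, skips script/style."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: the "fluff" we drop

    def __init__(self):
        super().__init__()
        self.tokens = []       # list of (position, token) pairs
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore text that lives inside script/style
        for word in re.findall(r"\w+", data.lower()):
            # Position is simply the running count of tokens seen so far.
            self.tokens.append((len(self.tokens), word))


def tokenize_html(html):
    """Return (position, token) pairs for the visible words in html."""
    parser = VisibleTextTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<p>Oatmeal cookies</p><script>var x = 1;</script><p>recipe</p>"
print(tokenize_html(page))
```

Real tokenization is far harder than a regular expression, especially in languages without spaces between words, which Gary also notes later in the talk.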
"Then these index shards are distributed among the data centers," Gary said. "Each data center will have a duplicate of the shards because that's how it should be, so each data center can serve relatively the same documents as the results, if needed."
He goes into a lot of detail on how the shards work. Here is the video embed where he talks about this in more detail. It starts at 14:29 into the talk:
Here is the transcript:
[00:14:26] Gary Illyes: Oh. Okay, then I will talk. One of the last episodes that we had, I was talking about indexing, and we were talking. We have different kinds of storages that we use based on how often we think that documents indexing those tiers would be served.
[00:14:45] But we haven't talked about the serving index, which is slightly less abstract than what we were talking about in a past episode. The serving index is actually what is in our data centers and from where people get their search results on their screens.
[00:15:05] I think it's not that much of an interesting topic. It's just I want to cover it before we actually move into serving because it feels that if I don't, then people might misunderstand things, which would never happen ever on the internet.
[00:15:23] The serving index, that's essentially a lot of index shards that were pushed by Caffeine into our serving data centers. I don't remember the exact number of data centers that we have for serving web search-- search in general-- but it's over ten.
[00:15:43] Each of these data centers would get between 10 and 15 somewhere. Each of these data centers would get the index shards. Each of these index shards would contain the documents that we have indexed.
[00:16:00] These documents are not the things that we grabbed from a URL. They are broken down into tokens. Basically, we tokenize them because we don't need all the fluff that comes with the HTML.
[00:16:16] For example, script tags. Why would we want to index those tokens, those keywords, or key phrases from pages? We just don't need them. Certain HTML elements we do need because of reasons I will not say.
[00:16:32] John Mueller: Emojis, right? We need them, too.
[00:16:34] Gary Illyes: Yeah, we need them. Those are very important indeed.
[00:16:38] We will keep certain HTML elements. We will keep the actual words that appear on the page and their positions on the page because that's also important, as we've said a number of times before.
[00:16:53] Then these index shards are distributed among the data centers. Each data center will have a duplicate of the shards because that's how it should be, so each data center can serve relatively the same documents as the results, if needed.
[00:17:09] Of course, that doesn't always happen. Sometimes, some shard might lag behind in a data center, then interesting things can happen. Like, you search for something, let's say, cookies, and then Martin also searches for cookies, and they get completely different results.
[00:17:27] That's sometimes because we are querying different data centers. Hence, the index shards are different between those data centers that we are querying.
[00:17:37] The index shards are-- I like to think of them as RAR part files, like a packaged part file. I keep bringing this up, but back in the '90s, for example, when we were installing Doom, Quake, or Age of Empires, for example, then we got these floppy disks. I remember that...
[00:17:58] Martin Splitt: Yes, Martin, floppy disk! Whoo-hoo!
[00:18:01] Gary Illyes: No, Martin, sit down.
[00:18:04] For example, Age of Empires came on 30-something floppy disks, Doom came on, I think, 12, then Diablo I that came on 50-something. You had to insert each floppy disk into your floppy drive, copy over the files that you found there, unite them, and then you would have the final executable that you would use to run your game.
[00:18:31] The index shards are not so dissimilar from that, conceptually. They are, essentially, a part of the index, altogether forming the entirety of the index.
[00:18:44] We have many index shards in many data centers. I don't know the number, but order of thousands, or tens of thousands, even. That poses a challenge. The challenge is that you have to find vast documents in those index shards.
[00:19:01] If you think about it, when you search for something, you get the results under one second. If you have to look in all index shards for every query, you are not going to deliver results in under one second because even the smallest index shards would be several megabytes big. Going through all the records that you have in a shard will take time.
[00:19:27] To help serving identifying the index shard that needs to be queried, we have something called "shard indexes," which identifies the shards for certain queries, which is basically a map between the keywords that we encountered, or token that we encountered on pages, mapped to the index shard's number or identifier.
[00:19:55] But that will not be enough to seek inside the index shard. For that, we need a new map, which is what we call "the posting list." That identifies the document ID that contains a certain keyword, for example.
[00:20:14] Like, if you search for "oatmeal cookies," for example, then the posting list would tell us that the word "oatmeal" appears in the documents 1, 2, 3, 4, 5, 6, 7, and "cookies" would appear in 5, 6, 7, 8, 9, 10. Then we would send the intersection of the two up to serving.
[00:20:43] This is oversimplified. There are other processes that take place, for example, the tokenization itself, which can be a challenge in certain languages. But, conceptually, this is how we build our serving index.
[00:20:57] John Mueller: So cool. So it's kind of like the index in the back of a book where you see the page number. Then on that page, with the posting list, you figure out, "Oh, it's line 17" or something like that.
[00:21:09] Gary Illyes: Yeah, that's literally what it is. If I remember correctly, that's where the idea came from, actually.
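The two lookups Gary describes, a shard index that maps tokens to shards and a posting list that maps tokens to document IDs inside a shard, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Google implements it: the function names build_shards, build_shard_index and search are hypothetical, documents are assigned to shards by document ID modulo the shard count purely for illustration, and the document IDs come from Gary's "oatmeal cookies" example.

```python
from collections import defaultdict


def build_shards(docs, num_shards):
    """Split docs (doc_id -> tokens) into shards of posting lists.

    Each shard maps token -> sorted list of doc IDs containing it.
    Assigning by doc_id % num_shards is an assumption for illustration.
    """
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(docs):
        shard = shards[doc_id % num_shards]
        for token in set(docs[doc_id]):
            shard[token].append(doc_id)
    return shards


def build_shard_index(shards):
    """Map each token to the set of shard numbers that contain it."""
    shard_index = defaultdict(set)
    for shard_id, shard in enumerate(shards):
        for token in shard:
            shard_index[token].add(shard_id)
    return shard_index


def search(query_tokens, shard_index, shards):
    """Return doc IDs that contain every query token."""
    if not query_tokens:
        return []
    # Step 1: use the shard index to skip shards missing any query token.
    candidates = set.intersection(
        *(shard_index.get(token, set()) for token in query_tokens)
    )
    # Step 2: intersect the posting lists inside each remaining shard.
    results = set()
    for shard_id in candidates:
        postings = [set(shards[shard_id][token]) for token in query_tokens]
        results |= set.intersection(*postings)
    return sorted(results)


# Gary's example: "oatmeal" is in documents 1-7, "cookies" in 5-10.
docs = {}
for doc_id in range(1, 11):
    tokens = []
    if doc_id <= 7:
        tokens.append("oatmeal")
    if doc_id >= 5:
        tokens.append("cookies")
    docs[doc_id] = tokens

shards = build_shards(docs, num_shards=3)
shard_index = build_shard_index(shards)
print(search(["oatmeal", "cookies"], shard_index, shards))  # [5, 6, 7]
```

The shard index narrows the search to shards worth opening, and the posting lists then do the per-document matching, which mirrors John's book analogy: the index at the back points to the page, and the page tells you the line.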
Forum discussion at Twitter.