Rand Fishkin, along with Mike King, may have published one of the biggest data leaks, outside of the Department of Justice revelations, around Google Search and its internal ranking features and signals. The document came from an anonymous source (no longer anonymous, see below) but was verified by Rand Fishkin, and it contains a ton of details on how Google Search reportedly works.
More importantly, it seems to contradict a number of the statements made over the past two decades by numerous Google Search employees, as I have covered here over the years.
I have not gone through it all yet, but I felt it was important for you all to read this yourselves. You can see the details in these two stories:
- An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them, SparkToro
- Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked, iPullRank
Rand wrote, "Many of their claims directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more."
Mike King wrote, "I have reviewed the API reference docs and contextualized them with some other previous Google leaks and the DOJ antitrust testimony. I’m combining that with the extensive patent and whitepaper research done for my upcoming book, The Science of SEO. While there is no detail about Google’s scoring functions in the documentation I’ve reviewed, there is a wealth of information about data stored for content, links, and user interactions. There are also varying degrees of descriptions (ranging from disappointingly sparse to surprisingly revealing) of the features being manipulated and stored. You’d be tempted to broadly call these “ranking factors,” but that would be imprecise."
Aleyda Solis has a quick summary on X where she summed up part of the leak:
- There are 14K ranking features and more in the docs
- Google has a feature they compute called “siteAuthority”
- Navboost has a specific module entirely focused on click signals representing users as voters and their clicks are stored as their votes
- Google stores which result has the longest click during the session
- Google has an attribute called hostAge that is used specifically “to sandbox fresh spam in serving time”
- One of the modules related to page quality scores features a site-level measure of views from Chrome
I have not had time to go through everything yet; I will do that over the next several days.
I have also not seen any Googler publicly comment on this yet - I know it is new and I don't know if we will see any Googler comment on this.
This reminds me a bit of the Yandex search ranking leak.
Update: Google has confirmed with me that the data leak is real but urged caution when making assumptions on how and if Google uses what is in these documents. Google told me:
"We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We've shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."
Here are some posts on social about this. Again, this has only been out for a few hours, and no one but Rand and Mike has had any real time to process this in detail.
A huge thanks to @iPullRank, whom I contacted on Friday after seeing the leak, and who helped analyze and decipher much of these early findings: https://t.co/JGYdGydKlC
— Rand Fishkin (follow @randderuiter on Threads) (@randfish) May 28, 2024
Ok, let's get this party started!
A couple weeks ago I said I was publishing the most important thing I ever wrote. I was wrong.
Documentation related to the Google Search algorithm leaked and I spent the weekend tearing it apart. https://t.co/v71B16Ggov
✌🏾
— Mic King (@iPullRank) May 28, 2024
🚨 Google Search’s Internal Engineering Documentation Has Leaked and analyzed by @iPullRank 👀 Many of these had been denied to be used by Google👇
* There are 14K ranking features and more in the docs
* Google has a feature they compute called “siteAuthority”
* Navboost has… pic.twitter.com/dlpCIQdpDm
— Aleyda Solis 🕊️ (@aleyda) May 28, 2024
Until it (possibly) gets taken down by Google's lawyers, here's a direct link to the leaked Google ranking API docs
"google_api_content_warehouse v0.4.0"
Save these pages! https://t.co/8RgmoF69z9 pic.twitter.com/9dXobbr2U1
— Cyrus SEO (@CyrusShepard) May 28, 2024
Extremely interesting blog post by @iPullRank.
Another one of the many he writes and we save for is usefulness ⬇️ https://t.co/VZH8EARV1G
— Gianluca Fiorelli (@gfiorelli1) May 28, 2024
Apparently someone at Google Search "accidentally" leaked an engineering document that reveals a ton of secrets about how the search engine works, including that they have a "Golden Document" flag which puts more weight on a document that is "Human labeled" which could mean some… pic.twitter.com/zeG79f161B
— Joe Youngblood (@YoungbloodJoe) May 28, 2024
If you want to geek out on this with me, I'll keep updating this Google Doc for the next ~30 minutes with anything interesting before getting back to normal life. https://t.co/1iQ40nknZ0
— Glen Allsopp 👾 (@ViperChill) May 28, 2024
#Google Search #Leak Reveals 14,000+ Ranking Factors... Including "Baby Panda Demotion"?!?
Looks like Panda got demoted... but to a BABY PANDA? Guess Google's going soft on low-quality sites these days pic.twitter.com/Ob2bndHnzH
— Shay Harel (@RangerShay) May 28, 2024
I don't think years of personal experience with seeing Google's algorithm respond completely opposite to what all the talking heads were saying is preconceived bias. They have been lying through their teeth since day one, and anyone with even basic SEO experience who was around…
— Greg Boser (@GregBoser) May 28, 2024
You find the commit here: https://t.co/4CqyJZXqZy
— Fili 🇪🇺 🇳🇱 (@filiwiese) May 28, 2024
I am looking forward to really digging in on this.
Update: I briefly went through those two stories and dug a bit into the actual API documentation and honestly, based on everything I've followed over the past 20+ years around Google Search, these really look legit. Some of the specifics in these docs I have heard described, both on and off the record, as real ranking features. Some are no longer used, from what I understand, and for some I do not know how they are used (i.e. directly for ranking or for after-the-fact ranking validation). It is worth digging through these docs in detail, in my opinion.
Update 2: The source of the leak has spoken out - Erfan Azimi emailed me this video:
Forum discussion at X.