A boatload of Yandex's source code, spanning all of its technology, was allegedly leaked by a disgruntled employee, and part of that leak was the source code for Yandex Search, Russia's largest search engine. As you can imagine, SEOs and others are diving in to see what they can learn from it.
I personally did not download the source code, so I did not go through it myself, but I wanted to share what people have found and posted on Twitter from their investigations of it.
Here's the alpha version of an explorer tool for the leaked #Yandex Search code.
— Rob Ousbey (@RobOusbey) January 28, 2023
It lets you browse through the ranking factors, view by tags, etc, and start to find connections.
Easy to add new features if there's anything you want to see! https://t.co/AjbYnrDl9P pic.twitter.com/pQ4scOkP6w
I downloaded the code, analyzed it and there is a lot of useful information for Google SEO as well. pic.twitter.com/RWrgnnlpj6
— Alex Buraks (@alex_buraks) January 27, 2023
Theoretically, what is the difference between algorithms used in Google and in Yandex?
— Alex Buraks (@alex_buraks) January 27, 2023
They are quite similar:
- there is RankBrain analogue - MatrixNet;
- they are using PageRank (almost the same as in Google);
- a lot of text algorithms are the same. pic.twitter.com/Djjl8Bmjwn
According to Statcounter Yandex is close to Yahoo and Bing by market share: pic.twitter.com/5GKIvKIvAo
— Alex Buraks (@alex_buraks) January 27, 2023
Main insights after analysing this list:
— Alex Buraks (@alex_buraks) January 27, 2023
#1 Age of links is a ranking factor. pic.twitter.com/U47uWvEq9w
#3 Numbers in URLs is bad for rankings pic.twitter.com/ECgwGeGUfb
— Alex Buraks (@alex_buraks) January 27, 2023
#5 Hard pessimization equal PR=0 pic.twitter.com/RRbhuJyZr1
— Alex Buraks (@alex_buraks) January 27, 2023
#7 Fun fact - there is a separate ranking factor for uplifting Wikipedia pic.twitter.com/799F8KFpkE
— Alex Buraks (@alex_buraks) January 27, 2023
#9 Document age and last update both are ranking factors. pic.twitter.com/ay1GTMVEtJ
— Alex Buraks (@alex_buraks) January 27, 2023
Right now I checked ~40% of the list, there are a lot more (about text relevancy, behavior factors, page rank, internal links, etc).
— Alex Buraks (@alex_buraks) January 27, 2023
Will continue this thread after some time.
The first thread got a lot of impressions (500k views for the moment, thanks for your retweets and likes!), so I decided to finalize. https://t.co/UQiQsnpWd2
— Alex Buraks (@alex_buraks) January 28, 2023
#2 Additionally: ranking factor for orphan pages.
— Alex Buraks (@alex_buraks) January 28, 2023
You can easily find them via Screaming Frog or other crawlers. pic.twitter.com/zIPwAelpD0
#4 Number of search queries of your site/url is a ranking factor.
— Alex Buraks (@alex_buraks) January 28, 2023
Obviously more = better. pic.twitter.com/xXQ6FMDghP
#6 If your url would be the last for search session (user will find what he needs) - it would impact rankings.
— Alex Buraks (@alex_buraks) January 28, 2023
There are strict factors for this and predictable factors as well. pic.twitter.com/Zx3sBZORCs
#8 Special ranking factors for short videos (tiktok, shorts, reels) pic.twitter.com/oKPzL09MID
— Alex Buraks (@alex_buraks) January 28, 2023
#10 Keywords in URL is a ranking factors.
— Alex Buraks (@alex_buraks) January 28, 2023
As we can see from the description - the optimal would be include up to 3 words from the search query. pic.twitter.com/Q1euKWSiST
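To make the "up to 3 words" idea concrete, here is a minimal sketch of how such a signal could be computed. It is not taken from the leaked code: the function name, the tokenization, and the cap parameter are all assumptions for illustration only.

```python
import re

def query_words_in_url(query: str, url: str, cap: int = 3) -> int:
    """Hypothetical helper: count distinct query words found in the URL.

    The cap of 3 mirrors the tweet's reading of the factor description;
    the tokenization (lowercase ASCII letters and digits) is an assumption.
    """
    url_tokens = set(re.findall(r"[a-z0-9]+", url.lower()))
    query_tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Words beyond the cap add nothing further in this sketch.
    return min(len(url_tokens & query_tokens), cap)


if __name__ == "__main__":
    print(query_words_in_url("red wine", "https://example.com/wine/red-blends"))
```

In this sketch a URL that repeats five query words scores the same as one that contains three, which is one way to read "optimal would be up to 3 words".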
#14 One more ranking factor for content quality - broken embedded video on the page.
— Alex Buraks (@alex_buraks) January 28, 2023
Embed videos - good for rankings.
Broken embed videos - bad. pic.twitter.com/2SUys65PHp
#16 If your backlinks anchors contain all words from the keywords - it's good for SEO.
— Alex Buraks (@alex_buraks) January 28, 2023
If it is in a one link - it's more beneficial. Especially if the order of words is the same. pic.twitter.com/WrbESJ8Da5
#18 The quality rank of texts on the domain is a ranking factor.
— Alex Buraks (@alex_buraks) January 28, 2023
Pages with low quality content affect the entire domain. pic.twitter.com/MJUCTVB9CH
#20 Funny, there is a random as a separate ranking factor.
— Alex Buraks (@alex_buraks) January 28, 2023
When you don't understand why some of page is on top - it could be just random (to test behavior factors). pic.twitter.com/TGtzFrmBOV
#22 Backlinks from the top 100 best websites by PageRank impacts on rankings.
— Alex Buraks (@alex_buraks) January 28, 2023
That's not news. pic.twitter.com/ikxldWLJqy
Wow, I just found the list with initial weights of Yandex ranking factors.
— Alex Buraks (@alex_buraks) January 28, 2023
Do you need one more thread? 😁
P.S. final weights calculated by AI (matrixnet), but initial values are useful as well. pic.twitter.com/WeroYQy7Yu
That said, I've been digging into the codebase myself to find things of interest.
— Mic King (@iPullRank) January 27, 2023
I'm doing this live, so I don't know how long it will take between tweets.
A lot of the code related to Yandex Search lives in the Kernel, ExtSearch, Search, and Robot archives, but again I won't be able to be comprehensive here until I've looked through everything.
— Mic King (@iPullRank) January 27, 2023
Some really interesting things in the web_meta_factors_info/factors_gen.in file as it relates to content features and factors.
— Mic King (@iPullRank) January 27, 2023
For instance, some things that we'd expect like a minimum expectation of the proximity of words in a title to the words in the query. pic.twitter.com/YRsrCpVsqU
Interestingly, there are a lot of scrapers in here Google News, Shopping, YouTube and even other Yandex services.
— Mic King (@iPullRank) January 27, 2023
Hmm...this might be the structure of how Yandex stores documents in their version of a doc server.
— Mic King (@iPullRank) January 27, 2023
Still looking for an idea of how they structure their inverted index. pic.twitter.com/1lwTbOirnx
Here's a protobuf of link factors. pic.twitter.com/1RM6o1xzRg
— Mic King (@iPullRank) January 27, 2023
In the "link prioritizer code" they talk about decreasing the priority of links with the same text from the same host. In other words, don't count the links from duplicate content. pic.twitter.com/dQTUnScCUy
— Mic King (@iPullRank) January 27, 2023
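As a rough illustration of what "decreasing the priority of links with the same text from the same host" could look like, here is a small sketch. It is not the leaked link prioritizer: the function name, the halving decay, and the case-insensitive text matching are all assumptions.

```python
from collections import defaultdict

def prioritize_links(links, decay=0.5):
    """Hypothetical sketch: lower the priority of repeated (host, anchor text) pairs.

    `links` is a list of (host, anchor_text) tuples. The first occurrence of a
    pair gets priority 1.0; each repeat is multiplied by `decay` again.
    """
    seen = defaultdict(int)
    prioritized = []
    for host, text in links:
        key = (host, text.strip().lower())  # assumed normalization
        prioritized.append((host, text, decay ** seen[key]))
        seen[key] += 1
    return prioritized


if __name__ == "__main__":
    sample = [("a.com", "Buy shoes"), ("a.com", "buy shoes"), ("b.com", "Buy shoes")]
    for row in prioritize_links(sample):
        print(row)
```

The same anchor text from a different host keeps full priority here, so only same-host duplicates are discounted.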
How did y'all come up with that number of ranking factors?
— Mic King (@iPullRank) January 28, 2023
I see 481 factors just related to "Rapid Clicks" pic.twitter.com/sw5A3ia3Bk
Similar to the Googs, Yandex has multiple ranking models to choose from.
— Mic King (@iPullRank) January 28, 2023
In this select_ranking_models.cpp file, they talk about having different models for different languages and locations. pic.twitter.com/m210tpOUDb
I'm gonna go watch TV, but I obviously have to add this to my book so I'm gonna add more over the next couple days
— Mic King (@iPullRank) January 28, 2023
Been digging into how this robot archive is structured.
— Mic King (@iPullRank) January 28, 2023
It looks like the Zora directory is where a lot of interesting things are happening. There's a limits.pb.txt file that stores the requests per second rate for the host and the IP address for 204k hosts. pic.twitter.com/0oulKm58dx
Here's where the Document and Query factors are collected and scored.
— Mic King (@iPullRank) January 29, 2023
Looks like it goes to storage after this tho. pic.twitter.com/qJAiLfSrsU
Ok, real quick, top 5 most positively and negatively weighted ranking factors and their coefficients in the initial weighting in Yandex's document relevance calculation. Negatives first
— Mic King (@iPullRank) January 29, 2023
#1 FI_ADV: -0.2509284637
This factor determines that there is advertising on the site.
#3 FI_QURL_STAT_POWER: -0.1943768768
— Mic King (@iPullRank) January 29, 2023
Factor is the number of URL impressions for the request
#5 FI_GEO_CITY_URL_REGION_COUNTRY: -0.168645758
— Mic King (@iPullRank) January 29, 2023
Factor is the geographical coincidence of the document and the country that the user searched from.
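To show how negative initial coefficients like these would pull a score down, here is a minimal sketch that combines the three coefficients quoted above as a plain weighted sum. The linear form, the function name, and the assumption that each factor value lies between 0 and 1 are illustrative only. As an earlier tweet notes, the final weights are calculated by MatrixNet, so this is not how Yandex actually scores documents.

```python
# Coefficients copied from the tweets above; everything else is hypothetical.
INITIAL_WEIGHTS = {
    "FI_ADV": -0.2509284637,                          # advertising on the site
    "FI_QURL_STAT_POWER": -0.1943768768,              # URL impressions for the request
    "FI_GEO_CITY_URL_REGION_COUNTRY": -0.168645758,   # document/country geo match
}

def initial_score(factors: dict) -> float:
    """Weighted sum over the known factors; missing factors count as 0.0.

    Assumes each factor value is already normalized to the range 0..1.
    """
    return sum(weight * factors.get(name, 0.0)
               for name, weight in INITIAL_WEIGHTS.items())


if __name__ == "__main__":
    print(initial_score({"FI_ADV": 1.0}))
    print(initial_score({"FI_ADV": 1.0, "FI_QURL_STAT_POWER": 0.5}))
```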
Ok, now for the top 5 positively weighted factors.
Here is a starting point for link related factors. https://t.co/fwP8TxuOrM
— Christoph C. Cemper 🇺🇦 🧡 SEO (@cemper) January 30, 2023
Will this help you do SEO on Google? Probably not, but hey, it is super interesting.
Ah, but once they find the optimal word count ...
— John Mueller is watching out for Google+ 🐀 (@JohnMu) January 29, 2023
BOOM
Forum discussion at WebmasterWorld.