Google's flexible sampling solution, which replaced first-click-free for gated, subscription, or paywalled content, launched in 2017. Since then, many publishers have used the paywall structured data to tell Google about the full content that sits behind the content gate. Some are calling this solution "leaky," and Google has responded saying it is not.
Ryan Singel, a journalist covering tech business, tech policy, civil liberties, and privacy issues, who has written for Wired and many other respected publications, posted a comment on this site calling this Google solution "leaky." He said:
Google Search and Google News are stuck in the past when it comes to these. It's crawler assumes that paywalled or reg walled content is still going to be in the HTML that Google crawler will see. In other words, it demands leaky bad tech from sites with paywalled or registration required content. It'd be great if it fixed that instead of sending Danny Sullivan out to lecture sites about their markup with directions that don't work for a smart, modern, non-leaky publishing system.
Danny Sullivan, Google's Search Liaison, then responded to that comment on this blog, on X, and on Mastodon, saying it is not leaky. Here is Danny's response from this blog:
Our system is looking to be shown the full content, if a publisher wants to do that. If they do, we understand more about it. If we understand more, then we might be able to show it for more queries where it's relevant. This doesn't involve using JS to somehow "hide" the content from people who aren't our crawler or anything like that.
Basically, you see our crawler, you show us the full content. And only us. And if you're worried that someone is pretending to be us, then you check our publicly shared IP addresses.
Next, you markup the page so we know what's paywalled / gated content so that we -- and only we are seeing this full content -- also know you aren't trying to cloak us by targeting our crawler specifically. Since only we are seeing this, there's nothing "leaky" as you are suggesting. Here's the doc.
Where the "leaky" stuff tends to come in is someone might search with us, then click on the cached copy of a page to see the full thing we saw. And if that's a concern, our guidance is to block the cached copy -- covered in the docs.
I hope that helps explain this more. If I'm missing something, or you have other suggestions, honestly very happy to hear them. I found Outpost and emailed both the info and press addresses, so look for that, happy to continue the conversation.
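To make the markup step in Sullivan's comment more concrete, here is a minimal sketch in Python that builds the tags a publisher might place in the page head: a robots `noarchive` tag to block the cached copy, plus paywall structured data using schema.org's `isAccessibleForFree`, `hasPart` and `cssSelector` properties. The function name `paywall_head_snippet`, the sample headline, and the `.paywall` class are assumptions made for illustration, not something from Google or from this discussion.

```python
import json


def paywall_head_snippet(headline: str, gated_selector: str) -> str:
    """Build <head> markup for a gated article (illustrative sketch only)."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        # The article as a whole is not free to read.
        "isAccessibleForFree": False,
        # Points at the page element that wraps the gated section.
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            "cssSelector": gated_selector,  # hypothetical class name
        },
    }
    # Escape "</" so a headline cannot close the script tag early.
    json_ld = json.dumps(data, indent=2).replace("</", "<\\/")
    return (
        '<meta name="robots" content="noarchive">\n'
        '<script type="application/ld+json">\n'
        f"{json_ld}\n"
        "</script>"
    )


snippet = paywall_head_snippet("Example headline", ".paywall")
print(snippet)
```

The selector should match whatever element actually wraps the gated text on the page, and the article type and other properties would need to fit the real content.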
Sullivan also posted on X, saying:
I mentioned paywall and gated content in my tweet not as some type of lecture but guidance because it's something any publisher doing gated content might want to understand.
Gated content isn't something that our crawler can see, unless publishers let us in. If they do, we can better understand the full content they have. In turn, that might help us surface their content for relevant queries.
There's nothing "leaky" about this. That seems to be a suggestion that if someone lets us in, anyone can get in. That's not the case. We can be specifically allowed in. If someone is concerned that makes cached content available, they can also block us showing cached content.
This is all documented and hasn't changed for ages.
He seems to be involved in a company that provides registration systems, I think, to publications? Including the publication I was responding to? I'll reach out to his site to see if there are other suggestions on what we might do to help publishers with paywall / gated content issues. We're always open to that.
Some replied that a user can simply change their user agent to Googlebot. But technically, if you use the Googlebot IP verification method, you can block those attempts:
No offence,
but you're showing a lack of knowledge/understanding.
The current process "leaks".
How does Google can access to the full content?
Does it log in?
Does it supply special credential headers?
No.
All people have to do,
is set their UA to GoogleBot.
— Darth Autocrat (Lyndon NA) (@darth_na) January 20, 2024
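The IP verification idea mentioned above can be sketched roughly like this in Python: only treat a request as Googlebot when the user agent claims it and the connecting IP falls inside a known Googlebot range. The `SAMPLE_RANGES` values and the function names are assumptions for illustration. The sample data is assumed to follow the `prefixes` / `ipv4Prefix` / `ipv6Prefix` layout of Google's published crawler IP list, and a real site would load the current list from Google rather than hard-code anything.

```python
import ipaddress

# Illustrative sample only; load the current published ranges in practice.
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}


def load_networks(ranges: dict) -> list:
    """Turn the JSON-style prefix list into ip_network objects."""
    networks = []
    for entry in ranges.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified_googlebot(user_agent: str, remote_ip: str, networks: list) -> bool:
    """True only if the UA claims Googlebot AND the IP is in a known range."""
    if "googlebot" not in user_agent.lower():
        return False
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # malformed address: treat as unverified
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES)
spoofed = is_verified_googlebot("Googlebot/2.1", "203.0.113.5", networks)
genuine = is_verified_googlebot("Googlebot/2.1", "66.249.64.1", networks)
print(spoofed, genuine)
```

With a check like this, a visitor who only changes their user agent string still gets the gated version of the page, because their IP address does not match. It relies on `remote_ip` being the real connecting address, not a value taken from a header the client can set.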
And let's not forget that Google does not label content that is served through flexible sampling or that has a paywall requirement. I get complaints from my readers when I link to articles and do not mention there is a content gate on them. I mean, a label would be nice from Google, so at least you know before you click. But that is for a different story.
It used to be way easier to access gated content under the first-click-free program. It is much harder to do that now under flexible sampling. But technically, anything plugged into the internet can, in some way, be accessed. Some things are harder to access than others...