Google's John Mueller addressed the topic of page pruning again, this time explaining that Google can only judge a site's overall quality based on the pages from that site that are in its index. So if you keep sections of your site out of the index with noindex, nofollow, Google won't consider those pages when assessing overall site quality.
John said this at the 20:05 mark of a video hangout from earlier this month:
Question:
About 15% of our crawlable pages have a noindex nofollow tag to avoid duplicate content and other low quality pages from indexing. Could this affect the overall site quality or does Google only consider the indexed pages when evaluating the quality of the site?
Answer:
Yes. We only look at the indexed pages when it comes to understanding the quality of a website.
But John goes on to explain that for duplicate content, it probably makes more sense not to noindex the pages but to use rel=canonical to consolidate their signals on a single page, rather than removing the page from Google's index entirely. John added:
In general though, so one thing, maybe just taking a small step back here: you mentioned you're using this noindex as well for duplicate content. In general, I'd recommend using a rel canonical for duplicate content rather than a noindex. With noindex you're telling us this page should not be indexed at all. With a canonical you're telling us this page is actually the same as this other page, yeah. And that helps us because then we can take all of the signals that we have for both of these pages and combine them into one. Whereas if you just have a noindex or if you block it with robots.txt, then all of the signals that are associated with that page, that's blocked or it has a noindex on it, are essentially lost. So if someone were to link to that page and you have it set to noindex, like, well, they're linking to nowhere. Whereas if you had a rel canonical, we would see that link going to their page to follow the rel canonical to the page you prefer to have indexed and use that one for indexing.
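To make the distinction concrete, here is a minimal sketch in Python of how a site template might choose which tag to emit in a page's head. The function name head_tag_for, its parameters, and the example.com URLs are hypothetical and are not from John's answer. Using noindex for low quality pages is only one option, since the other choices are to improve or remove them.

```python
from html import escape


def head_tag_for(page_url, duplicate_of=None, low_quality=False):
    """Return the <head> tag a template could emit for one page.

    Hypothetical helper for illustration only; the parameter names
    are assumptions, not part of any Google API.
    """
    if duplicate_of is not None:
        # Duplicate content: point at the preferred URL so that signals
        # such as links can be combined on that page instead of lost.
        return '<link rel="canonical" href="%s">' % escape(duplicate_of, quote=True)
    if low_quality:
        # Low quality page kept on the site: ask that it not be indexed.
        return '<meta name="robots" content="noindex">'
    # Ordinary page: a self-referencing canonical.
    return '<link rel="canonical" href="%s">' % escape(page_url, quote=True)


# A printer-friendly duplicate points at the main article.
print(head_tag_for("https://example.com/post/print",
                   duplicate_of="https://example.com/post"))
```

The duplicate check comes first on purpose: a page that is both thin and a duplicate still gets a canonical, which matches the advice above to prefer rel canonical over noindex for duplicate content.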
Glenn Gabe has an awesome tweet and GIF summing it all up:
Via @johnmu: Google only looks at indexed pages when evaluating quality for a site. You can nuke or improve low quality content, but for duplicate content, use rel canonical instead of noindex. Then Google can pass all signals to the canonical page: https://t.co/OZ2cWebV3J pic.twitter.com/idwpda74Bp
— Glenn Gabe (@glenngabe) April 17, 2018
Forum discussion at Twitter.