In Google Search Console, have you ever seen the "Discovered - currently not indexed" notice in your Index Coverage status report? Google's help document says it might be related to an overloaded server, but Google's John Mueller says otherwise in this video hangout.
At the 11:17 mark, John Mueller from Google answered what it means when a large portion of your URLs are labeled "Discovered - currently not indexed." His answer goes beyond what the help document says, which is:
Discovered - currently not indexed: The page was found by Google, but not crawled yet. Typically, Google tried to crawl the URL but the site was overloaded; therefore Google had to reschedule the crawl. This is why the last crawl date is empty on the report.
John said it can be caused by accidentally auto-generating too many URLs, by a poor internal linking structure, or it can be a sign that you should reduce the number of pages to make the overall site stronger. In most cases I've seen, the fix was reducing the number of pages to make the site stronger overall.
Here is the transcript; it goes on pretty long:
In Search Console 99% of our pages are excluded with "discovered but currently not indexed." It's been like this for a number of years even though we have links from some high profile newspapers and websites. What could be causing this and what could I do to get these pages indexed? Could it be a problem that Google isn't able to crawl the high number of pages that we have?

So in general this sounds a bit like something where we're seeing a lot of pages, and our systems are just not that interested in indexing all of these, where they think maybe it's not worthwhile to actually go through and crawl and index all of these. So especially if you're seeing discovered but currently not indexed, that means we know about that page, that could be through a sitemap file, it could be through internal linking. But our systems have decided it's not worth the effort, at least at the moment, for us to crawl and index this particular page.
And especially when you're looking at a website with a large number of pages, that might be a matter of something as simple as internal linking not being that fantastic. It could also be a matter of the content on your website maybe not being seen as absolutely critical for our search results. So if you're auto-generating content, if you're taking content from a database and just putting it all online, then that might be something where we look at that and say, well, there's a lot of content here but the pages are very similar, or they're very similar to other things that we already have indexed, it's probably not worthwhile to kind of jump in and pick all of these pages up and put them into our search results.
So what I generally recommend doing there, first of all if you're really seeing 99 percent of those pages not being indexed, is to perhaps look at some of the technical things as well. In particular, make sure you're not accidentally generating URLs with kind of differing URL patterns, where it's not a matter of us not indexing your content pages but just getting kind of lost in this, I don't know, jungle of URLs that all look very similar but are subtly different. So things like the parameters that you have in your URL, upper and lower case, all of these things can lead to essentially duplicate content. And if we've discovered a lot of these duplicate URLs, we might think, well, we don't actually need to crawl all of these duplicates because we have some variation of this page already in there. So kind of the technical thing is the first one I would look at there.

And then the next step I would do here is make sure that from the internal linking everything is actually OK, that we could crawl through all of these pages on your website and kind of make it through to the end. You can roughly test this by using a crawler tool, something like Screaming Frog or DeepCrawl. There are a bunch of these tools out there now and for the most part I think they do a really great job. And they will tell you essentially if they're able to crawl through your website and kind of show you the URLs that were found during that crawling.

And if that crawling works, then I would strongly focus on the quality of these pages. So if you're talking about 20 million pages and 99% of them are not being indexed, then we're only indexing a really small part of your website. That means perhaps it makes sense to say, well, what if I reduce the number of pages by, I don't know, half, or maybe even reduce the number of pages I have to 10% of the current count. By doing that you can make those pages that you do keep a lot stronger. You can generally make the quality of the content there a little bit better by having more comprehensive content on these pages. And for our systems it's a bit easier to look at these pages and say, well, these pages that we have now, which might be 1 million pages of your website for example, actually look pretty good, we should go off and crawl and index a lot more.
So those are kind of the three directions I would take there. First, make sure that you're not accidentally generating too many URLs. Make sure that the internal linking is working well. And try to reduce the number of pages and kind of combine the content to make it much stronger.
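To make John's first suggestion a bit more concrete, here is a rough Python sketch of that check. This is not an official Google tool; it just assumes you can export a list of the URLs you know about (from your sitemaps, server logs or a crawler export) and it groups together URLs that only differ by case, trailing slashes or common tracking parameters, the kind of accidental variations John calls a "jungle of URLs." The ignored parameter names and the sample URLs are made up for illustration, so adjust them for your own site.

```python
# Minimal sketch: find sets of known URLs that collapse to the same normalized
# form once case differences and common tracking parameters are ignored.
# Large groups usually point to accidental URL variations that look like
# duplicate content to a crawler.
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters assumed to be irrelevant for content; adjust for your own site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

def normalize(url: str) -> str:
    """Lowercase the host and path, drop ignored parameters, sort the rest."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower().rstrip("/") or "/",
        urlencode(sorted(query)),
        "",  # drop fragments, which crawlers ignore anyway
    ))

def duplicate_groups(urls):
    """Group raw URLs by normalized form; return only groups with duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}

if __name__ == "__main__":
    sample = [
        "https://example.com/Widgets/Blue-Widget?utm_source=newsletter",
        "https://example.com/widgets/blue-widget",
        "https://example.com/widgets/blue-widget/?ref=home",
        "https://example.com/widgets/red-widget",
    ]
    for canonical, members in duplicate_groups(sample).items():
        print(f"{len(members)} URLs collapse to {canonical}")
        for url in members:
            print(f"  {url}")
```

If a handful of normalized URLs account for a large share of everything you know about, that is usually a sign to clean up internal links and canonical tags before worrying about content quality.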
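John's second check, making sure a crawler can actually reach your pages through internal links, is best done with a tool like Screaming Frog or DeepCrawl, as he says. But if you want to see the basic idea, here is a tiny sketch of what those tools do: start at the homepage, follow internal links breadth-first, and report which URLs were reachable. The start URL, the page limit and the requests/BeautifulSoup dependencies are my own choices for illustration, not anything John recommended.

```python
# Tiny illustration (not a replacement for Screaming Frog or DeepCrawl):
# follow same-host <a href> links from a start URL and report what is reachable.
from collections import deque
from urllib.parse import urljoin, urlsplit

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 200) -> set[str]:
    """Return the set of same-host URLs reachable from start_url via internal links."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that error out; a real crawler would log these
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue
        for tag in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, tag["href"]).split("#")[0]
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

if __name__ == "__main__":
    reachable = crawl("https://example.com/")  # placeholder start URL
    print(f"Found {len(reachable)} internally linked URLs")
```

If pages you expect to be indexed never show up in a crawl like this, internal linking is the first thing to fix, before you start pruning or combining pages.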
Here is the video embed:
So if you have ever run into this, how did you solve the problem?
Forum discussion at Google+.