What Causes Google To Deindex Pages Of A Website?

I'm trying to understand what causes Google to decide to no longer index a given page on a website even while it's still indexing many other pages on the same site.

I'm looking at some large sites in a niche I'm in and some of them had millions of pages indexed last year. Now they only have a few hundred thousand. Whatever they did I want to avoid doing in my own SEO efforts.

Does anyone have any insights?

I always thought Google just ranked pages way lower if the quality wasn't there - I didn't know they would simply choose not to index millions of pages of content.
 
One thing that leads to deindexing of pages in Google is copyright violations, but only if the copyright holder informs Google of the violation.

Another is heavy and careless use of spam; this depends a lot on the niche, though, at least in my experience.

There may be other factors that I am either not aware of or have not thought of, but the two mentioned above are probably the main ones.
 
I doubt any site has millions of pages of quality content.
 
At that many pages is where the fun really starts with SEO, and it's what got me so into it.
  1. Where are you seeing the indexed numbers? Google reports different numbers everywhere, especially GWT.
  2. This amount of pages means you need to evaluate what needs to be blocked with robots.txt vs. noindex (there are crawling differences; see the example after this list). Here's a good blog post on crawling rate optimization that focuses on robots.txt. You may have a high number of pages that don't need to be crawled. When I say 'don't need to be crawled' you will need to think like Google and how they group Similar Search Results (ex. check 'site:yourwebsite.com broad keyword'). Here's a screenshot of what I mean.
  3. Your dynamic/conditional internal linking has to be on point, though, and I would focus heavily on this area (most likely through the footer/sidebar). Siloing will be huge here.
  4. Are you seeing a drop in any other metrics besides indexed pages?
  5. Check to see if your inner pages pass Google's mobile tests; couldn't hurt to see if you're getting shit from that as well. This would affect crawling and is somewhat mentioned in the blog post I linked above.
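To illustrate the robots.txt vs. noindex difference from point 2 (the paths here are made up): robots.txt stops Google from crawling those URLs at all, while a noindex tag lets Google crawl the page but tells it not to keep it in the index. A blocked URL can still sit in the index if other pages link to it, and a noindex page has to stay crawlable or Google never sees the tag.

# robots.txt -- stops crawling of these paths entirely
User-agent: *
Disallow: /search/
Disallow: /print/

<!-- on-page noindex -- page stays crawlable, but gets dropped from the index -->
<meta name="robots" content="noindex, follow">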
Google overall is showing a lot fewer indexed pages across all of their tools/search. It's never really a metric I take a very close look at beyond watching for drops/issues. Crawl rate, like I mentioned above, is something I take a closer look at, but not actively.
 
Stepping back and looking at it from both sides of the field, blackhat and whitehat, I noticed one thing in common with ALL pages that were de-indexed: Google wasn't sending any traffic to them.

Now stop and really think about it for a moment, think like them, and you'll realize: if we've got pages in our index that aren't useful, why keep indexing them? Just drop them. That's why I try to advise anyone that's working on a PBN model, microsite model, or whatever to always make sure you've got Google visitor traffic going to the pages you want indexed, meaning they rank; otherwise Google literally has no use for them.
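A quick-and-dirty way to check which pages are actually getting Google visitors is the referrer field in your access logs. This is just a sketch that assumes a combined-format log at a made-up path; the field layout varies by server setup.

import re
from collections import Counter

# pulls the request path and referrer out of a combined-format log line
line_re = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "(?P<ref>[^"]*)"')

google_hits = Counter()
with open("access.log") as log:                      # hypothetical log file name
    for line in log:
        m = line_re.search(line)
        if m and "google." in m.group("ref"):        # visitor arrived from a Google property
            google_hits[m.group("path")] += 1

# pages you want indexed that never show up in this counter are the ones at risk
for path, count in google_hits.most_common(25):
    print(count, path)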

As well, Google is not reporting 100% of their index to you. It's not even possible once you realize how their database works. No single query queries EVERY SINGLE database - ever - it's not possible, so it queries the freshest and probably the nearest one to you. So if you do a "site:example.com" search you might see X amount of pages, while someone halfway around the world will see a completely different number.

Big data is not linear at Google's level; they have to keep data in different formats, datacenters, and tables, and some of it has timed delays and expirations (see Google's BigTable data storage system). Google doesn't use a flat key => value store or a traditional relational structure for this; BigTable is a multi-dimensional map (3D mapping) where each value is addressed by something like (row, column, timestamp). In theory that allows for backwards querying, but it also allows versioning for comparison.
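Here's a toy version of that (row, column, timestamp) idea in Python, just to make the versioning and expiration point concrete. It's obviously nothing like Google's actual implementation, and the class and parameter names are made up.

import time

class VersionedTable:
    # tiny (row, column, timestamp) -> value map with capped versions and an optional TTL
    def __init__(self, max_versions=3, ttl_seconds=None):
        self.cells = {}                  # (row, column) -> [(timestamp, value), ...], newest first
        self.max_versions = max_versions
        self.ttl = ttl_seconds

    def put(self, row, column, value, ts=None):
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts if ts is not None else time.time(), value))
        versions.sort(key=lambda v: v[0], reverse=True)   # newest version first
        del versions[self.max_versions:]                  # only keep the last few versions

    def get(self, row, column, at=None):
        now = time.time()
        for ts, value in self.cells.get((row, column), []):
            if self.ttl is not None and now - ts > self.ttl:
                continue                                  # expired cell: effectively garbage
            if at is None or ts <= at:
                return value                              # "backwards query": newest value at or before 'at'
        return None

# t = VersionedTable(ttl_seconds=86400)
# t.put("com.example.www/some-page", "contents:html", "<html>...</html>")
# t.get("com.example.www/some-page", "contents:html")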

Now in this kind of system it would only make sense to put expirations on useless data, and the #1 factor in determining useless data for a search engine would be whether the search engine is sending traffic to that particular page - if not, why waste energy and time versioning that low-level page? Answer: you wouldn't.

Just think about it from a common-sense standpoint: Google would never de-index a page that it is sending traffic to (unless it's doing something against their TOS). So if pages are getting de-indexed, they're simply junk pages that Google has determined are useless by all measures.
 
Hey, thanks for the replies - lots to look into and think about here.

First, to clarify: it's not my site that has had any de-indexing problems yet. We have very few total pages and haven't embarked on the SEO strategy the competition has just yet.

The millions of pages will make a little more sense if I share the niche: it's records tied to someone's name. One of the sites that has managed to keep the page count high is whitepages.com; site:whitepages.com shows 12 million+ results.

The site:domain search is how I'm checking how many pages are indexed, and I figured this was a crude but somewhat correlated method.

I know from talking to one of the competitors that the millions of pages they got de-indexed last year must have been getting them some traffic, because it cut their organic traffic overall a good deal. But these are definitely pages that rank for very infrequent, specific name searches.

Going to try to dig more into exactly what pages are indexed and which aren't on some of the competition to see if I can collect more data.

I figured it was mostly Google thinking the content is too spammy/duplicate, because I know a number of these guys are just scraping each other or creating the pages from the same DB. So if 20 sites have the same "John Taylor Smith" page with the same content on it (address, phone, age, etc.), would it make sense that Google de-indexes some of them?
 
You're not going to want to rely on the advanced search operator 'site:' to determine your indexed page count, for reasons like the screenshot in my previous post. I would rather follow GWT even though there are issues with that too. Another option is getting all your URLs and then checking them with a tool like URL Profiler, which can tell you whether each one is indexed or not -- that would be the most accurate.
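If you go that route, step one is just building the full URL list to feed the checker. Here's a quick sketch for pulling every URL out of a standard sitemap or sitemap index (the domain is a placeholder, and it only recurses one level into child sitemaps):

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_locs(url):
    # returns every <loc> entry from a sitemap or sitemap-index file
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

def all_urls(sitemap_url):
    urls = []
    for loc in fetch_locs(sitemap_url):
        if loc.endswith(".xml"):             # crude check: entry points at a child sitemap
            urls.extend(fetch_locs(loc))
        else:
            urls.append(loc)
    return urls

# urls = all_urls("https://example.com/sitemap.xml")   # placeholder domain
# then hand 'urls' to whatever index checker you use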

And to clear something up, it depends a lot on the site whether reducing pages will affect organic traffic. Oftentimes, if I'm taking over a site, one of the first things I do is look at what to remove/consolidate. A lot of pointless pages will result in less crawling and hurt site structure.

For what you're saying about the same content across sites, yeah, that would make sense. We saw this push already with web directories and the same-content issues over there. It's not penalizing you, though; think of one of the biggest search areas online: porn. Working in that area myself, there are a lot of creative ways to keep crawling up so you don't run into those issues, because so many people are just reposting the same pics, vids, and gifs.
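For a rough sense of how "same content" gets spotted, here's a minimal shingle-overlap check between two record pages. The record text is made up, and real duplicate detection is far more involved; this is just to show the general idea of why identical name/address/phone pages end up grouped together.

import re

def shingles(text, n=4):
    # break the page text into overlapping n-word chunks
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

page_a = "John Taylor Smith age 42 123 Main St Springfield phone 555 010 0000"
page_b = "John Taylor Smith 42 years old 123 Main St Springfield 555 010 0000"
print(round(jaccard(page_a, page_b), 2))   # the higher the overlap, the more likely the pages get grouped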
 