Is this a sign of indexation bloat?

luxer

[Image: GSC Index Status graph showing the spike in indexed pages, alongside the rising "Blocked by robots" line]


As you can see, the number of indexed pages skyrocketed. I did not build more backlinks than normal (so it's not a link velocity issue) or make any changes to the website during this period.

This is an authority website, and traffic has not changed; it's within normal parameters for the time of year.

I am unsure what to make of this jump.

This is an ecommerce Shopify site with 5,000+ products, some of which have up to 5 variants.

Since I noticed this index jump, I have started making some changes to try to deal with it (if it is indeed indexation bloat).

To deal with what might be bloat: I had previously been keeping sold-out product pages live, sometimes recycling them with new, similar products. Many of these pages rank for long-term keywords and have a lot of links pointing at them.

I have started deleting these pages and redirecting them to related products that are still in stock.

Any feedback would be more than appreciated!
 
The only thing I can think of that would create that large an additional number of pages, blocked and not blocked, is that you've been hacked and posts are being created in bulk, with tags and categories that aren't visible on the site but are being found by Google through ping lists or something.

Have you done a site: search and checked out the indexation? Set the time period to the past month or week or whatever and find out what kind of new pages are being introduced.
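
If you'd rather not click through the Tools menu every time, you can set that date filter straight in the search URL. A rough example (example.com is a stand-in for your domain; tbs=qdr:w limits results to the past week, qdr:m to the past month):
Code:
https://www.google.com/search?q=site:example.com&tbs=qdr:w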
 
Do a site search on Google like Ryuzaki mentioned. I'd do Bing as well. You may have to get creative with your site search operators to filter stuff out and focus it. I usually try excluding entire subdirectories, subdomains, etc.

My first concern is always hacking, but then I am paranoid. Searching based on a recent time frame is good, but you'll also want to expand that window out to several months. I've had WP sites hacked before where they waited 4-5 months after the hack before they started generating pages. Considering that, I'd also search broadly for some common affiliate terms, and for anything that's particularly out of the ordinary for your site's content/products.

On Google Search Console, have you checked the URL Parameters section? That one is often overlooked, but can be extremely helpful when it comes to complex sites. See if there are any oddball parameters that show a ton of URLs being tracked.

Also on GSC, check the site crawl section. Does the reported crawl rate appear to correlate to the indexing changes? That won't necessarily tell you anything specific, but good to confirm.

Google currently has parts of their new version of GSC in beta. I'm testing it for one domain. From what I've seen with it so far, it appears they're making improvements to things like timing, data update frequency, etc. So I wouldn't be surprised if we see some "different" crawling/indexing behavior these past few months and beyond.

The last thing I'll mention is the obvious technical issues. With any complex site, especially an ecommerce site, it's just way too easy for something to slip: pagination, filters, parameters, archives, category pages. There's always the possibility that this is nothing nefarious, and that there was simply a lurking technical issue all this time. Googlebot and Bingbot have a tendency of finding these deficiencies and turning them into big problems. I definitely recommend checking over all the typical best practices for these things: robots.txt, meta robots tags, canonical tags, nofollowing certain internal links, etc.
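
For anyone following along, those tags look roughly like this on a collection or product page (the URLs and parameters below are made up, and whether each tag belongs on a given page depends entirely on your setup):
Code:
<!-- Canonical pointing at the preferred version of the page -->
<link rel="canonical" href="https://example.com/collections/widgets" />

<!-- Meta robots tag, e.g. on thin or duplicate pages you don't want indexed -->
<meta name="robots" content="noindex, follow">

<!-- Nofollow on internal links that lead into crawl traps (filters, sorts, search) -->
<a href="/collections/widgets?sort_by=price-ascending" rel="nofollow">Price: low to high</a>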
 
Using the site: operator for that time period (and more recently), no strange pages are showing up in either Bing or Google. Another 30k URLs just got indexed in the last 5 days.


1) What is the command to exclude a subdirectory with the site: operator?

2) I checked the URL Parameters section and may have found something strange. These parameters showed up:

rfsn _escaped_fragment_

This might stand for Refersion (a program we use for affiliate sales). I am unsure what this means exactly.

The crawl rate seems to correlate:
[Image: GSC crawl stats graph]


3) The robots.txt is not something you can edit as a Shopify user; Shopify maintains control of it. Since this problem started, I have made some changes to exclude sold-out products using meta robots tags. I expected this to drop the indexation, but it has since jumped another 30k.
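
For anyone curious, the general idea looks something like this in the theme (a rough sketch of the approach, assuming a standard theme.liquid; the exact template structure will vary):
Code:
{% comment %} Noindex sold-out product pages, but let bots follow links off them {% endcomment %}
{% if template contains 'product' and product.available == false %}
  <meta name="robots" content="noindex, follow">
{% endif %}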

4) I did receive a strange message from AdWords last month:

Your ad isn't running right now because it's disapproved for violating
Google's advertising policies. If you address the policy violations below,
we'll take a look and see if it can start running again.

=====================
Policy violations
=====================
Malicious or unwanted software: To help ensure the safety and security of
our users, we've disapproved your ad because it contains malicious software
(malware) or because your landing page is known to host or distribute
malware in violation of our policies. We strongly encourage you to
investigate this issue immediately in order to protect yourself as well as
your customers. To run your ad, follow these instructions to check your
computer for malware, remove all malicious code from your ad and site, and
submit your site and ads for review:
https://support.google.com/adwordspolicy/answer/6020954#311

5) I track 1,000 keywords and none have deviated much from their current placement, with the exception of two 1.5M ones. One got knocked out of the SERPs; the other is bouncing around, going from a solid 1st place to out of the SERPs and back to 11.
[Image: rank tracking chart for the two affected keywords]


I have reached out to a friend who works at Google and am waiting to hear back from him. Honestly, I am a bit perplexed.

Thanks for your insight and assistance!
 
As far as excluding subdirectories, you can just chain operators like this:
Code:
site:example.com -site:example.com/category1/ -site:example.com/category2/subcategory/

OR

site:example.com -site:www.example.com -site:api.example.com
(say you have multiple subdomains and want to exclude several to focus your search)

That is concerning about the AdWords notice. I'm no expert on AdWords violations, but the timing seems a bit too coincidental to me...

As far as the listed GSC URL parameter, did it say a number of URLs were being tracked? If so, is it a lot? You can use some of the settings to specify what the parameter does and also disallow crawling, but I don't know if that's related to this issue or not. Just BEWARE, it is potentially easy to create the wrong settings in that section, and block crawling important parts of your site if you're not careful. I'm not familiar with Refersion, so don't just block those URLs without confirming with someone else.

Maybe try a site search combined with exact match strings like "rfsn" or "rfsn_escaped" and see if a bunch of URLs show as indexed.
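
Something along these lines, for example (sub in your actual domain):
Code:
site:example.com inurl:rfsn
site:example.com inurl:_escaped_fragment_
site:example.com "rfsn"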
 
@turbin3 The timing is definitely a little strange. I just found this error in the schema.org markup on one item:

[Image: schema.org error reported on one item]


Could this be causing the indexation issue?

Shopify level-one support was unable to answer; they kicked it up to tier 2. I'm waiting to hear back on what is causing this error on their end.
 
I would check if your structured data is being duplicated. Use the Structured Data Testing Tool. Is the site using both microdata and JSON-LD?

Also, random question. Is your site using hreflang tags, and/or does it have multiple language versions present?

Any recent changes to the theme, plugins, etc.?

Something I've noticed over the past 2-3 months is Google changing several things about how structured data is handled/recommended, at least in certain verticals. I noticed at least 2 updates, IIRC, to the structured data testing tool, at least for one schema vertical. The short explanation is, they seemed to start adding supplementary schema as "recommended" for several important content types. In other words, "Hey bro. You really need to add this content and schema to your site....you know, so we can scrape it and steal your data for Knowledge Graph. Thanks bro."

On top of that, in another vertical, what was valid schema before on certain page types all of a sudden was no longer approved and would no longer validate with the testing tool. The end result was needing to pull that schema off hundreds of thousands, or millions, of pages and get them freakin' re-crawled to reflect no more "violations". Freakin' unlicensed "traffic cops" if you ask me. :wink:
 
@turbin3 "Is the site using both microdata and JSON-LD? "

We use JSON-LD. School me on microdata?

We are not using hreflang tags currently. It is in the works: setting up subdomains and trying to translate tens of thousands of products, blog posts, etc. I'm not sure it's worth the effort given potential changes in Google.

As for changes, the theme is constantly in flux, and yes, the newest additions are search filter apps.

Traffic cops.. :( could not agree more.

 
Microdata is just another type of structured data. Specifically, it is attributes placed directly within the appropriate HTML elements, to help clarify what each one is. Microdata is a lot messier to implement, especially on a large site.

That's why JSON-LD is so nice, since you can just have a single script tag that contains all the structured data for the whole page. Since you already have JSON-LD, you're covered there.
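
To make the difference concrete, here's the same made-up product marked up both ways:
Code:
<!-- Microdata: attributes scattered through the HTML itself -->
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <span itemprop="sku">EW-001</span>
</div>

<!-- JSON-LD: the same data in one self-contained script tag -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "sku": "EW-001"
}
</script>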

As far as hreflang tags, one interesting thing I found out is that they supposedly count towards a site's crawl budget. It just makes sense that Google is going to crawl multiple language versions if they exist.

Consider this, though, unless your site is multilingual and it doesn't apply. I had one large site where I was trying to max out on-page factors as much as possible, just to see how it performed. It was a single-language site. I added hreflang tags, but basically referencing the same page, almost like a canonical.
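
For clarity, the tag looked roughly like this (URL made up, and assume "en" for the language), with the page just pointing back at itself:
Code:
<link rel="alternate" hreflang="en" href="https://example.com/some-page/" />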

Over a period of 3-4 months, I noticed no significant improvement as a result of this. I didn't really expect one, but thought it would be interesting to test. What I did notice, however, was that that set of pages seemed to decline in crawl volume. I don't have any proof of this, and it may not even be a thing, but I suspect the self-referencing hreflang tag may have somehow cut into that site's crawl budget. I ended up removing the tag. Again, this may not apply to your site or anyone else's, but I thought it was interesting.

Your mention of new search filters is curious. Any sort of UI / functionality change like that would be one of the areas I'd focus on heavily. Especially on ecommerce sites, it is way too easy for one or two minor technical issues to generate tens of thousands, hundreds of thousands, or even millions of duplicated page versions.
 
Have you tried running a technical crawler through your site like Screaming Frog, Deep Crawl or SiteBulb?

They should be able to crawl the site, show you how many pages they find, and let you compare that against what is in your XML sitemap.
 
@secretsauce I have run Screaming Frog in the past. Not a bad idea to do it again.
@turbin3 Going to dig into the search filters and see. The errors started 2 months after the search filter install.
 
Search filters would be my guess. Check the source code for canonicals and see if URLs with parameters are being canonicalized to themselves, or if they have no canonical. Ideally, in many cases you'll want the URL + parameter pages to be canonicalized to just the URL without parameters.

Also, in many cases you might want a robots noindex,follow tag. In some cases, where you're dealing with things at massive scale and have to be very careful of crawl budget, you might want robots noindex,nofollow instead. Despite the absolute statements many SEOs make about "you shouldn't nofollow anything internal", it can be critical to do on sites at scale.
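
Put together, those options look something like this on a filtered collection URL (the path and parameters are just examples):
Code:
<!-- On /collections/shoes?filter=red&sort_by=price-ascending -->

<!-- Option 1: canonicalize the parameter version back to the clean URL -->
<link rel="canonical" href="https://example.com/collections/shoes" />

<!-- Option 2: keep it out of the index but still let bots follow links -->
<meta name="robots" content="noindex, follow">

<!-- Option 3: at massive scale, when crawl budget is the bigger concern -->
<meta name="robots" content="noindex, nofollow">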

I'd also run a few URLs through the Fetch As Google tool and see what you see.

Another interesting thing you might try is doing the same site searches on Bing. You can even play around with things like their version of the "inurl" search operator:
Code:
site:example.com instreamset:(url):mykeyword

I've found Bing to actually be very aggressive with their crawl rates on some larger sites. So in some cases, they might even find site issues a bit faster than Google, or at least oddball issues that might not show up on Google. Generally, their crawler logic seems to be a bit less advanced. Guess they make up for it with quantity sometimes. :wink:
 
luxer, I don't know. I've never used Shopify. You can look at the source code of any search page and see if it has a meta robots noindex tag. If you don't see anything there, you can look at the robots.txt and see if they disallow crawling on the search pages.
 
"Searches by default are not indexed by search engines." is this true?

There's never any guarantee of that, and that absolute statement of theirs would concern me if it were my site. With extremely large sites, things like on-site search, filters, pagination, etc. can be a massive pain. You can cover most of the obvious bases with canonicals, robots.txt exclusions, meta robots tags, and nofollow attributes where possible.
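
For reference, the usual robots.txt approach to on-site search looks something like this (you can't edit the file on Shopify, so this is only to illustrate what you'd hope to see in theirs):
Code:
User-agent: *
Disallow: /search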

The thing is, they will try to crawl anything and everything. Smaller sites don't usually see this end of the spectrum. The average blogger is happy to take anything they can get.

I'd recommend running multiple searches, playing with filters, and anything else you can do to generate a new URL. Grab a bunch of variations and start testing. Fetch as Google. Screaming Frog or Xenu Link Sleuth. I think Bing has a "Fetch As" tool also, but I forget. There's also:
  • DeepCrawl
  • Netpeak Spider
  • cognitiveSEO (I think they have one. Been a long time since I looked at them.)
  • Sitebulb sounds cool, but haven't had a chance to check it out.
 
I decided to make this a separate post so it doesn't blend in. Looking at your first GSC picture, something occurred to me. I've seen similar crawl/indexing trends a few times before, each due to different issues.

Notice the yellow "Blocked by robots" line? That's a heck of a lot being reported, and it's kept trending up. Something somewhere might be causing Googlebot to hit those blocked URLs regularly. This could be either external from your site, or internal links.

Internal
  • Resource files:
    • JS, CSS, PHP files linked for tracking or other purposes, or pretty much any other resource
    • If it's linked on page, they'll try to crawl it. Check Fetch as Google and see if it reports anything.
    • Also check the "Blocked Resources" section under Google Index on GSC. What, specifically, is it showing as blocked? Resource files or actual pages? I'd see if you can spot a trend.
  • Internal links without nofollow:
    • If there's a URL pattern that's crawlable, there may be followed links in your code on some obscure pages. Trust me, Google and Bing will find it and wear you out with the duplication.
    • Make a list of every possible unique page/URL variation on your site. Visit some of them and check the source for followed links that should be nofollowed.
External
  • FOLLOWED LINKS to pages you want blocked!
    • You may have your site locked down solid. Depending on your link profile, you might still feel some pain from the external factors.
    • Imagine gaining thousands of links from users sharing your search URLs on their blogs and social sites. You know, stuff like, "Hey girl, check out these hot shoes!" You can't control their links, and tons of them will inevitably be followed.
    • Google / Bing still find them, crawl them, whine about you blocking them, but continue to burn through your crawl budget regardless (quick sketch of what I mean below this list). Oh my God, am I all too familiar with this pain.
    • A nefarious SEO might even build those links to all the wrong pages, in the hopes of messing with your crawl budget... Not saying I've seen that or that it's necessarily a thing....but I'm sure someone out in the wild has done it. LOL
Anyways, don't overlook the external factors. Might be a non-issue for you though, depending on your link profile.
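
To make that external piece concrete, this is the kind of thing I mean (the domain and anchor text are invented):
Code:
<!-- On someone else's blog: a plain, followed link straight to one of your blocked search URLs -->
<a href="https://example.com/search?q=hot+shoes">check out these hot shoes!</a>

<!-- robots.txt keeps bots from fetching the page, but they still discover the URL,
     report it as blocked, and it shows up in that yellow line on your graph -->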
 
@turbin3 A lot to digest here.

There are over 1k affiliates. You got me thinking: how easy would it be for even one of them to fire up Scrapebox and blast their /affiliate-link across spammed-out, auto-accept blog lists from Fiverr? <facepalm>
 