Be careful with your robots.txt

#1
Hi guys,

While working on e-commerce websites, we block junk/worthless URLs in robots.txt so Google won't crawl those pages and instead crawls our important pages (product pages, new content, etc.).

A client's website had ~600 pages indexed that were nothing but blog tag pages. We blocked them by adding a "Disallow: /tag/" rule to robots.txt. Big mistake. Disallow rules are matched from the start of the URL path, so since the tag pages live under /blog/tag/, they won't be blocked unless the rule is "Disallow: /blog/tag/". Luckily, I came across this and made the necessary changes.
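You can reproduce that prefix-matching behavior with Python's stdlib `urllib.robotparser` (the domain and paths below are stand-ins for the client's site; note the stdlib parser does plain prefix matching and doesn't support Google's `*`/`$` wildcards):

```python
from urllib import robotparser

# The (mistaken) rule from the post, as a hypothetical robots.txt
rules = """
User-agent: *
Disallow: /tag/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# "/tag/" only matches paths that START with /tag/, so the
# blog tag pages under /blog/tag/ slip right through:
print(rp.can_fetch("*", "https://example.com/blog/tag/foo"))  # True  (NOT blocked)
print(rp.can_fetch("*", "https://example.com/tag/foo"))       # False (blocked)
```

A quick script like this against a list of your real URLs is a cheap sanity check before a rule goes live.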

How can you check whether a directive you've added to robots.txt is working correctly? Use the robots.txt tester in Google Search Console, or a 3rd-party tool such as https://technicalseo.com/seo-tools/robots-txt/

TL;DR - Always cross-check any new rule you might be adding to your robots.txt file in GSC's robots.txt tester or a 3rd party tool.
 
#2
It's so easy to screw up an entire project over something as basic as using relative URLs. I've been doing this for years now and still screw up relative URLs. "Should I use the forward slash up front or not? Do I use one period or two up front?"
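Those slash-and-dot questions can be answered in a sandbox before they break anything, e.g. with Python's `urljoin` (the base URL here is a made-up example):

```python
from urllib.parse import urljoin

# Hypothetical page the relative reference appears on
base = "https://example.com/blog/post/"

print(urljoin(base, "style.css"))     # https://example.com/blog/post/style.css
print(urljoin(base, "/style.css"))    # https://example.com/style.css  (leading slash = from root)
print(urljoin(base, "./style.css"))   # https://example.com/blog/post/style.css (same directory)
print(urljoin(base, "../style.css"))  # https://example.com/blog/style.css (one directory up)
```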

The only reason I haven't destroyed a lot of CSS or even robots.txt is because I have enough sense to check the work, like you say to do.
 
#3
Absolutely. Once I came across the above example, I ran the same check on other websites as well.

Turns out Google is ignoring the disallow directive (correctly implemented) and the canonical tag (set to the product page), which has resulted in ~2700 junk pages being indexed and crawled regularly. Worst of all, these links no longer exist anywhere on the site, because we stopped generating them (a review plugin was the cause) some time back, so Google is recrawling them purely from its own index.

I have now removed the disallow directive and replaced it with a Noindex directive. Hopefully, this will work.