Be careful with your robots.txt

Hi guys,

While working on e-commerce websites, we block junk/worthless URLs in the robots.txt file so Google won't crawl those pages and will instead crawl our important pages (product pages, new content, etc.).

A client's website had ~600 pages indexed that were nothing but blog tags. We blocked them by adding a "Disallow: /tag/" rule in robots.txt. Big mistake. Turns out they won't be blocked unless the rule is "Disallow: /blog/tag/", because robots.txt rules are matched against the URL path from its start, and these tag pages all live under /blog/. Luckily, I came across this and made the necessary changes.
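
If you want to see that behaviour for yourself, here's a minimal sketch using Python's standard-library robots.txt parser. The domain and paths are made up for illustration; swap in your own.

import urllib.robotparser

# The rule we originally deployed.
wrong = urllib.robotparser.RobotFileParser()
wrong.parse([
    "User-agent: *",
    "Disallow: /tag/",
])
# Paths are matched from the start, so /blog/tag/... is NOT covered.
print(wrong.can_fetch("*", "https://example.com/blog/tag/shoes/"))  # True (still crawlable)
print(wrong.can_fetch("*", "https://example.com/tag/shoes/"))       # False (blocked)

# The corrected rule.
fixed = urllib.robotparser.RobotFileParser()
fixed.parse([
    "User-agent: *",
    "Disallow: /blog/tag/",
])
print(fixed.can_fetch("*", "https://example.com/blog/tag/shoes/"))  # False (blocked)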

How can you check whether a directive you've added to robots.txt is working correctly? Use the robots.txt tester in Google Search Console, or a third-party tool such as https://technicalseo.com/seo-tools/robots-txt/
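
You can also script the same spot-check against your live file once a change is deployed. A rough sketch, again with placeholder URLs:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # your site here
rp.read()  # fetches and parses the live file

# Spot-check a handful of URLs you expect to be blocked or allowed.
for url in [
    "https://example.com/blog/tag/shoes/",
    "https://example.com/product/blue-widget/",
]:
    print(url, "->", "allowed" if rp.can_fetch("Googlebot", url) else "blocked")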

TL;DR - Always cross-check any new rule you might be adding to your robots.txt file in GSC's robots.txt tester or a 3rd party tool.
 
It's so easy to screw up an entire project over something as basic as relative URLs. I've been doing this for years now and I still screw them up. "Should I use the forward slash up front or not? Do I use one period or two up front?"

The only reason I haven't destroyed a lot of CSS or even robots.txt is because I have enough sense to check the work, like you say to do.
 
Absolutely. Once I came across the above example, I ran a check on other websites as well.

Turns out Google is ignoring the disallow directive (implemented correctly this time) and the canonical tag (set to the product page), which has left ~2700 junk pages indexed and being crawled regularly. Worst of all, the site itself gives Google no way to reach these links any more, because we stopped generating them (a review plugin was causing it) some time back.

I have now removed the disallow directive and replaced it with a Noindex directive. Hopefully, this will work.
 
You could also do:
Disallow: */tag/

This would match any URL with "/tag/" in it, since Google treats * in robots.txt rules as a wildcard for any sequence of characters.
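
If you want to sanity-check a wildcard rule offline, note that (as far as I know) Python's built-in robotparser doesn't understand Google's * and $ extensions, so a rough workaround is to translate the rule into a regex yourself. The function name and test paths below are just for illustration:

import re

def google_rule_matches(rule: str, path: str) -> bool:
    # Escape regex metacharacters, then map robots.txt wildcards:
    # "*" -> any sequence of characters, "$" -> end of URL.
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    # Rules are prefix matches, so anchor only at the start.
    return re.match(pattern, path) is not None

print(google_rule_matches("*/tag/", "/blog/tag/shoes/"))  # True (blocked)
print(google_rule_matches("*/tag/", "/tag/shoes/"))       # True (blocked)
print(google_rule_matches("/tag/",  "/blog/tag/shoes/"))  # False (the original mistake)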
 