Block Link and Rank Checkers

The Kloser

Is there a quick way to use a robots.txt file or anything else to block Ahrefs, Moz, Majestic, and a bunch of KW rank checker tools from crawling a site?
 
Disclaimer: I'm not very knowledgeable about server / .htaccess configuration, so the following info might be wrong.

Some servers have mod_rewrite (the solution above) disabled, so to be sure, check with botsimulator.com whether the user agent actually gets blocked (I found this out the hard way...).

If it's disabled, you can try other modules (depending on whether they're enabled) in your .htaccess file:

Code:
BrowserMatchNoCase BOTNAMEHERE bad_bot
BrowserMatchNoCase BOTNAMEHERE bad_bot

Order Deny,Allow
Deny from env=bad_bot

or

Code:
BrowserMatchNoCase BOTNAMEHERE bad_bot
BrowserMatchNoCase BOTNAMEHERE bad_bot

Order Deny,Allow
Deny from env=bad_bot
 
Use a robots.txt file with this - http://pastebin.com/dnkEeeEk

and

edit your .htaccess file with this - http://pastebin.com/wGwHLUcZ
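
For reference, the robots.txt side of it usually looks something like this - just a sketch, NOT the contents of that pastebin, using the user-agent tokens these tools publish (AhrefsBot, MJ12bot for Majestic, rogerbot/dotbot for Moz, SemrushBot); double-check each tool's docs for the current names:

Code:
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: rogerbot
Disallow: /

User-agent: dotbot
Disallow: /

User-agent: SemrushBot
Disallow: /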

LOL, OMG, anything redirecting to webmd.com is from my ancient, ancient thread at WF about this. This is an OLD version of this file. There are a ton of new bots on the scene too.

I will say this: none of this will stop crawlers that don't play fair and use human user-agents instead of telling you they are "BLEXBOT". Ahrefs, SEMRush, and the big guys all respect the robots.txt (SOMEWHAT), and you can even block the wget guys with .htaccess, but if someone is smart enough they'll just spoof who they say they are and pretend to be an up-to-date browser.
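
For what it's worth, the wget/curl crowd can be caught with the same pattern as the earlier snippet - a rough sketch assuming their default user-agent strings (which contain "Wget" and "curl"), something anyone can change with a single flag:

Code:
BrowserMatchNoCase Wget bad_bot
BrowserMatchNoCase curl bad_bot

Order Deny,Allow
Deny from env=bad_bot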
 
I'm only concerned about the big guys right now - the bulk of the users.

@CCarter Will you get us an updated list of bots when you have a chance? Gracias!
 
Code:
BrowserMatchNoCase BOTNAMEHERE bad_bot
BrowserMatchNoCase BOTNAMEHERE bad_bot

Order Deny,Allow
Deny from env=bad_bot

Sorry, the second code was supposed to be

Code:
# flag matching user agents (one SetEnvIfNoCase line per bot)
SetEnvIfNoCase User-Agent "BOTNAMEHERE" block_bot
SetEnvIfNoCase User-Agent "BOTNAMEHERE" block_bot

# allow everyone except flagged requests (Apache 2.2 mod_access syntax)
Order Allow,Deny
Allow from all
Deny from env=block_bot
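
Side note: Order / Allow / Deny is Apache 2.2 syntax. On Apache 2.4 it only works if mod_access_compat is loaded; the native 2.4 equivalent of the block above would look roughly like this (same SetEnvIfNoCase lines, different access section):

Code:
SetEnvIfNoCase User-Agent "BOTNAMEHERE" block_bot

<RequireAll>
    Require all granted
    Require not env block_bot
</RequireAll>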

Here are a few of the newer backlink crawler bots you might want to block (combined into one snippet after the list):

JamesBOT (http://cognitiveseo.com)
SEOkicks-Robot (https://www.seokicks.de)
SearchmetricsBot (http://searchmetrics.com)
LinkpadBot (http://www.linkpad.ru/)
spbot (http://www.openlinkprofiler.org)
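
Plugging those names into the corrected snippet above - a sketch only, since the exact match strings are my best guess at the distinctive part of each bot's user-agent (check your own access logs to confirm them):

Code:
SetEnvIfNoCase User-Agent "JamesBOT" block_bot
SetEnvIfNoCase User-Agent "SEOkicks" block_bot
SetEnvIfNoCase User-Agent "SearchmetricsBot" block_bot
SetEnvIfNoCase User-Agent "LinkpadBot" block_bot
SetEnvIfNoCase User-Agent "spbot" block_bot

Order Allow,Deny
Allow from all
Deny from env=block_bot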
 
@The Kloser As someone noted above, some rewrite rules won't work with certain hosts due to different hosting setups, so it's extremely important that you test each and every site individually.

You can test that with a user-agent switcher extension:
for Firefox: https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/
for Chrome: https://chrome.google.com/webstore/detail/user-agent-switcher-for-c/djflhoibgkdhkhhcedjiklpkjnoahfmg
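
If you'd rather test from the command line instead, curl can fake the user agent too - a quick sketch (example.com is a placeholder for your own site; a working .htaccess block should return a 403 instead of a 200):

Code:
# pretend to be AhrefsBot and check the response code
curl -I -A "AhrefsBot" https://example.com/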

If you don't want to mess with .htaccess and code, you could use a plugin to do the job.

There are free plugins out there, and then there's paid stuff like Spyder Spanker.
For a free plugin you could use Link Privacy.

@CCarter Not all link crawlers obey robots.txt - I've had cases where Ahrefs crawled a site even though there was a specific rule that disallowed it.
 
@CCarter Not all link crawlers obey robots.txt - I've had cases where Ahrefs crawled a site even though there was a specific rule that disallowed it.
That's why I stated "SOMEWHAT"

Ahrefs, SEMRush, and the big guys all respect the robots.txt (SOMEWHAT)

I was in a Slack group where I confronted the Ahrefs CEO about the fact they were using an Eastern European country's ISP to piggyback into a Private Blog Network that was specifically blocking Ahrefs in the robots.txt AND within the .htaccess file. They were cloaking their user-agent and location, and indexing these sites which were specifically hidden FROM THEM. The CEO denied it, but evidence was evidence - hard to disprove when you're looking at the logs, the blocking files, and then Ahrefs indexing these pages, which was clearly against the robots.txt rules.
 
Creating specific exclusions in robots.txt, .htaccess, nginx.conf, etc. is a fun exercise, and still a decent best practice. That being said, people will probably get better long-term and more scalable results by developing systems/methods to monitor traffic behavior and simply blocking traffic that exhibits certain characteristics. Of course, once you reach that point and begin to think, "How would I scrape my site???", you might come to realize there is NEVER a foolproof method. There are a lot of creative ways to build bots that most would never be able to detect.
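
As a very rough illustration of that behavioral angle (not something from this thread, just one common way to do it): a fail2ban jail that bans any IP hammering the server, no matter what user-agent it claims to be. This assumes an Apache access log at /var/log/apache2/access.log, and the thresholds are arbitrary - a busy office NAT can trip it, so tune it to your own traffic:

Code:
# /etc/fail2ban/filter.d/req-rate.conf - custom filter that counts every request per IP
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD)
ignoreregex =

# /etc/fail2ban/jail.local - ban anything over 300 requests in 60 seconds for an hour
[req-rate]
enabled  = true
port     = http,https
filter   = req-rate
logpath  = /var/log/apache2/access.log
findtime = 60
maxretry = 300
bantime  = 3600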
 
That's why I stated "SOMEWHAT"
I know, I just wanted to emphasize that there are cases where they don't.
TBH I don't even use robots.txt directives, as I think they leave a footprint and make it pretty obvious that you're trying to hide.
I was in a Slack group where I confronted the Ahrefs CEO about the fact they were using an Eastern European country's ISP to piggyback into a Private Blog Network that was specifically blocking Ahrefs in the robots.txt AND within the .htaccess file. ...
Glad you brought that up - it just shows how ineffective it is to rely on the robots file alone, as these crawlers often disregard it.
 