Robots.txt Overrides Meta Tags

Ryuzaki

Money over women
Staff member
BuSo Pro
Digital Strategist
Joined
Sep 3, 2014
Messages
3,574
Likes
6,710
Degree
7
A heads up for the builders.

I'm in the process of doing massive behind the scenes work for my case study site here, and part of that is crafting a giant robots.txt.

Building a database-driven site with templates means you're going to have one header file being pulled for every page. I hate plugin dependency and don't really trust myself to create a custom set of tables just to slap customized meta tags into the header. The best I could do is create a sophisticated if/elseif/else chain to add tags to the right pages. That's dumb to me at this level of the game.
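For anyone who does go the conditional route in a shared header, a minimal sketch of the idea follows. It assumes the header template can see the request path; the prefixes and the function name `robots_meta_for` are hypothetical placeholders, not anything from the case study site.

```python
# Hypothetical path prefixes; swap in whatever sections should stay out of the index.
NOINDEX_PREFIXES = ("/search/", "/tag/", "/print/")


def robots_meta_for(path: str) -> str:
    """Return the robots meta tag for a page, or an empty string to emit none."""
    # str.startswith accepts a tuple, which replaces a long if/elseif chain.
    if path.startswith(NOINDEX_PREFIXES):
        return '<meta name="robots" content="noindex">'
    return ""
```

The shared header would print the return value inside `<head>`, so adding a section means editing one tuple instead of another branch.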

So I thought, hey, I'll just tell the search engines not to crawl the pages I don't want indexed, and I'll do so in the robots.txt...

It's not going to cut it. For example:



About.com told Google specifically: don't crawl this folder. "Don't crawl" should mean "don't even look at it." Yet look at this:



Google went ahead and indexed the 2,760 URLs from that folder. They aren't showing the content, titles, descriptions, etc., but they are indexing the URLs.

Do they show up for any legit search terms? Probably not. Could our sites get caught in some stupid Panda crossfire due to something like this? Most likely.

So my heads up is to definitely make sure you use:

<meta name="robots" content="noindex">
if you want to make sure a page isn't indexed. And if you do this, don't give the same directives in robots.txt, or it will take precedence and Google will ignore the meta tags. My guess is that, since they are choosing to disobey robots.txt directives (guaranteed they crawl the pages for their own data, as the pictures above essentially prove), they see the meta tags and ignore them as well.

But they seem to not ignore the meta tags when the robots.txt isn't "confusing" their crawlers. They crawl and follow but will at least not index.

TL;DR
Don't double up duties in the robots.txt and meta robots tags. Use meta tags when you can. Yoast's WordPress plugin makes this easy in WordPress, for instance. Find a solution and do it right, or you'll end up like About.com. Search engines don't have to obey robots.txt or meta tags, so try to figure out which they are choosing to respect and go with that. In this case, Google will respect meta robots tags as far as indexing goes, as long as you don't double up in the robots.txt.
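To audit a site for the doubling-up described here, something like the following rough sketch could work. It uses only Python's standard library; the helper names `RobotsMetaFinder` and `doubled_up` are made up for illustration, and `urllib.robotparser` applies its own matching rules, which may not mirror Googlebot's exactly.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def doubled_up(robots_txt: str, url: str, html: str, agent: str = "Googlebot") -> bool:
    """True when the page carries noindex AND robots.txt blocks crawling it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives and not rules.can_fetch(agent, url)
```

Feed it the text of your robots.txt plus each page's URL and HTML; any page that returns True is one where the Disallow rule should probably be dropped so the noindex tag can do its job.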
 

emp

BuSo Pro
Joined
Nov 7, 2014
Messages
587
Likes
591
Degree
2
The problem is that neither robots.txt nor meta tags have to be honored.
Honoring them is considered "good manners," but it is not binding for any robot or user agent.

::emp::
 
Joined
Nov 29, 2014
Messages
48
Likes
43
Degree
0
Frustrating. The last paragraph mentions it here on Google support (cached version; the live page is down for me). I've also found that 301-redirecting to a blocked URL slaps it in the index.
 

Stephen

Ecommerce SEO / SEM
BuSo Pro
Joined
Jan 21, 2015
Messages
222
Likes
118
Degree
1
What you don't know, of course, is when that specific robots.txt directive was added. Google will keep indexing that shit until they find the file updated.

Pro tip: don't add a Google+ button to any pages you don't want indexed, especially dev sites. That one goes out to the devs who can't build their development sites properly!
 
Joined
Feb 21, 2015
Messages
31
Likes
13
Degree
0
Pro tip: don't add a Google+ button to any pages you don't want indexed, especially dev sites. That one goes out to the devs who can't build their development sites properly!
You only need a plugin/addon that uses any Google services in the background, like translation, PR check etc. Back in the day my favorite trick to get new pages indexed without a hassle was to do a simple pagerank check for them. Google's toolbar was a mighty effective way to crowdsource index building.
 
Joined
Jul 22, 2014
Messages
71
Likes
33
Degree
0
My guess is that, since they are choosing to disobey robots.txt directives (guaranteed they crawl the pages for their own data, as the pictures above essentially prove), they see the meta tags and ignore them as well.
I've had sites where Google has ignored both the robots.txt file and the robots meta tag, together and individually. One other option is to make sure you exclude these pages from your sitemap, and also to specifically exclude them in Webmaster Tools. But in the end, Google does what it wants.
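For the sitemap part, a small sketch of filtering unwanted URLs out at generation time. It assumes you already have a flat list of page URLs; the function name `build_sitemap` and the example prefixes are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, excluded_prefixes):
    """Build sitemap XML, skipping any URL that starts with an excluded prefix."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if url.startswith(tuple(excluded_prefixes)):
            continue  # keep pages you don't want indexed out of the sitemap
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_sitemap(all_urls, ["https://example.com/private/"])` returns the XML as a string with the private section left out. That only stops you from advertising those pages; it doesn't stop a search engine from finding them some other way.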