Linking to Internal Pages Blocked by Robots.txt, Nofollow?

animalstyle

I have a handful of pages that I've set to disallow in my robots.txt. For example, I've blocked the pages where a user leaves a review for a specific location in my database.

Now obviously I still have to link to these pages so people can write reviews. Each page is populated dynamically based on the link you click from an individual location page, and I don't want people landing on them from the search results.
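
For illustration, the disallow rule for that kind of page might look like this in robots.txt (the /review/ path here is hypothetical; substitute whatever your review URLs actually use):

    User-agent: *
    Disallow: /review/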

I've just had a bunch of crawl errors for these pages pop up in Search Console.

My question is:

Should I be rel=nofollowing the internal links to these (and similar) pages?
 
Should I be rel=nofollowing the internal links to these (and similar) pages?

You're going to get two totally different answers on this. If you want to err on the side of caution, I would nofollow them, but I don't personally think it makes a difference, so I don't bother futzing with it.
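
If you do want the cautious route, it's just a rel attribute on the internal link. A quick sketch (the URL is made up):

    <a href="/location/123/review/" rel="nofollow">Write a review</a>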
 
I don't nofollow any internal links. Even if you ask them not to crawl, they often will, and even pages you've asked them not to index still get crawled, so you don't want to hinder that process anyway. You can use internal links to send crawlers where you want, and even pass some PageRank.

You definitely want to noindex anything dynamically created, though. And even with a robots.txt disallow and noindex directives on the page, they'll sometimes end up in the index anyway. I have folders and sections on my site that I occasionally have to tell Search Console to drop out of the index. I think the removal lasts 90 days and then you have to do it again, which is kind of dumb.
 
And even with a robots.txt disallow and noindex directives on the page, they'll sometimes end up in the index anyway.

Pick robots.txt or noindex; combining the two won't work well. Google will see the robots.txt disallow, never crawl the page, and so never see the noindex directive. Pages blocked by robots.txt can still show up in the index, just as bare URLs with no snippet.

Pages that aren't blocked by robots.txt but do have noindex won't be indexed, but they will still be crawled.
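
To make the noindex route concrete: the page has to stay crawlable and carry the directive itself, either as a meta tag in the head or as an HTTP response header. Both forms are standard; this is just a sketch:

    <meta name="robots" content="noindex, follow">

or, sent as a response header:

    X-Robots-Tag: noindex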

Another option is using Search Console to control which URL parameters are crawled.
 
On relatively small sites (a few thousand pages or less), I personally wouldn't bother with nofollowing internal links. As Ryuzaki said, G / B basically do whatever they want: they crawl whatever they want, in as little or as much volume as they want, and often index things you don't want indexed. Sometimes they seem utterly indiscriminate. Best to just add robots noindex tags on pages you truly don't want indexed, leave everything followed for G / B to sort out, and redirect or 404 things as necessary to eliminate duplicates. For small sites, you want as few roadblocks as possible so they can crawl as much of the site, as frequently, as possible.

For large or extremely large sites, the same logic may not, and probably does not, apply. In those cases it's often necessary to pull out all the stops to try to bring G / B in line, or else they will take things to their natural extreme. Trust me on that one; I have horror stories. Crawl rate optimization on large-scale sites can be a very complex and serious affair with massive implications, and it often doesn't follow conventional logic.

Speaking of Google's Search Console and Bing Webmaster Tools, if your site has a lot of dynamic stuff you're worried about getting indexed, don't forget to take advantage of both of their "URL Parameters" tools, which let you specify parameters you want ignored. Again, they're merely recommendations, and the evil G / B will do what they want; however, I have seen them help in some cases.
 
@Ryuzaki
You can use this to send crawlers where you want and even some page rank.

Are you implying we should use internal nofollow links to sculpt the flow of juice? Or am I trying to read between the lines when there's nothing really there?
 
Pick robots.txt or noindex, combining the two won't work well.

Agreed. I meant that I use a combination of the two across the site, but only one or the other on any given page or folder. I worded that post really poorly.

Are you implying using internal nofollow links to sculpt the flow of juice?

I worded it poorly; there's nothing to read between the lines. What I meant was: if you have a section of the site set to noindex, you can still leave it followed. For instance, on WordPress I noindex anything that's paged ( /page/2 ), but I leave it followed, because not only do those pages link to my own posts, but nofollow "leaks" PageRank anyway. It just doesn't end up anywhere, so you might as well cycle it around with followed links and not leave dead ends for spiders. And you can also choose where to send it with some crafty conditionals, or whatever the case calls for.
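
A minimal sketch of that pattern for a theme's functions.php, assuming stock WordPress with no SEO plugin already handling it:

    // Output noindex,follow on paged archives ( /page/2 and beyond )
    // so spiders can still crawl them but they drop out of the index.
    add_action( 'wp_head', function () {
        if ( is_paged() ) {
            echo '<meta name="robots" content="noindex, follow">' . "\n";
        }
    } );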
 