Crawl Depth, Crawl Budget, Sitemaps, & Interlinking

Ryuzaki
I've been spending a chunk of time this month getting some lingering site build things out of the way. This should mark my site moving to its final form and I'll never have to think about these things again. But of course, opening one can of worms leads to another.

As I've been doing some 301 redirects and extending the length of some content, I began to think about interlinking since I was changing some links from old URLs to new ones. I ended up reading this post about Pagination Tunnels by Portent that was interesting if not necessarily useful.

You don't have to click the link to know what's in it. They set up a site with enough content to have 200 levels of pagination in a category. Then they tested several forms of pagination to cut the number of crawl steps down from 200 to 100, and ultimately to 7 using two types of pagination. One was the kind you see on some forums like this:

[image: standard forum-style pagination links]

The other had the idea of a "mid point" number like this:
[image: pagination with mid-point numbers]

Where 12 and 113 are the mid points. These cut the crawl depth down to 7 leaps. That ends up looking like this:
[image: resulting 7-step crawl path]

But I'm guessing that Google's spiders aren't real thrilled about even 7 steps.

I don't plan on doing any crazy pagination tricks. I don't necessarily plan on changing much of anything other than interlinking.

The reason for that is our sitemaps act as an entry point, putting every page at most 2 steps away from the first point of crawling. Would you agree with this? If you put your sitemap in your robots.txt, Bing and everyone else should find it.
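For reference, the robots.txt part is just a one-line directive (the URL below is a placeholder for your own sitemap or sitemap index location):

```
# robots.txt at the site root
User-agent: *
Disallow:

# The Sitemap directive is picked up by Google, Bing, and most other major crawlers.
# Pointing it at a sitemap index file works too.
Sitemap: https://www.example.com/sitemap_index.xml
```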

But, for the sake of discussion and possibly enhancing our sites, let's say that the sitemap doesn't exist, and that there are no external backlinks, and you want to solve this the best you can with interlinking. Without interlinking, some if not most pages are basically going to end up as orphans.

Do you think there's any benefit to ensuring every single page is interlinked contextually to another one at least one time, or is that just anal retentive thinking? Assuming every single post is optimized for at least one term if not a topic and could stand to bring in even the tiniest bit of traffic per month, is this even worth the bother?

Of course we intend to interlink more to the pages that earn the money. Are we harming the page rank flow by linking to every post once, or enhancing it? Assuming that once Google gets a post in its index (we're pretending sitemaps don't exist here), it'll crawl those once in a blue moon. Interlinking should ensure each page is discovered and crawled again and again.

I'm not suggesting we make some absurd intertwined net of links. We'd still do it based on relevancy or link to the odd post from the odd post where there is no real relevancy.

The possible benefit would be ensuring Google indexes the maximum number of pages possible, which will have sitewide benefits related to the size of the domain and the internal page rank generated. The downside is flowing page rank to less important pages a bit more.

Also, what do you suppose Google's crawl depth really is? How many leaps will they take from the starting point?

And finally, do you know of any spidering software that can crawl while ignoring specified sitewide links (navigation, sidebar, footer, and category links) and then tell you which posts are essentially orphaned?

This is a pretty scattered post, not well thought out, but it should be worth talking about, especially for eCommerce and giant database sites.
 
You know, it's funny how the human mind works. Seemingly small things, like a smell, can transport your mind instantaneously back to some past experience as if it happened yesterday. In that same vein, this post just made me relive, in seconds, the past several years of work on a few massive sites. Funny how that works! :wink:

This is a subject close to my heart, and one I've spent the vast majority of my time trying to master over the past several years. In that time, I've waged a ruthless war against duplicate content. I've been battling crawl budgets like there's no tomorrow. I've also been struggling to maintain my own sanity.

One of the things I've put a significant amount of effort into over the past few years, is modeling bot crawl behavior across several of my sites. I'm talking big data scale, billions (literally) of rows of traffic log data. It's actually brought to light some REALLY interesting learning points. Mostly, it's just bolstered my paranoia about locking down site structure mercilessly, until there is only one true way for search bots to crawl. That's the dream at least. LOL


Not Everything is Relative
Dealing with extremely large sites is just a whole other ball game, the best practices of which often fall contrary to popular "Moz" fare. That's something most people just don't get. Some of what works on a 100 page blog, might actually be a TERRIBLE thing to implement on a 1M+ page site that has near-infinite possible crawl paths.

Anyways, I'll try to be concise, since I feel your pain. First off, I feel there's probably some sort of threshold at which certain on-page or domain-level factors have their ranking coefficients significantly reduced or elevated.

Take blog / WP taxonomy pages, for example. On a small site, it might make sense to allow the first page of your taxonomy pages to be indexed, while robots noindex/following the rest. It's trivial to make some template tweaks and even throw some supplemental content on those archive pages to enhance them.

I can see how people justify that. Even though the possible downsides may not matter a whole lot on a small site, I can say that it's something I will NEVER be doing again on any of my small sites. I just wouldn't be able to sleep at night, knowing there is actually a single, non-canonical, crawlable page on one of my small sites. Maybe this is what PTSD from big site SEO looks like. :confused:


Stripping Site Structure to the BONE
On large sites, I've pretty much come to the conclusion that it's probably best to physically remove ALL possible low quality crawl paths. This could be actual pages. In some cases, URL variations like query strings. Maybe from facets or other things.

Trying to "game" the blocking and get creative, like using JS links, doesn't work either. They'll still be crawled, trust me. If the entire <a> tag is generated with JS, it's probably gonna get crawled. If the <a> tag is hard-coded, and just the href is JS-generated, you can hardcode nofollow and that will help. Though, best to simply not have that pathway at all, if it's avoidable.

Think about the nature of stuff like that at scale. Like say I have a widget with a couple dozen links I don't want crawled. Now say that widget is sitewide or on a significant number of pages. Sure, you can do like the above, hardcode nofollow, etc. At scale it still means a percentage of on-page, nofollowed (link-juice-losing) links on a huge number of pages. This stuff can get to be a parasitic drain, in my opinion, at least on large sites where every little bit counts.

I'd really like to be able to define terms, to make more sense of this for everyone, but it's extremely tough. What is a large site? It's tough to say. I mean, at the point a site has 100K to 1M+ pages, yeah, that's a large site in my opinion. 10K pages? I'd still call that "large", if you think about the relative page volume compared to the average site (blogs, SMB sites, etc.). Some might not.

Concerns For Crawl Budget
As far as the "budget" goes, I think it's important to try and establish trends for the niche, to figure out what type of crawl behavior is desired or should be prioritized. Here are some of the major factors:

Crawl Frequency
  • Check competitors' indexed equivalent pages over time. Use Google's cache dates to get a sense of how frequently their core pages might be getting crawled.
  • Do some analysis on your own traffic logs. Take a subset of pages, like maybe a subdirectory, if that makes sense for your site. Notice any trend in how frequently the pages are getting crawled? (A quick log-parsing sketch follows this list.)
  • In my case, in at least one niche I noticed frequency was extremely important. Certain subsets of deep pages (like 5-7+ click depth) were only getting crawled once every 2 or 3 months. This was horribly affecting rankings, when some competitors pages had content updates almost daily, and crawl frequency every day or few days.
  • SERPWoo is a lifesaver for correlating crawl frequency behavior with ranking trends. This lets you actually identify ballpark behavior for the niche. Also helps understand just how frequently a set of pages needs to be crawled to rank in a stable manner. I've been able to correlate this across multiple SERPs, and see some very interesting behavior. Stuff like infrequent crawls leading to rankings that look like the EKG of someone about to flatline. Like declining, a page crawl, blip (YAY! Our page is catching its breath.....oh wait now it's dying again!).
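Here's a minimal sketch of the kind of log analysis I'm talking about. It assumes a standard combined-format access log, a naive user-agent match (verify Googlebot via reverse DNS if you need certainty), and an illustrative file path and subdirectory filter; none of that is anything official.

```python
import re
from collections import defaultdict

# Naive combined-log-format parser: capture date, request path, and user agent.
# Assumes lines like:
# 1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /guides/foo/ HTTP/1.1" 200 5316 "-" "Mozilla/5.0 ... Googlebot/2.1 ..."
LINE_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}):[^\]]+\]\s+"GET\s+(\S+)[^"]*"\s+\d{3}\s+\S+\s+"[^"]*"\s+"([^"]*)"'
)

crawls = defaultdict(set)  # path -> set of dates Googlebot hit it

with open("access.log") as f:  # path is illustrative
    for line in f:
        m = LINE_RE.search(line)
        if not m:
            continue
        date, path, ua = m.groups()
        if "Googlebot" in ua and path.startswith("/guides/"):  # pick your subset of pages
            crawls[path].add(date)

# Rough crawl-day counts per URL over the log window, least-crawled first
for path, dates in sorted(crawls.items(), key=lambda kv: len(kv[1])):
    print(f"{len(dates):>4} crawl days  {path}")
```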

Crawl Depth

  • URL parameters, filters, facets, archives, date archives, protocol and subdomain variations ALL need to be considered and accounted for on a large site. For example, spammers, affiliates, negative SEO'ers and others can just link indiscriminately to your site. Does your server setup account for these factors, or might there possibly be some gaps? (A small parameter-whitelisting sketch follows this list.)
  • Imagine page sets generated based on URL structure or query string. A site might use this to try dynamically building pages to capture maximum users, offering default logic as a fallback. In essence, always providing some sort of default content at a minimum, to try and capture and retain the user. Search engine-type stuff.
  • Now imagine someone sees your site is setup in this way. If they're particularly mean, maybe they might build a bunch of links to made up URLs on your site, with payday loan or porn keywords. And whadda' 'ya know?! Google crawls them, and maybe even indexes some of them! Worse yet, say there's unaccounted for link control on page...
  • Maybe those generated SERP pages (that's effectively what they are) happen to have logic that generates supplemental links on page, based on the query. And maybe those link placements were overlooked before, and happen to not be nofollowed and not restricted by robots.txt. A day or two later, and maybe now you have 30-40K+ payday loan and porn SERP pages indexed, because they happened to find links with logic generating
    1. Original query string
    2. Query + US city == a brand new page!
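To make the "does your server account for this" point concrete, here's a hedged sketch of one defensive pattern: whitelist the query parameters a page type actually supports and hard-404 everything else, so made-up URLs never resolve to a crawlable 200. Flask and the parameter names here are purely illustrative; the same idea applies to whatever stack you're on.

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Illustrative whitelist: the only query params this page type legitimately supports.
ALLOWED_PARAMS = {"page", "sort"}
ALLOWED_SORTS = {"newest", "price_asc", "price_desc"}

@app.route("/category/<slug>/")
def category(slug):
    # Reject any URL carrying params we never generate ourselves,
    # so spammed links like ?kw=payday+loans can't mint brand new 200 pages.
    if set(request.args) - ALLOWED_PARAMS:
        abort(404)
    if request.args.get("sort", "newest") not in ALLOWED_SORTS:
        abort(404)
    # ...look up the category; 404 if the slug itself doesn't exist in the DB.
    return f"Category: {slug}"

if __name__ == "__main__":
    app.run()
```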
Click Depth
  • On a few of my larger sites, I noticed at least 2 distinct crawl behaviors.
    1. High click depth pages ignored
      • There is logic behind Googlebot that prioritizes certain page types or site sections, while reducing importance of others. Exactly how that behavior is determined, is anyone's guess. Maybe it's a combination of on-page factors, user traffic and engagement metrics, or who knows what else. I'd bet other factors are history of page updates, as gathered by crawl history (maybe related to the tech behind the rank transition platform?).
      • What I haven't heard many people talk about is educated guesses about what site/page characteristics may or may not contribute to page sets falling into the lower importance category. I'm not even sure myself. Though, I want to say page quality + lack of traffic and/or lack of engagement are probably significant factors.
      • It's that thinking that's pushed me down the path, at least with certain large sites, towards consolidating and reducing page volume, to attempt boosting page quality and focusing traffic + UX on a smaller set of pages in the hopes of positively affecting crawl behavior.

    2. High click depth pages crawled to their natural extreme
      • Surprisingly, I noticed this quite a bit from Bingbot. Like 10X crawl rate over Googlebot (Bill G. probably playing "catch up" hahaha).
      • Over the past 2-3 quarters, surprising and substantial increase in crawl rate by Yahoo's slurp bot. Didn't see a lot of logic to their crawl behavior. Guess they're trying to figure out what they're even doing now.
      • In several cases, massive drill down by Googlebot. Seemed to occur under some set of circumstances, though I could never quite figure out what. I'm guessing it was the appearance of the right, followable links, on the right page placement, such that they got prioritized and crawled to extreme depth.
  • Robots meta tag logic, use of internal nofollow, robots.txt, and other blocking methods are absolutely needed on, I would say, most large sites. Most people that say "you shouldn't" have likely never dealt with a large site before. It's just a different ballgame.
  • A major factor in reducing click depth is killing it with your category page game! :wink: I think about "categories" differently from the standard WP / blogger fare. If, for example, you have a huge number of pages under a category or sub-category, I would say you might look at creating MORE cat/sub-cat pages!
  • For example, say I have 200 pages in 1 cat. Maybe it makes sense to break that into 2+ sub-cats with ~100 pages each? The thing with this is, it's far easier figuring out creative ways to get high-level internal linking, close to the homepage, from the cat/sub-cat level. Way harder to do at the low level. So maybe that helps take things from 7 levels deep to 6, or whatever. I definitely believe reducing click depth on large sites is a significantly important factor to help improve crawl behavior. (A quick depth calculation follows the diagram below.)
TL;DR: Focus on the purple and green areas below, to reduce click depth ↓

[image: ecommerce on-page SEO site structure diagram]
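To put rough numbers on the category-splitting point, here's a tiny sketch of the back-of-the-envelope math. It assumes sub-categories get linked close to the homepage (nav/menu) and simple next-style pagination; the numbers are purely illustrative.

```python
import math

def max_click_depth(posts_in_category: int, posts_per_page: int, subcategories: int = 1) -> int:
    """Worst-case clicks from the homepage to the deepest post, assuming
    home -> (sub)category page 1 -> next, next, ... -> post."""
    posts_per_cat = math.ceil(posts_in_category / subcategories)
    last_page = math.ceil(posts_per_cat / posts_per_page)
    # 1 click to the (sub)category, (last_page - 1) pagination clicks, 1 click into the post
    return 1 + (last_page - 1) + 1

# 200 posts in one category, 20 per page: 1 + 9 + 1 = 11 clicks at worst
print(max_click_depth(200, 20))                    # 11
# Same 200 posts split across 2 sub-categories: 1 + 4 + 1 = 6 clicks at worst
print(max_click_depth(200, 20, subcategories=2))   # 6
```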

Sitemaps
In my experience, there appears to be a "threshold" of sorts in behavior and usage surrounding sitemaps as well. I have some very real data behind this, and it's been interesting to say the least.

In one site optimization campaign, I was testing indexing and ranking of a large set of pages, to try and establish a baseline trend for a niche. I started with around 500K pages, so just over 50 sitemaps at max 50K URLs in each. This was unfortunately a volatile page set. For various reasons, these pages were not HTTP 200 and indexable 100% of the time. Some would lose content for various reasons, like user-generated content being removed. Based on the page logic, they had several conditionals when under a certain level of content:
  • Meta robots noindex/follow
  • Canonical to a more complete page on the topic
  • Different redirects under different conditions (301, 302, or 307)
  • Straight 404 in extreme cases
So if you think about that, we're already fighting crawl budget and frequency. We want to get our most valuable pages crawled consistently, and within at least a minimal frequency range normal for the niche. Maybe a volume like 500K pages well-exceeds those parameters?

In this case it did. So the result was massive volumes of 404s and noindexed pages being crawled, and a smaller number of redirects. Whole 'lotta non-200 stuff going on! That meant sporadic rankings and poor crawl budget usage, with lots of budget burned up on non-existent stuff.

So I switched up the game. Chose a better quality set of pages that were more stable. Took a single sitemap of 50K or less pages and submitted those. What I noticed is, it seemed to take a bit of time for Google to "trust" the site again. A few weeks, maybe a little over a month. Though, the crawl rate and frequency definitely started increasing, and overall budget usage was MUCH more consistent. A nice, progressive, upward trend. Hell of a lot better than 50K pages crawled one day....then 100 pages crawled the next because there were 30K 404's the day before. LOL

So I say all that to say, don't rely on the sitemap alone. If we're talking large sites, like 100K+ pages, definitely don't rely on just the sitemap. Even at 10K+ I'd say this still applies, but that's just me. I do believe it's important to reduce click depth on-site as well, and I don't think a sitemap alone is a complete solution for that. I'd suspect that part of the algorithm weighs internal links versus other discovery methods, and is a mix.
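For anyone wanting to do the 50K-per-file splitting programmatically, here's a rough sketch of generating chunked sitemaps plus a sitemap index. File names and the URL source are placeholders; the 50,000-URL cap per file comes from the sitemaps protocol (there's also a 50MB uncompressed size limit this sketch doesn't check).

```python
from datetime import date
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # sitemaps protocol limit per file

def write_sitemaps(urls, base="https://www.example.com"):
    """Split a flat list of URLs into sitemap-N.xml files plus sitemap_index.xml."""
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    today = date.today().isoformat()

    for n, chunk in enumerate(chunks, start=1):
        with open(f"sitemap-{n}.xml", "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")

    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for n in range(1, len(chunks) + 1):
            f.write(f"  <sitemap><loc>{base}/sitemap-{n}.xml</loc>"
                    f"<lastmod>{today}</lastmod></sitemap>\n")
        f.write("</sitemapindex>\n")

# write_sitemaps([f"https://www.example.com/page-{i}/" for i in range(120_000)])  # -> 3 files + index
```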

Internal Linking
  • This is EXTREMELY tough on large sites. You have multiple factors to consider:
    • Click depth
    • Internal anchor text distribution (definitely a thing, and so is anchor over-optimization for internal links)
    • Keyword over-optimization. I suspect, on large enough sites, this even gets down to the URL level. On at least one site, I've seen behavior where duplicated words within the URL seemed to be detrimental. Probably something related to TF*IDF and massive over-use of sets of words.
    • Expanding on the KW angles, I recently responded to a thread about possible over-optimization compounded by partial and exact match domains. Definitely a big factor to watch for on large sites.
  • Also, consider volume of internal nofollow and its implications. For example, some sites might have no choice but to have certain blocked pages still accessible to the user and not bots (SERPs).
  • Now imagine crawlable pages having a significant percentage of nofollowed links to these blocked pages. You might find this on some ecommerce sites that are blocking parameterized URLs, but still using them as links for page content items.
  • Sure, you may be restricting crawling. But we know link juice is still lost from nofollow. So how much juice are those pages losing? In the ecommerce example, one option might be creating a new set of pages that can be crawled, and swap out the nofollowed links for those, so you get rid of a good percentage of those nofollow links across a large set of pages. New landing pages in essence.
  • Extremely large and complex sites absolutely should be using the "URL Parameter" tool within Google Search Console, to specify all their possible params. It can take a bit of setup if you have a ton of params, but it definitely helps. Bing Webmaster also has this feature ("Ignore URL Parameters" under Configure My Site).
 
So that was a lot, and a bit of a jumbled mess. :wink: I'm passionate about the subject, to be sure. Figured I'd do a recap and address your questions directly, so it's a bit easier.

But I'm guessing that Google's spiders aren't real thrilled about even 7 steps.
I would say probably not. On at least a few large sites, I've consistently seen that even 5 steps, for a significant percentage of pages, may be too much. I've tried to push things closer to 3-4 where possible, to help mitigate any possible effect.

The reason for that is our sitemaps act as an entry point, putting every page at most 2 steps away from the first point of crawling. Would you agree with this? If you put your sitemap in your robots.txt, Bing and everyone else should find it.
For a large site, I would disagree that the sitemap alone is sufficient for serving as a signal of a "low click depth" starting point. It can help, but I suspect the bias is still towards internal link click depth. Meaning, maybe you get it all in the sitemap. But if half the site is still 7+ levels deep, that might be enough that they still consider the deep stuff low priority regardless.

I've seen this reflected in my own efforts, across several million URLs' worth of sitemaps at any given time. Based on my traffic logs, I get the feeling there's sort of a threshold or point of no return where submitting past a certain volume of URLs is kind of an exercise in futility.

Do you think there's any benefit to ensuring every single page is interlinked contextually to another one at least one time, or is that just anal retentive thinking? Assuming every single post is optimized for at least one term if not a topic and could stand to bring in even the tiniest bit of traffic per month, is this even worth the bother?
I think there is, but I also think there's at least one other consideration. Namely, reducing page volume by consolidating pages, if it makes sense for the use case. I don't have any memorable data I can pinpoint, since I've been analyzing way too much at this point.

Though, I've analyzed enough in my own logs to suspect that, past a certain point, those few internal links you might build on a real deep and/or orphaned page, might not matter much if at all. So in some cases, reducing the depth, or the volume of pages at that depth, might be more effective.

Also, think about it programmatically, if we're talking about thousands of pages. I'd look at maybe creating a relational table of seed keywords. Maybe come up with some groupings, prioritized in some way. I'm still figuring out effective ways to do this at scale, while still targeting down to the topic-level. The idea is, take some consistent and identifiable element per page, and generate your internal linking with that. Stuff like "related searches", "related products", "Users also searched for", etc.

For example, with PHP, use a seed list of keywords and/or anchor variations combined with URLs. Then create a template for some set of pages, with a preg_replace against the list, and some parameters to randomize it. Like, if those phrases are found, generate a random number of links within a range.

I forget off-hand, but I remember doing this before based on a hash function or some damn thing upon the first generation. I think I was storing the hash in a table or something, maybe along with some other data. Then each additional page load would consistently generate those same exact links + anchors. Idea being, stability. One and done. :wink: That was a few years ago, so I don't remember unfortunately.
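Since I can't remember the exact PHP implementation, here's a loose Python sketch of the stability idea: seed the selection with a hash of the page URL so the same page always gets the same related links and anchors on every load or regeneration. The link pool and anchor wording are illustrative only.

```python
import hashlib
import random

# Illustrative seed list: URL -> possible anchor variations
LINK_POOL = {
    "/best-widget-cleaners/": ["best widget cleaners", "widget cleaner roundup"],
    "/how-to-oil-a-widget/": ["how to oil a widget", "widget oiling guide"],
    "/widget-storage-ideas/": ["widget storage ideas", "storing widgets"],
    "/widget-maintenance-checklist/": ["widget maintenance checklist"],
}

def related_links(page_url: str, min_links: int = 2, max_links: int = 3):
    """Pick a stable set of related links + anchors for a given page.
    Same input URL -> same output, every time (the 'one and done' idea)."""
    seed = int(hashlib.sha256(page_url.encode()).hexdigest(), 16)
    rng = random.Random(seed)  # deterministic per page, no DB needed
    candidates = [u for u in LINK_POOL if u != page_url]
    count = rng.randint(min_links, min(max_links, len(candidates)))
    picks = rng.sample(candidates, count)
    return [(url, rng.choice(LINK_POOL[url])) for url in picks]

print(related_links("/how-to-oil-a-widget/"))  # identical list on every run
```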

Of course we intend to interlink more to the pages that earn the money. Are we harming the page rank flow by linking to every post once, or enhancing it? Assuming that once Google gets a post in its index (we're pretending sitemaps don't exist here), it'll crawl those once in a blue moon. Interlinking should ensure each page is discovered and crawled again and again.
As far as page rank flow, I'm honestly not sure.

With Googlebot, as far as I can tell there are at least 2 forms of it. One that crawls based on priority, high frequency change pages. Another that crawls much less frequently, like monthly.

If you think about the logic behind the rank transition platform, they've made fairly clear that parts of their systems do monitor page history, changes, and prioritize various functions and responses from it. So it stands to reason that, if they've determined a page is low priority, just because they crawl another page that links to it, they probably still prioritize the links on that page against some index of priority.

On at least one of my sites, I have seen some crawl behavior that would seem to support this. Stuff like pages that are crawled frequently, that I know for a fact have followed links to other pages that are crawled more on a monthly basis.... And regardless of the first page being crawled every day, or every few days, the second one is still only getting crawled every few weeks or month.

The possible benefit would be ensuring Google indexes the maximum number of pages possible, which will have sitewide benefits related to the size of the domain and the internal page rank generated. The downside is flowing page rank to less important pages a bit more.
I question whether that's still a thing, how much longer it will be if it is, and/or whether there's a threshold beyond which it's no longer a thing. Let me explain.

I've actually been seeing evidence with several sites, large and small, where consolidating pages, reducing duplicate/thin content, and overall reducing page volume has been a net benefit. In one case, reducing pages on one site by several million. In others, reducing sub 1K page sites by 20-30%.

I think multiple factors are involved with some of the results I've been seeing. In some cases, I think aggregating the traffic and engagement metrics to fewer pages may be part of it. In others, maybe more technical like improvements in TF*IDF based on the consolidation, and maybe that leads to some increased trust, value, priority, or whatever.

Also, what do you suppose Google's crawl depth really is? How many leaps will they take from the starting point?
In the right conditions, whatever the logical extreme is. In one case, with a major sitewide change, TLS migration, and a few other things, it clearly triggered a priority with Googlebot. This resulted in a ~1,600% increase in crawl rate within a short period of time, and sustained for a little while.

Well, that's all well and good.....but it meant they found more internal link deficiencies FASTER, and just drilled down into them like a GBU-57 Massive Ordnance Penetrator. You think duplicate content sucks? Just wait until you see your index grow by several million pages within a few weeks. :wink: This coupled with page logic like keyword + city, or keyword + zipcode, is self-propagating and can quickly grow out of control if the page logic or DB table isn't rigidly controlled.

Under normal crawl behavior, however, I've seen enough to suggest that much beyond 3 clicks is probably not great. 5-7+ is probably a terrible long-term proposition, though I haven't been in every niche, so it might be totally different in others.

And finally, do you know of any spidering software that can crawl while ignoring specified sitewide links (navigation, sidebar, footer, and category links) and then tell you which posts are essentially orphaned?
With extremely large sites, I haven't seen much that I've been truly impressed with. At least not to the degree that I felt fully satisfied by. I've used Deepcrawl, Ryte (formerly Onpage.org), Authoritas (formerly AnalyticsSEO), SEMRush, Screaming Frog, Xenu Link Sleuth, and countless other services or apps.

I'd say, for 100K pages or less, many of those could serve most sites just fine. Stuff like 1M+ sites honestly demand a custom solution. No matter what I try, I always revert to building Scrapy bots. For the example you mentioned, you could combine Scrapy with the Beautiful Soup library and quickly do some cool stuff that gets the job done. In fact, I have a simple example here, in relation to crawling sites to get their content for purposes of determining TF-IDF weighting.

So in the case you mentioned, working with your own site, that could totally work. Use the logic in the example to drop all the structural elements you mentioned. Then BS4 has several ways to grab all the remaining hrefs in the body, or wherever you prefer.

From there, I can't think of the next step off-hand. You'd basically have those URLs thrown in a list, dict, tuple, or whatever, and then put them through other functions, middleware, or something. There might actually be an easier way purely with Scrapy, or a slightly different combination of that example.

Sorry about the lack of specifics. Normally, I tend to use source lists for scraping, to keep things controlled and consistent. Though I know there are some ways with the framework, to just set the entry point and let it go until some predefined "stop" point is reached (depth, path, volume, etc.).
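Still, to make the orphan check a bit more concrete, here's a rough requests + BeautifulSoup sketch of the idea: strip nav/sidebar/footer-type blocks before collecting links, then diff the contextually-linked URLs against your full URL list from the sitemap or CMS. The selectors and the URL source are assumptions you'd adapt to your own theme.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

SITE = "https://www.example.com"
STRUCTURAL_TAGS = ["nav", "header", "footer", "aside"]                 # sitewide chrome to drop
STRUCTURAL_SELECTORS = [".sidebar", ".related-posts", ".breadcrumbs"]  # theme-specific guesses

def contextual_links(url):
    """Return in-content internal links on a page, ignoring sitewide/structural blocks."""
    soup = BeautifulSoup(requests.get(url, timeout=15).text, "html.parser")
    for tag in STRUCTURAL_TAGS:
        for el in soup.find_all(tag):
            el.decompose()
    for sel in STRUCTURAL_SELECTORS:
        for el in soup.select(sel):
            el.decompose()
    links = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(url, a["href"]).split("#")[0]
        if urlparse(href).netloc == urlparse(SITE).netloc:
            links.add(href)
    return links

# all_urls would come from your XML sitemap or a CMS export
all_urls = {SITE + "/post-a/", SITE + "/post-b/", SITE + "/post-c/"}

linked = set()
for url in all_urls:                 # crawl every known page once
    linked |= contextual_links(url)

orphans = all_urls - linked          # pages nothing links to contextually
print("\n".join(sorted(orphans)))
```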
 
I just wouldn't be able to sleep at night, knowing there is actually a single, non-canonical, crawlable page on one of my small sites.

This was your comment concerning noindex/follow sub-pages on categories, which I do, and I've spruced up Page 1 on them. Is this just because you don't like there being pages that aren't completely under your thumb and controlled, or because you'd rather index them? I can see having them indexed creating another point of entry for the crawlers that are referencing the index themselves. But I'm more worried about Panda eventually than I am with creating those entry points.

Though, I want to say page quality + lack of traffic and/or lack of engagement are probably significant factors.

This comment was about keeping sections crawled regularly. I'd guess that Google measures the things you've mentioned, but I'd guess that they also care about impressions in the SERPs. Because the size of the net is exploding way faster than the size of the population. Most pages won't ever see any reasonable amount of traffic, but Google wants them available just in case. It's how they eat. But for sure, I'm working on adding content to these pages I'm concerned with, which should boost traffic and engagement and impressions.

Most people that say "you shouldn't" have likely never dealt with a large site before. It's just a different ballgame.

I'd say it's important on every site unless it's a small 5-10 page business site. These naive webmasters using 500 different tags on 30 posts are creating a serious Panda problem that's really no different than auto-generated pages with boilerplate duplicate content.

On at least one site, I've seen behavior where duplicated words within the URL seemed to be detrimental.

Yeah, @ddasilva was talking about this recently. Another reason to not get a PMD or EMD. But you have to be careful with the names of your categories too. In the site I'm talking about, I have one instance like this that doesn't seem to be hurting the ranking abilities so far. That category has some #1 slots for some competitive info and buyer terms.
 
This was your comment concerning noindex/follow sub-pages on categories, which I do, and I've spruced up Page 1 on them. Is this just because you don't like there being pages that aren't completely under your thumb and controlled, or because you'd rather index them? I can see having them indexed creating another point of entry for the crawlers that are referencing the index themselves. But I'm more worried about Panda eventually than I am with creating those entry points.

To put some context behind it, I basically see category pages in a different manner from how they're often configured on many CMS' by default. Pagination, for example. Why have potentially dozens, hundreds, or more pages that, if they even get any traffic, it's just the odd click and 5 second visit before hitting another page?

There are often some creative ways with UI to display all on a single page. The real trick then becomes, organizing it, making it perform well (lazy load images for example), and providing useful nav, filter, facet functions for people to get to what they need. All those 5 second pageviews across all those other pages might then become minutes on one single page or a small set of similar pages.

I guess what I'm getting at is, I effectively want to turn my category pages into little "apps" unto themselves, instead of fragmenting that traffic and crawl budget across tons of low value pages. Those would then become pages more worth indexing in my opinion.
 
I've built sites that maxed out around 100 pages and had this same concern, because I didn't have that many categories so posts were getting tucked back in the 4th and 5th level of pagination. I had done decent interlinking, but I still wanted to make sure the click depth wasn't too crazy.

So I used a HTML sitemap plugin and linked to it from the footer. This way every post was one click away from the homepage or any other page and the link power was spread around better.

I don't know if it helped or hurt, but that's how I solved it. That wouldn't work for a site with too many pages though. You'd have to create multiple HTML sitemaps and you'd end up creating a trap for spiders.
 
Hi,
I want to ask for suggestions about crawl budget best practices. Have you ever worked on crawl budget? What are the best practices for working on it, how do you measure the progress (what should we look at before and after the work), and how do you read and maintain it?

Any suggestions and advice are much appreciated.
Thank you
 
Hi, Buso.

I have a question about the structure of a website. Let's say I crawled the site and it reported a bunch of URL page levels, say from 1 to 10 folders deep, and each level contains important URLs.

So when it comes to crawl budget and the importance of website structure, the article I read said to keep pages under 4 folder levels deep.

I also read about click path and distance; after reading through it, that seems to be about link clicks, not folders. (I'm confused, actually.)

What should I do, and if I do it, what should I do after that? Thank you, any answer might help.
 
@108mbps & @dimasahmad111, neither of you told us how big your website is. Please do. You may not even need to worry about crawl budgets. 99% of sites, I'd estimate, don't need to worry about it at all. Are either of you experiencing things that make you think you need to deal with this, or is it just some info you came across and are now concerned because it's now in your awareness?

What are the best practices for working on it, how do you measure the progress (what should we look at before and after the work), and how do you read and maintain it?
Search Console tells you how much you're being crawled daily. You can look at that to see if you get any improvements. What you want to do is improve the click depth from any one page to any other page, and you want more page rank (links), which is largely how Google allocates crawl budget. If they detect you're targeting queries that deserve freshness, you may get a higher crawl budget too.

But a high crawl budget does not equal more traffic or whatever. Google allocates it where they deem it necessary. You don't need a high crawl budget, you just need to not waste what you're already getting. You do this by reducing your click depth, and you do that by interlinking and structuring your site better.

So when it comes to crawl budget and the importance of website structure, the article I read said to keep pages under 4 folder levels deep.
If I recall, Googlebot will crawl like 5 leaps (I think, I can't remember the exact number) before they bail and come back later. That's probably where they got the 4-leaps idea from.

I also read about click path and distance; after reading through it, that seems to be about link clicks, not folders. (I'm confused, actually.)
Forget about the folder thing. Whatever you were reading sounds archaic like very old HTML sites without a CMS involved and without sufficient interlinking. Click depth is a better way to think about it, which is simply "starting at a given page, how many clicks does it take to reach the destination page".
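For what it's worth, click depth is easy to measure yourself if you can export your internal link graph (e.g., from a crawl): it's just a breadth-first search from the homepage. A minimal sketch, with a made-up link graph:

```python
from collections import deque

# Made-up internal link graph: page -> pages it links to
links = {
    "/": ["/category-a/", "/category-b/"],
    "/category-a/": ["/post-1/", "/post-2/", "/category-a/page/2/"],
    "/category-a/page/2/": ["/post-3/"],
    "/category-b/": ["/post-4/"],
    "/post-1/": ["/post-4/"],
}

def click_depths(start="/"):
    """Breadth-first search: fewest clicks from `start` to every reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

for page, depth in sorted(click_depths().items(), key=lambda kv: kv[1]):
    print(depth, page)
# /post-3/ comes out at depth 3 (home -> category-a -> page/2 -> post-3)
```

Any page missing from the result is unreachable from the homepage through internal links, i.e. effectively orphaned.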

What should I do, and if I do it, what should I do after that?
This could be an entire novel. You need to ask more specific questions.

To give you a starting point, you should show more posts on your category pages so there's fewer paginated category pages. You should interlink in your site all around (relevantly). You should display categories and sub-categories if needed, in your menus (whether that be main navigation, sidebar, footer, etc.). You should upload a sitemap to Search Console.

Having things like "related posts" lists at the bottom of posts can help too, but I don't recommend having them randomized, personally. I like them to be static. Get rid of broken links and redirected internal links (meaning fix them so they point to the actual destination URL so there's no extra leaps).

Basically, you want to interlink enough that any post can be reached from any page through any links on those pages within X amount of clicks. A big enough site won't be able to do this in 4 clicks or whatever, and that's okay too.

Usually, if you need more crawl budget, you'll be assigned more crawl budget. And you're probably not doing anything so whacky that you're wasting it in such a way that Google is going to harm your rankings over it.
 
I was going to open a new thread with this but I think this question fits within this one. Please correct me if I am wrong!

I am trying to solve the issue of "Orphan pages" - or pages that don't have any internal links pointing to them.

The Situation
Right now, one of my sites has a decent amount of posts, ~250+. That's not a lot of content, but it's not small. Every time I run site audits, I see a bunch of pages showing up as orphan pages, even though I can clearly see them on the 1st page of the category page. I also have related posts (with around 4 posts) showing up for the same category.

The Problem
As the site grows, I see this issue getting bigger and bigger for a few reasons. Having related posts (within the same category) show up within a specific post is very limited. You can only show a specific set of posts, and they have to be sorted by some sort of parameter (date published, etc.). So all the posts within a category essentially end up showing the same related posts.

Some solutions

1) I think @Ryuzaki mentioned this in a thread (not sure which one), where you would manually build the related posts area with other posts that you think are relevant to that post. Basically, creating a different related posts list every time, making sure they are highly relevant to the post at hand. This seems like the best solution so far, but it's very time consuming and I am not sure if it's scalable with more and more content being published. I was thinking of adding it to the writing process but I am still not sure if this would solve the problem entirely.

2) This one is obvious. Find terms within the posts themselves and interlink them that way. This will help solve some of the issues, but I doubt it will solve all the orphan page errors. There will always be some posts that don't have terms on other pages to link from.

3) Create an additional sub-category for when a category grows past a certain # of posts. I am still not sure what that number is, but from what @turbin3 said above, it might be good to start when that number gets to ~50 or so.

4) For category pages. Right now, we are using a "load more" function where the user can keep scrolling and more posts will show up as they scroll. I am not sure if creating paginated pages is better, but I couldn't find a case study that talks about the advantages/disadvantages of this sort of structure.

5) Linking the sitemap from the footer. I am not a fan of this, but it seems to be a good idea to decrease the crawl depth and click depth; however, I have not tested it and I am not sure what impact it would have. I was thinking of just adding the sitemap to the homepage footer and not having it sitewide, but I don't think that will solve the problem.

Any/all feedback is appreciated!
 
@wikibum, it sounds to me like whatever crawler is being used in your audits isn't doing a great job. If you can see a post is in the first, non-scroll required, page of a category and it's saying it's orphaned, then it's wrong.

Here's my comments on your numbered solutions:

1) I still do this and it's not too time consuming for the publishing, if the person doing it is familiar with what content already exists on the site. You can create rules too, like... if there's a total of 5 posts listed, 3 of them go to the 2nd-tier relevance (where the 1st-tier would be interlinked from the main content), and then the final 2 can go to newer posts that are semi-relevant at least. This makes sure you're spreading the love around and shooting Google's spiders into old content and new. I do this part during my interlinking tasks and it's definitely the fastest part, and easy enough that I'm gearing up to hire and train someone to do it. It's not foolproof though, some pages will be missed.

2) This should definitely be done. What I do is, when I'm formatting content from a text file, I then skim through it and add internal links pointing out from these new posts. When a post gets published, part of the workflow is to go into existing older posts and add links back to the new post. So if that's done for every single post, there will never be an orphaned post (in terms of contextual interlinking). Every post will have at least one internal link. If there's no good opportunity in an older post to link to the new one, I go to the most relevant content I have and create that opportunity. No post gets left behind!

3) Creating sub-categories as well as increasing the number of posts shown per paginated page are huge ways of reducing click depth. At a certain point it will always become a problem again, though. You can't keep splintering categories forever, which is what the interlinking does. It really helps reduce click depth so you don't have to rely on categories.

4) I do imagine that the "click to load more" could pose a problem for Google eventually in terms of scrolling and rendering and analyzing the new content that's loaded. They're not going to do all of that forever. I prefer paginated pages, myself, as a user and for technical SEO considerations.

5) I don't think you need to do this. I wouldn't link to an XML sitemap anyways. Maybe an HTML sitemap but isn't that what categories essentially are? And eventually you'll need to paginate the HTML sitemaps. But if you have your sitemap listed in your robots.txt and you've submitted them to Search Console, I wouldn't bother. There's not going to be any way around needing to interlink, which is going to be the best way to solve the issue.
 
it sounds to me like whatever crawler is being used in your audits isn't doing a great job. If you can see a post is in the first, non-scroll required, page of a category and it's saying it's orphaned, then it's wrong.
I figured. It's the Ahrefs crawler but I will verify with another tool.

Thanks for all the replies @Ryuzaki. I think I will start doing option #1, because dynamically changing the posts on each post is not stable and I would rather Google just see the same things on the post if they recrawl it. Might decrease some of the volatility and make it easier for us to optimize in the future.

All that being said - do you add a "Recent posts" lists to the post as well or do you just skip that and focus on adding only relevant category posts like you mentioned above. I am debating whether to keep the "recent posts" (which is essentially the newest posts) to give them a little boost at first and start manually linking to relevant posts in another area within the same post.
 
All that being said - do you add a "Recent posts" lists to the post as well or do you just skip that and focus on adding only relevant category posts like you mentioned above.
No, I don't like anything dynamic on the site, honestly, beyond the homepage and category feeds that change as I publish. Even on those I add a little bit of static content that never changes. No particular reason other than to push juice where I want from the homepage and just to add something of extra value on the category pages for Google, which I only show on the first page of the category.
 