Site Structure For Massive Sites

Taking on a client that has a lot of user-generated content (like 500k-1M posts) consisting of text and images. They also have category pages. Almost all of their traffic currently comes from social media sites and referrals.

There are some easy wins for me here (increasing site speed, restructuring the content formatting so it uses proper HTML tags, etc.)

They have some links already, so I think with some good on-site improvements they should see quick gains, since they get so little SEO traffic given their complete trash structure and on-site SEO.

What I need advice on:

I don't have experience with internal linking and structure for sites with this many pages and would love to get your thoughts.

Here's what I've gathered from my research so far:

Their site is basically broken down into these pieces:
  • content (images + text) -- good long-tail ranking targets (~500k of these)
  • categories (big lists of content, like "baking recipes") -- (~150k of these)
  • profiles (creators) -- (~20k of these)

I've been trying to study Pinterest, which seems to have a similar issue/ratio.

My Thought Process:

Home page links to top content + stuff as many categories as I can (like 50 or 100 in the header or footer) + link to a page with a ton more categories (like 500 of them)

Category pages link to as much content as I can and also to other related categories (like 10-20 of them) as well as profiles

Profiles link to that profile's content

Content pages link to the creator's profile and to related category pages.

I see sites like Pinterest indexing tens of millions of pages. I assume that's mostly from jamming as many links as they can into every page, right? Anything else I should keep an eye out for? Anybody worked on sites like this before?

Otherwise, it'll be a fun game of stuffing in as many links and interlinking as much as I can, haha
 
Categories
You could make Meta Category pages that group those categories into sub-groups. So like, a giant category that contains no posts, only sub-cats (not even really a category; it could just be a page you create that acts like one). The sole purpose of these would be to link out to all of the other categories, and you could stuff them into the header / footer instead of 100 other categories. Crawling and PageRank flow would still be only one hop away, while keeping your navigation cleaner.

Alternatively you could use some kind of mega menu in combination with this tactic so you can still show the most important "sub-categories" to the user on hover or whatever.
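To make it concrete, the header or footer could end up being just a handful of hub links (the URLs here are made up), with each hub page then linking out to its full set of sub-categories:

HTML:
<footer>
  <nav>
    <!-- A few meta category hubs instead of 100 individual category links -->
    <a href="/recipes/">Recipes</a>
    <a href="/crafts/">Crafts</a>
    <a href="/home-decor/">Home Decor</a>
    <a href="/fashion/">Fashion</a>
  </nav>
</footer>
<!-- Each hub page (e.g. /recipes/) then lists every sub-category like /recipes/baking/ -->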

Profiles
I'd really take a solid look at the user profiles and see what's on them and if they provide any unique content of value (doubtful). If they're nothing substantial, I'd noindex them. Should be easy to write an if statement to slap <meta name="robots" content="noindex,follow" /> in the <head>. Sounds like a Panda trap if you let them continue to be indexed. Panda isn't really "on or off" like Penguin. It can work in fractions and you'd never know you were being held back by a lower quality score.

If you do choose to noindex these profiles, I'd go through your templates and add nofollow to all of the links leading to the profiles too. Yeah, it sucks that you'll "waste juice," but if you aren't indexing the pages anyway, who cares. What you will do, and it's important for a huge site like this, is preserve your crawl budget by not continually sending Google through your noindexed section. Alternatively, you could not link to these pages at all and preserve the juice.
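If you do keep the profile links in the templates, the nofollow is just an attribute on each one (path made up):

HTML:
<a href="/profiles/some-user/" rel="nofollow">some-user</a>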

Indexing
I wouldn't aim to index as much as possible. Interlinking helps, but the most important part of managing indexation, in my opinion, is submitting sitemaps. There's a URL limit within each one (50,000 per file), but you can generate as many sitemaps as you need. If you can find an automated way to do this, that'd be perfection, especially if you can make a sitemap index that automatically receives new sub-sitemaps. Then you just submit the index and Google will auto-discover the new sub-sitemaps as they're added.
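A sitemap index is just a small XML file pointing at the sub-sitemaps, something like this (URLs made up):

XML:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/content-1.xml</loc>
    <lastmod>2019-02-16</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/content-2.xml</loc>
    <lastmod>2019-02-16</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/categories-1.xml</loc>
    <lastmod>2019-02-16</lastmod>
  </sitemap>
</sitemapindex>

The generator appends another <sitemap> entry whenever it creates a new sub-sitemap, and Google picks it up from the single index URL you've already submitted.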
 

thanks so much for this! actually didn't think about sub-categories. that would certainly make it a lot easier. reminds me of how Amazon does it. will look at them too

on the profile thing. a lot of people seem to link to their own profiles (and there seem to be some famous people who have profiles, though I haven't identified all of them). but otherwise, yes, the content on profiles is not all that interesting and certainly not substantial. Would no-indexing them waste the link juice passed to them from their links at all? Might be harder technically, but perhaps I could no-index based on certain criteria

will ask if they have a site map already. thanks!!
 
What CMS is this site built on? (Wordpress, etc.)

I'd think about setting up an if statement that has exceptions like...
Code:
<?php
// Only target author archive (profile) pages; the IDs in the array are the
// profiles you do NOT want noindexed.
function ryu_noindex_profiles() {
    if ( is_author() && ! is_author( array( 92, 173, 8274 ) ) ) {
        echo '<meta name="robots" content="noindex,follow" />' . "\n";
    }
}
add_action( 'wp_head', 'ryu_noindex_profiles' );

That's a WordPress example that would go into functions.php. The is_author() check makes sure it only fires on profile (author archive) pages, and the array holds the profile IDs you do NOT want noindexed. So yeah, you could find the ones with "greater than X links" and keep them indexed if you wanted.

But notice the "noindex,follow" which keeps the links on the page dofollow. Even though I recommended putting nofollow on links leading to these profiles, they would still be crawled from external links and the page rank juice would still flow to that profile author's posts. This would be the case for ones you keep in the index and ones you noindex.
 
I have deep thoughts and considerable experience in the realm of large sites like this, but I'll start with a few highlights.

The core areas of technical concern for most large sites:
  • Crawl Rate Optimization
    • Site speed
    • Resource usage
  • Internal Link Structure
    • Click depth
    • Orphaned pages
  • Information Architecture
    • Content coverage & depth
    • Taxonomies & hierarchies
There are others, of course, but I'm just listing some of the more important ones for now.

First off, where I would probably start is an audit of existing traffic and analytics data. Hopefully you have access to the client's traffic logs or Google Analytics data. If so, I'd dump logs over a period of time. Think multiple fiscal quarters. I might even do 12 months, depending on traffic volume.

Also, if they have Google Search Console, look at the old version (Google Webmaster Tools), since the crawl rate report hasn't yet been migrated to the new GSC. The report is rudimentary and only shows overall crawl volume over time, but it still helps for spotting major spikes or declines.

The idea is, you want to see which pages are getting traffic and crawls regularly. You'll want to focus on aligning traffic and crawls on the most valuable pages of the site. That sounds obvious, but it's easier said than done on large sites. Both Google's and Bing's bots will take a large site's crawl paths to their natural extreme. Trust me on that. I have horror stories.

Having access to raw traffic logs can be critical for a large site like this. That way you can identify Google or Bing's crawl behavior and see what page types they're spending their time crawling. In some cases, they might be stuck on low value parts of the site or running into other technical issues (e.g. redirect chains, 404's, etc.).

If you're able to get access to raw traffic logs, let me know if you need help analyzing them, and we can break down some tactics here in the thread. Sometimes it takes a bit of work, but it's usually not too big of a deal to break down the logs into columns and start sorting through things. Ignoring the programming-level solutions for a minute, this is where having a powerful text editor plus some regex skills can come in quite handy!
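To give you an idea of the programming-level route, here's a rough sketch (not production code) that counts Googlebot hits per page type from a combined-format access log. The log file name and the path prefixes are made up, so swap in the client's real file and URL patterns:

Code:
<?php
// Rough sketch: count Googlebot hits per page type from a combined-format access log.
// The log file name and the path prefixes below are made up -- adjust to the real site.
$page_types = array(
    'content'  => '#^/posts/#',
    'category' => '#^/categories/#',
    'profile'  => '#^/profiles/#',
);
$counts = array( 'content' => 0, 'category' => 0, 'profile' => 0, 'other' => 0 );

$handle = fopen( 'access.log', 'r' );
while ( ( $line = fgets( $handle ) ) !== false ) {
    // Only count Googlebot requests.
    if ( stripos( $line, 'Googlebot' ) === false ) {
        continue;
    }
    // Pull the path out of the "GET /path HTTP/1.1" part of the line.
    if ( ! preg_match( '#"(?:GET|HEAD) (\S+) HTTP#', $line, $m ) ) {
        continue;
    }
    $matched = false;
    foreach ( $page_types as $type => $pattern ) {
        if ( preg_match( $pattern, $m[1] ) ) {
            $counts[ $type ]++;
            $matched = true;
            break;
        }
    }
    if ( ! $matched ) {
        $counts['other']++;
    }
}
fclose( $handle );

print_r( $counts );

Run something like that over a week or a month of logs and you'll see pretty quickly whether the bots are spending their time on content and categories or burning crawl budget on profiles.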

It's getting late, so I'll reply more tomorrow (.....or later today LOL). Here's something I'll mention, since it creates a lightbulb moment for many people: there's nothing magical about Googlebot. It's essentially just a bot using a headless Chrome browser to load and render pages.

Now stop and really think about the implications of that last statement. Googlebot is effectively just a web browser loading up tons of pages of your site. Now think about everything that goes into a page load. The browser has to download, parse, and execute all of a page's resources. Now multiply that by how many pages per day, per week, per month.

See what I'm getting at? Most people, with sites of a few hundred or very few thousand pages, don't even have to be concerned much about these things. However, when you start to get to sites of tens or hundreds of thousands of pages or more, that technical debt and resource usage adds up.

Couple this with the fact that Google has crawling bots and indexing bots, and the two aren't the same. There's typically a lag between them. Again, not as big of a deal on a small site. On a large site? It can be the death knell for your site's ranking potential.

For example, imagine there's a 1 week lag between a page crawl versus when the index for that page updates. Now imagine that page only gets crawled once per week... That means if you make an update or enhancement, it might take 1 week before it's even recrawled. Then add an additional week for the indexers to update. I'm making those numbers up, but I'm sure you get the idea.

The point is, if there are lags of days or weeks between when certain types of pages are crawled, it just amplifies the problems. Updates get picked up less frequently and rankings take longer to move.

Identifying issues like that is absolutely critical on sites of this size. It can be the determining factor in whether you make certain drastic changes or not, such as:
  • Eliminating entire page types
  • Cutting page volume considerably
  • Removing entire link types (e.g. profile, date, comment, or other links)

Some of the most extreme examples of the worst types of site structure can be found on FORUM-based sites. This is actually a significant part of the reason so many well-aged forums have been dying out in terms of organic traffic.

Just look at any one forum page on a stone-age vBulletin site and you'll see crawlable, indexable links on practically every damn page component, multiplied by millions of pages. In cases like that, you basically give Googlebot PTSD to the point where it can't handle the abusive relationship with the site. LOL
 
wow this is fantastic. it looks like I do have access to the crawl logs. I think I can figure out how to parse this stuff and identify the crawl patterns, but I'm not positive how to use that information to my advantage.

I suspect the actively re-crawled pages are where I'd want to put links to the pages I want re-crawled, so that Google prioritizes those? and that's probably also where I'd want to de-index some profile pages or thin-content pages to help the crawl budget, as Ryuzaki suggested above?

very much looking forward to your follow up reply. this is gold! thank you so much
 
Right. Ryuzaki's recommendations are well worth considering. Stuff like that can often make a nice improvement in crawling and indexing with a large site.

As far as actively re-crawled pages, yes and no. Depends on whether they're the right types of pages. If it turns out the bots are getting stuck on things like profile pages, you'll probably want to start increasing the things you're blocking with nofollow/noindex, potentially robots.txt, etc.

Another thing that can happen is a large site ends up lacking what I call Intermediate Category Pages. Around this time last year, we had a decent thread on here about site structure for large websites. I'd recommend checking it out for more ideas. These types of pages are great for doing exactly what you mentioned: increasing internal linking and reducing click depth. When large sites lack these types of pages, you'll often find a much greater number of pages with deeper click depth, which almost always performs worse.

In another good thread several months ago, we covered a number of strategies for organizing website taxonomies. Much of it was geared towards WordPress, though the tactics are applicable regardless of the tech a site might use.

As far as traffic logs, if you're comfortable posting this, what kind of volume of "accessed" pages are you seeing over, say, a 7-day period? With large sites, one of the annoying things is dealing with the limits of MS Excel. It's limited to 1,048,576 rows and 16,384 columns. Columns usually aren't the issue, especially for something like traffic logs.

Thankfully, there are some options for GUI-based editors that can handle large text files. For CSVs, for example, I've had great success with Delimit Pro. It's well worth the price. At one point, I think I had something like 13GB of raw traffic log dumps, or backlink profile dumps (I forget) loaded up in the program.... and it worked just fine. LOL

With tons of data, you can quickly reach the point of needing a programming-level or terminal/command-line solution to handle the parsing efficiently.

As for me, personally I tend to reach for Sublime Text most frequently. There are plenty of other great text editors, including some that are far more powerful. What helps, though, is being able to use things like simple CLI flags (say if you're on Linux or OS X, or a custom Windows terminal setup). Or having a text editor, programming language, or other application that can utilize a bit of regex code.

For example, with Sublime, I haven't had much of an issue doing things like loading up a CSV or other text file, and then running regex searches on it. Depends a bit on how complex the regex is.

In other cases, say on a machine with Unix, Linux, OS X, or a custom Windows terminal that has Unix-oriented commands (Cygwin, for example), sometimes you can get by running a grep search from the terminal.
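For example, a pattern along these lines (assuming a standard combined log format with the user agent at the end of each line) matches Googlebot requests and captures the requested path in the second group, whether you paste it into Sublime's regex find or wrap it in single quotes for grep -E:

Code:
"(GET|HEAD) ([^ ]+) HTTP[^"]*".*Googlebot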

Just curious, @pluto, but do you have any experience with a programming language or CLI tools? Let us know what you're working with and maybe we can figure out an easy solution.
 
This is really great, thank you!!! At a skim, it looks like the accessed page count is in the high hundreds of thousands, not quite at a million. I know some basic regex (and know enough to use StackOverflow to figure out the rest) and usually use Sublime Text as my default for parsing and manipulating large amounts of data as well. I wouldn't say I'm a great programmer, but I know some basic Python and JavaScript. Enough to script things.

Those links were great. I think my plan of action for the site structure after reading all of the links and responses above is:

1. De-index thin content that might be wasting crawl budget: As of now, profiles seem to be the biggest culprit. Some of the user-posted stuff might be thin as well. Will look into some filters so that junk-quality pages can be de-indexed.

2. Figure out a way to do categories + subcategories: This is a bit harder because all of their content is created by users. They do have some tags and categories, but outside of the top level of categories, it's hard to build the tree with so many types of tagging mechanisms. I'd probably need to map out a massive hierarchy tree of all the existing categories/tags on the site. I suppose I could also just implement a simpler tree for them and have them push those categories on their users, or fold some of the existing tags into categories I set.

3. Add interlinking between the pages: Put as many categories and sub-categories as possible on the home page. Have sub-categories link to parent categories, child categories, and other related categories. Fill sub-category pages with content and links to the user-generated pages.

4. Make a sitemap for them: It looks like the one they have isn't programmatically updated and is out of date.
 
@turbin3 @Ryuzaki quick follow-up question that's pretty basic.

studying their inbound links quite a bit more, I'm noticing that almost half of their links point to profile pages, which makes sense because a lot of people link to their own profiles. Ryuzaki's suggestion of hard-coding certain profiles as indexed while no-indexing everything else would be a bit hard, since it looks like there are 10k+ profiles with links pointing at them, and I'm sure plenty more every day.

While the profiles themselves are still pretty thin content and won't lead to too much relevant traffic, I do want to take advantage of the link value.

Is there anything I can do to maximize all of these links?

will de-indexing the pages throw away the value of those inbound links?
 
@pluto, I don't think we have a definitive answer to that question. You can have pages set up as "noindex,nofollow", or you can have just "noindex", which effectively means "noindex,follow" since follow is the default and there's no real follow/dofollow directive.

I'd imagine that Google has an index and link graph that include everything, even pages that aren't allowed to be shown in search results. Otherwise they wouldn't be able to maintain an incredibly accurate link graph.

I watched a talk by a lady who became Moz's head of SEO, or some position like that. Her first move as the new boss was to noindex something like 30% of their indexed pages (it may have been way more, I can't remember exactly). Nearly all of that was user profiles.

The thing is, they have access to their own backlink database, so I'm sure they were able to tie in and do a check to automate adding a noindex directive if... # of links > X. Otherwise you're right, it's going to be a manual process of checking which should remain indexed.
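You could get most of the way there without a backlink database of your own, though. Sticking with the WordPress example from earlier, and assuming you can boil a backlink-tool export down to a flat file of profile IDs that actually have links pointing at them (the file name and one-ID-per-line format here are made up), a sketch would look something like this:

Code:
<?php
// Sketch: noindex author profiles unless their ID shows up in an export of
// profiles that actually have external links pointing at them.
// "linked-profiles.csv" (one author ID per line) is a made-up example file.
function ryu_noindex_unlinked_profiles() {
    if ( ! is_author() ) {
        return;
    }
    $lines = @file( __DIR__ . '/linked-profiles.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
    if ( false === $lines ) {
        return; // No export available; leave the page indexable for now.
    }
    $linked_ids = array_map( 'intval', $lines );
    if ( ! in_array( get_queried_object_id(), $linked_ids, true ) ) {
        echo '<meta name="robots" content="noindex,follow" />' . "\n";
    }
}
add_action( 'wp_head', 'ryu_noindex_unlinked_profiles' );

In production you'd want to cache that lookup instead of reading the file on every page load, but it shows the shape of it.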

What I can tell you is that Moz had a huge success with it. The reality is that almost all of those profiles offered no value, and that's probably the case for you too.

The real question is "does the benefit of having links from X number of extra domains outweigh the risk of having some form of algorithmic penalty applied by Panda?" And I'd say no. Especially since I'm pretty sure Panda is being applied like a fractional Quality Score multiplier and not an on/off penalty. Which would explain why when Moz dropped 1000's of pages out of the index (and lost the links) they still saw a boost in organic traffic.
 
The question I'd have, if it was my site, is how valuable those external links to profile pages really are. I mean, I'm sure we're not exactly talking about followed homepage links pointing at those profile pages. I'd guess they're largely low value, but that's just a guess.

Judging by everything I've heard so far, the area I might be most concerned about is crawl volume and frequency for your more important pages. With a ~1M post/page site, if it was me, I'd want to see no less than 100K pages crawled per day. Ideally, I'd want at least several hundred thousand so that updated content gets found at least every few days.

Even in a perfect world, if there was no overlap in crawled pages, 100K/day would mean it would take 10 days for them to crawl the whole site. And there's always overlap, so the time-frame will be longer.

In extreme cases, where I've seriously been trying to remove all barriers to crawl rate, I'll go as far as removing entire link types throughout a site. Think of blog post lists. You usually have:
  • Image
  • Title
  • Description
  • Author
  • Date
  • Category / Tags
Not all post lists will have all of those things. A lot of blogs and website templates, by default, also have links on the Image and Title, both linking to the same page. They'll also probably have Author profile links, maybe even date links if you happen to have date-based archives (probably get rid of those!).

So just think about that. On a category page or general post list page, with a single post, you might have 3-5+ links. Some duplicate, some to different page types. Like I mentioned above, typical forum software is literally the worst in the world as far as internal link proliferation goes.

So what would I do in a case like that example? What I often do on category and list pages is remove all links in a post "block" or card except for 1 actual link to the post. For some sites, if it's a regular list/blog feed page, I might leave the category links if I feel it's helpful. Here's an example of what I'm talking about:

The Before Version:
HTML:
<article>
  <a href="/post-title/">
    <img src="image.png" alt="image description" />
  </a>
  <a href="/post-title/">
    <h2>Post Title</h2>
  </a>
  <p>Post excerpt.</p>
  <span>
    by <a href="/author/billy-mays/">Billy Mays</a>
  </span>
  <time>
    <a href="/1999/01/">Jan 1, 1999</a>
  </time>
  <span>
    <a href="/category-1/">Category 1</a>
  </span>
</article>

The After Version:
HTML:
<a href="/post-title/">
  <article>
    <img src="image.png" alt="image description" />
    <h2>Post Title</h2>
    <p>Post excerpt.</p>
    <span>by Billy Mays</span>
    <time>Jan 1, 1999</time>
    <span>Category 1</span>
  </article>
</a>

With HTML5, it's totally within spec to have all those HTML tag types within an <a> tag. Maybe that won't work for some blog designs, but for stuff like a card grid, now you have a single CTA and single link on a post. They just click anywhere on the card, so probably even better UX too.

Now I'm not saying you should absolutely do this. It's easy to go too extreme in that direction as well, where you end up reducing internal linking a bit too much and end up with orphaned pages and actually increasing click depth for many pages. The point I'm getting at is, sometimes removing just a few link types within a page, at scale, can help a lot.
 