Site Structure For Massive Sites

pluto
Taking on a client that has a lot of user-generated content (like 500k-1M posts) consisting of text and images. They also have category pages. Almost all of their traffic currently comes from social media sites and referrals.

There are some easy wins for me here (increasing site speed, restructuring the content formatting so it uses proper HTML tags, etc.).

They have some links already, so I think with some good on-site improvements they should see quick gains, since they get so little SEO traffic given their complete trash structure and on-site.

What I need advice on:

I don't have experience with internal linking and structure for sites with this many pages and would love to get your thoughts.

Here's what I've gathered from my research so far:

Their site is basically broken down into these pieces:
  • content (images + text) -- good long-tail ranking targets (~500k of these)
  • categories (big lists of content, like "baking recipes") -- (~150k of these)
  • profiles (creators) -- (~20k of these)
I've been trying to study Pinterest, which seems to have a similar issue/ratio.

My Thought Process:

Home page links to top content + as many categories as I can stuff in (like 50 or 100 in the header or footer) + a link to a page with a ton more categories (like 500 of them)

Category pages link to as much content as I can and also to other related categories (like 10-20 of them) as well as profiles

Profiles link to that profile's content

Content links to profile and more similar Category pages.

I see sites like Pinterest indexing tens of millions of pages. I assume it's by just jamming as many links as they can into pages, right? Anything else I should keep an eye out for? Has anybody worked on sites like this before?

Otherwise, it'll be a fun game of stuffing as many links and interlinking as much as I can haha
 

Ryuzaki

Staff member
BuSo Pro
Digital Strategist
Categories
You could make Meta Category pages that group those categories into sub-groups. So like, a giant category that contains no posts but only sub-cats (not even really a category, but could just be a page you create to act as if it is one). The sole purpose of these would be to link out to all of the other categories, and you could stuff them into the header / footer instead of 100 other categories. Crawling and page rank flow would only be one hop away, but your navigation stays cleaner.

Alternatively you could use some kind of mega menu in combination with this tactic so you can still show the most important "sub-categories" to the user on hover or whatever.

Profiles
I'd really take a solid look at the user profiles and see what's on them and if they provide any unique content of value (doubtful). If they're nothing substantial, I'd noindex them. Should be easy to write an if statement to slap <meta name="robots" content="noindex,follow" /> in the <head>. Sounds like a Panda trap if you let them continue to be indexed. Panda isn't really "on or off" like Penguin. It can work in fractions and you'd never know you were being held back by a lower quality score.

If you do choose to noindex these profiles, I'd go through your templates and add nofollow to all of the links leading to the profiles too. Yeah it sucks that you'll "waste juice" but if you aren't indexing the pages anyways then who cares. What you will do, and it's important for a huge site like this, is preserve your crawl budget and not keep sending Google through your noindexed section. Alternatively, you could not link to these pages at all and preserve the juice.

Indexing
I wouldn't want to index as much as possible. Interlinking helps, but the most important part of managing indexation in my opinion is submitting sitemaps. There's a limit of 50,000 URLs within each one, but you can generate as many as you need. If you can find an automated way to do this, that'd be perfection, especially if you can make a sitemap index that automatically receives new sub-sitemaps. Then you just submit the index and Google will auto-discover the new sub-sitemaps as they're added.
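To make that concrete, here's a rough sketch of the kind of generation script I mean, assuming you can dump every indexable URL into a flat file. The domain, file names, and output path are placeholders, and the language obviously doesn't matter:
Code:
# Rough sketch: split a big URL dump into 50k-URL sitemap files plus one
# sitemap index. Assumes urls.txt holds every indexable URL, one per line;
# the domain, file names, and output path are placeholders.
from pathlib import Path
from xml.sax.saxutils import escape

SITE = "https://example.com"   # placeholder domain
OUT = Path("sitemaps")
OUT.mkdir(exist_ok=True)
LIMIT = 50_000                 # protocol limit per sitemap file

urls = [u.strip() for u in open("urls.txt") if u.strip()]

index_entries = []
for i in range(0, len(urls), LIMIT):
    name = f"sitemap-{i // LIMIT + 1}.xml"
    with open(OUT / name, "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for u in urls[i:i + LIMIT]:
            f.write(f"  <url><loc>{escape(u)}</loc></url>\n")
        f.write("</urlset>\n")
    index_entries.append(f"{SITE}/sitemaps/{name}")

# The index is the only thing you submit; new sub-sitemaps appended here
# get auto-discovered on the next fetch.
with open(OUT / "sitemap-index.xml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for loc in index_entries:
        f.write(f"  <sitemap><loc>{escape(loc)}</loc></sitemap>\n")
    f.write("</sitemapindex>\n")
Run something like that on a schedule after new content is published and Google picks up the new sub-sitemaps from the index on its own.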
 
pluto
thanks so much for this! actually didn't think about sub-categories. that would certainly make it a lot easier. reminds me of how Amazon does it. will look at them too

on the profile thing. a lot of people seem to link to their own profiles (and there seem to be some famous people who have profiles, though I haven't identified all of them). but otherwise, yes, the content on profiles is not all that interesting and certainly not substantial. Would no-indexing them waste the link juice passed to them from those links at all? Might be harder to do technically, but perhaps I could no-index based on certain criteria

will ask if they have a sitemap already. thanks!!
 

Ryuzaki

Staff member
BuSo Pro
Digital Strategist
What CMS is this site built on? (WordPress, etc.)

I'd think about setting up an if statement that has exceptions like...
Code:
<?php
    // Noindex author/profile archives, except for the IDs listed in the array.
    function ryu_noindex_profiles() {
        // Only target author pages; the array holds the profile IDs you want to keep indexed.
        if ( is_author() && ! is_author( array( 92, 173, 8274 ) ) ) {
            echo '<meta name="robots" content="noindex,follow" />' . "\n";
        }
    }
    add_action( 'wp_head', 'ryu_noindex_profiles' );
?>
That's a WordPress example that would go into functions.php. The array holds the profile IDs that you do NOT want noindexed. So yeah, you could find the ones with "greater than X links" and keep them indexed if you wanted.

But notice the "noindex,follow" which keeps the links on the page dofollow. Even though I recommended putting nofollow on links leading to these profiles, they would still be crawled from external links and the page rank juice would still flow to that profile author's posts. This would be the case for ones you keep in the index and ones you noindex.
 

turbin3

BuSo Pro
I have deep thoughts and considerable experience in the realm of large sites like this, but I'll start with a few highlights.

The core areas of technical concern for most large sites:
  • Crawl Rate Optimization
    • Site speed
    • Resource usage
  • Internal Link Structure
    • Click depth
    • Orphaned pages
  • Information Architecture
    • Content coverage & depth
    • Taxonomies & hierarchies
There are others, of course, but I'm just listing some of the more important ones for now.

First off, where I would probably start is an audit of existing traffic and analytics data. Hopefully you have access to the client's traffic logs or Google Analytics data. If so, I'd dump logs over a period of time. Think multiple fiscal quarters. I might even do 12 months, depending on traffic volume.

Also, if they have Google Search Console, look at the old version (Google Webmaster Tools), since the crawl rate report hasn't yet been migrated to GSC. This report is rudimentary and only shows overall crawl volume over time. Still, it helps you see major spikes or declines.

The idea is, you want to see what pages are getting traffic and crawls regularly. You'll want to focus on aligning traffic and crawls on the most valuable pages of the site. That sounds obvious, but it's easier said than done on large sites. Both Google and Bing bots will take large-site crawl paths to their natural extreme. Trust me on that. I have horror stories.

Having access to raw traffic logs can be critical for a large site like this. That way you can identify Google or Bing's crawl behavior and see what page types they're spending their time crawling. In some cases, they might be stuck on low value parts of the site or running into other technical issues (e.g. redirect chains, 404's, etc.).

If you're able to get access to raw traffic logs, let me know if you need help analyzing them, and we can break down some tactics here in the thread. Sometimes it takes a bit of work, but it's usually not too big of a deal to break down the logs into columns and start sorting through things. Ignoring the programming-level solutions for a minute, this is where having a powerful text editor plus some regex skills can come in quite handy!
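To show what that looks like at the programming level, here's a rough sketch, assuming a standard combined log format with the request and user agent quoted; the file name, the user agent check, and the path bucketing are all things you'd adjust to the client's actual logs:
Code:
# Rough sketch of Googlebot crawl analysis from raw access logs. Assumes a
# standard combined log format (request and user agent in quotes); the file
# name, UA check, and bucketing are things to adjust to the real logs.
import re
from collections import Counter

LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*"')

def page_type(path):
    # Bucket by first path segment, e.g. /recipes/..., /profile/..., /tag/...
    segment = path.lstrip("/").split("/", 1)[0].split("?", 1)[0]
    return "/" + segment if segment else "/"

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Googlebot" not in line:   # crude UA filter; verify via reverse DNS if it matters
            continue
        m = LINE_RE.search(line)
        if m:
            counts[page_type(m.group("path"))] += 1

# Where is Googlebot actually spending its crawl budget?
for bucket, hits in counts.most_common(25):
    print(f"{hits:>8}  {bucket}")
The same idea works for spotting redirect chains and 404s; you'd just pull the status code field instead and break it down by page type.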

It's getting late, so I'll reply more tomorrow (.....or later today LOL). Here's something I'll mention, as it creates a lightbulb moment for many people when they realize this. There's nothing magical about Googlebot. It's actually just a bot using a headless Chrome browser to load and render pages.

Now stop and really think about the implications of that last statement. Googlebot is effectively just a web browser loading up tons of pages of your site. Now think about everything that goes into a page load. The browser has to download, parse, and execute all of a page's resources. Now multiply that by how many pages per day, per week, per month.

See what I'm getting at? Most people, with sites of a few hundred or very few thousand pages, don't even have to be concerned much about these things. However, when you start to get to sites of tens or hundreds of thousands of pages or more, that technical debt and resource usage adds up.

Couple this with the fact that Google has crawling bots and indexing bots, and the two aren't the same. There's typically a lag between them. Again, not as big of a deal on a small site. On a large site? It can sound the death knell for your site's ranking potential.

For example, imagine there's a 1-week lag between when a page is crawled and when the index for that page updates. Now imagine that page only gets crawled once per week... That means if you make an update or enhancement, it might take 1 week before it's even recrawled. Then add an additional week for the indexers to update. I'm making those numbers up, but I'm sure you get the idea.

The point is, if there's lags of days or weeks from when certain types of pages are crawled, it just amplifies the problems. Updates are seen less frequently and ranks take longer.

Identifying issues like that is absolutely critical on sites of this size. It can be the determining factor on whether you make certain drastic changes or not, such as:
  • Eliminating entire page types
  • Cutting page volume considerably
  • Removing entire link types (e.g. profile, date, comment, or other links)

Some of the most extreme examples of the worst types of site structure can be found in FORUM based sites. This is actually a significant part of the reason so many well-aged forums have been dying out in terms of organic.

Just look at any one forum page on a stone-age vBulletin site, and you'll see crawlable, indexable links on practically every damn page component, multiplied by millions of pages. In cases like that, you basically give Googlebot PTSD to the point where it can't handle the abusive relationship with such a site. LOL
 
pluto
wow this is fantastic. it looks like I do have access to the crawl logs. I think I can figure out how to parse this stuff and identify the crawl patterns, but I'm not positive how to use that information to my advantage.

I suspect the actively re-crawled pages are where I'd want to put links to the pages I want re-crawled, so that Google prioritizes those? and that's probably also where I'd want to de-index some profile pages or thin-content pages to help the crawl budget, as Ryuzaki suggested above?

very much looking forward to your follow up reply. this is gold! thank you so much
 

turbin3

BuSo Pro
Right. Ryuzaki's recommendations are well worth considering. Stuff like that can often make a nice improvement in crawling and indexing with a large site.

As far as actively re-crawled pages, yes and no. Depends on whether they're the right types of pages. If it turns out the bots are getting stuck on things like profile pages, you'll probably want to start increasing the things you're blocking with nofollow/noindex, potentially robots.txt, etc.

Another thing that can happen is, a large site could end up lacking in what I call Intermediate Category Pages. Around this time last year, we had a decent thread on here about site structure with large websites. I'd recommend checking it out for more ideas. These types of pages are great for doing exactly what you mentioned: increasing internal linking and reducing click depth. When large sites lack these types of pages, you'll often find a much greater number of pages with deeper click depth, which almost always performs worse.
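If you want to put actual numbers on click depth and orphaned pages, a crawl export of internal links is enough. Here's a rough sketch, assuming a simple two-column CSV of internal links (source URL, target URL, no header row); the file name and start URL are placeholders:
Code:
# Rough sketch: click depth and orphan check from a crawl export. Assumes a
# two-column CSV of internal links (source_url,target_url, no header); the
# file name and start URL are placeholders.
import csv
from collections import defaultdict, deque

HOME = "https://example.com/"   # placeholder homepage URL

links = defaultdict(set)
all_pages = set()
with open("internal_links.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) < 2:
            continue
        source, target = row[0], row[1]
        links[source].add(target)
        all_pages.update((source, target))

# Breadth-first search from the homepage = shortest click path to each page.
depth = {HOME: 0}
queue = deque([HOME])
while queue:
    page = queue.popleft()
    for nxt in links[page]:
        if nxt not in depth:
            depth[nxt] = depth[page] + 1
            queue.append(nxt)

histogram = defaultdict(int)
for d in depth.values():
    histogram[d] += 1
for d in sorted(histogram):
    print(f"depth {d}: {histogram[d]} pages")

orphans = all_pages - depth.keys()   # never reached from the homepage
print(f"unreachable from home: {len(orphans)} pages")
If a site is missing those intermediate pages, it shows up immediately in that histogram as a long tail of pages sitting at depth 5, 6, 7+.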

In another good thread several months ago, we covered a number of strategies for organizing website taxonomies. Much of it was geared towards WordPress, though the tactics are applicable regardless of the tech a site might use.

As far as traffic logs, if you're comfortable posting this, what kind of volume in "accessed" pages can you see over, say, a 7-day period? With large sites, one of the annoying things is dealing with the limits of MS Excel. It's limited to 1,048,576 rows and 16,384 columns. Columns usually aren't the issue, especially for something like traffic logs.

Thankfully, there are some options for GUI-based editors that can handle large text files. For CSVs, for example, I've had great success with Delimit Pro. It's well worth the price. At one point, I think I had something like 13GB of raw traffic log dumps, or backlink profile dumps (I forget) loaded up in the program.... and it worked just fine. LOL

For tons of data, you can quickly reach the level of probably needing a programming-level, or terminal/command line level solution to handle parsing efficiently.

Personally, I tend to reach for Sublime Text most frequently. There are plenty of other great text editors, including some that are far more powerful. What helps, though, is being able to use things like simple CLI flags (say if you're on Linux or OS X, or a custom Windows terminal setup), or having a text editor, programming language, or other application that can utilize a bit of regex code.

For example, with Sublime, I haven't had much of an issue doing things like loading up a CSV or other text file, and then running regex searches on it. Depends a bit on how complex the regex is.

In other cases, say on a machine with Unix, Linux, OS X, or a custom Windows terminal that has Unix-oriented commands (Cygwin, for example), sometimes you can get by running a grep search from the terminal.

Just curious, @pluto, but do you have any experience with a programming language or CLI tools? Let us know what you're working with and maybe we can figure out an easy solution.
 
pluto
This is really great, thank you!!! At a skim, it looks like the accessed pages are in the high hundreds of thousands, not quite a million. I know some basic regex (and know enough to use StackOverflow to figure out the rest) and usually use Sublime Text as my default for parsing and manipulating large amounts of data as well. I wouldn't say I'm a great programmer, but I know some basic Python and Javascript. Enough to script things.

Those links were great. I think my plan of action for the site structure after reading all of the links and responses above is:

1. De-index thin content that might be wasting crawl budget: As of now, profiles seem to be the biggest culprit. Some of the user-posted stuff might be thin as well. Will look into some filters so that junk-quality pages can be de-indexed.

2. Figure out a way to do categories + subcategories: This is a bit harder because all of their content is created by users. They do have some tags and categories, but beyond the top level of categories it's harder to build the tree with so many different tagging mechanisms. I'd probably need to make some sort of massive hierarchy tree of all the existing categories/tags on the site. I suppose I could also just implement a simpler tree for them and have them push those categories on their users, or map some of the existing tags into categories I set (rough sketch of what I mean after this list).

3. Add interlinking between the pages: Put as many categories and sub-categories as I reasonably can on the home page. Have sub-categories link to parent categories, child categories, and other related categories. Stuff sub-category pages with content and links to the user-generated pages.

4. Make a sitemap for them: It looks like the one they have on file isn't programmatically updated and is out of date.
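for #2, here's a rough sketch of the kind of thing I'm picturing for collapsing their tag soup into a tree I define myself -- the parent buckets, keywords, and input file are all made up:
Code:
# Rough sketch for mapping existing user tags onto a smaller, hand-defined
# category tree. The parent buckets, keywords, and tags.csv (tag,post_count)
# input are all made up for illustration.
import csv
from collections import defaultdict

PARENTS = {
    "baking":   ["bread", "cake", "cookie", "pastry"],
    "dinner":   ["pasta", "chicken", "curry", "stew"],
    "desserts": ["chocolate", "ice cream", "pudding"],
}

tree = defaultdict(list)   # parent category -> [(tag, post_count), ...]
unmapped = []

with open("tags.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) < 2:
            continue
        tag, count = row[0], int(row[1])
        lowered = tag.lower()
        parent = next((p for p, words in PARENTS.items()
                       if any(w in lowered for w in words)), None)
        if parent:
            tree[parent].append((tag, count))
        else:
            unmapped.append((tag, count))

for parent, tags in tree.items():
    tags.sort(key=lambda t: -t[1])
    print(parent, "->", [t for t, _ in tags[:10]])
print(f"{len(unmapped)} tags still need a home")
whatever lands in the unmapped pile tells me how much manual curation is left and whether it's worth pushing a fixed category list onto their users.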