Onsite Analysis Of Links In Body

Am I missing something? I've taken a look through:
  • Screaming Frog (not the latest version, admittedly)
  • DeepCrawl (only the trial)
  • SEO PowerSuite's offering
  • Sitebulb (really like it)

None of these seem to have the functionality to ignore nav links, sidebar links, and links in the footer.

I'm trying to implement competitor analysis of how sites are cross-linking content for various purposes, including on-site anchor text optimization.

If there's nothing fit for purpose, has anyone rolled their own with Apache Nutch or Python's Scrapy?


It seems like such an obvious task that I reckon I must have missed something.
 
In WordPress it is extremely easy to figure out where the content/article starts, since it is wrapped in a div with a class like .content or .article. So if all websites were 100% WordPress, it would be simple.

However, only around 20% of sites are WordPress.

You, as a human with eyes, can distinguish where the navs, sidebars, and footers are, and some sites do use semantic HTML like <section>, <header>, <aside>, and <footer>. But you can't guarantee that every single website you come across uses the same underlying technique for marking up each section.

That is the problem: every website is coded differently, by different coders with different skills, speaking different languages and having different objectives.

Every time you come across a new site, you'll need to customize the XPath to exclude these scenarios, because these tools are simply reading the source code and don't have human eyes and brains to distinguish the sections.
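Something like the following is roughly what that XPath customization looks like in practice (a sketch with lxml in Python; the tag and class names are common conventions, not something every site will use):

# Rough sketch: keep only links that are NOT inside obvious nav/header/footer/sidebar wrappers.
# The class names below are guesses at common conventions, not a universal rule.
import requests
from lxml import html

page = requests.get("https://example.com/some-article/")
tree = html.fromstring(page.content)

body_links = tree.xpath(
    "//a[@href]"
    "[not(ancestor::nav or ancestor::header or ancestor::footer or ancestor::aside)]"
    "[not(ancestor::*[contains(@class, 'sidebar') or contains(@class, 'menu') or contains(@class, 'footer')])]"
    "/@href"
)
print(body_links)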
 
You can get some reasonable results (rough sketch below the list) by tagging (and excluding) links that:

  • appear in semantic HTML elements (header, footer, etc.)
  • appear in elements with certain common classes (.header, .menu, etc.)
  • appear on more than a certain % of pages
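A minimal sketch of the first two checks with BeautifulSoup, assuming the usual tag and class conventions (the page-percentage check just needs a counter over all crawled pages):

# Sketch of the first two heuristics: semantic wrappers and commonly named boilerplate classes.
# The class names are assumptions based on common conventions, not a guarantee for any given site.
from bs4 import BeautifulSoup

BOILERPLATE_TAGS = {"nav", "header", "footer", "aside"}
BOILERPLATE_CLASSES = {"header", "menu", "footer", "sidebar", "nav"}

def is_boilerplate_link(a_tag):
    """True if the <a> sits inside a semantic wrapper or an element with a boilerplate-looking class."""
    for parent in a_tag.parents:
        if parent.name in BOILERPLATE_TAGS:
            return True
        if set(parent.get("class", [])) & BOILERPLATE_CLASSES:
            return True
    return False

def body_links(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if not is_boilerplate_link(a)]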

I have a bunch of custom code for this type of thing that I have largely abandoned in favour of SurferSEO and Getting Shit Done.
However, with the majority of our hospitality and luxury travel clients closing or going into stasis, I suddenly find myself with more time on my hands, so I will probably be working on something interesting.
 
Integrity for Mac can do what you're asking. When you crawl a site, you can populate a list of links to ignore. You have to do it manually though. So when I crawl my own sites, I ignore certain pages, links out to social networks or social sharing buttons, RSS feed, etc.

Alternatively, you should be able to visually identify any "boilerplate" links with any spider in the list after the scrape. If you sort by the number of pages they appear on, it should be very obvious which are which. Some might appear hundreds or thousands of times while others only a dozen. There will be some ratio at which a link is obviously sitewide to some degree.
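If you'd rather do that sort outside the crawler's UI, a quick-and-dirty version might look like this (assuming you've exported the crawl as (page_url, link_href) pairs; the function and variable names are made up):

# Spot sitewide/boilerplate links from a crawl export:
# count how many distinct pages each link target appears on, then sort descending.
from collections import defaultdict

def sitewide_report(crawl, total_pages):
    """crawl is an iterable of (page_url, link_href) tuples from your crawler export."""
    pages_per_link = defaultdict(set)
    for page_url, link_href in crawl:
        pages_per_link[link_href].add(page_url)

    ranked = sorted(pages_per_link.items(), key=lambda kv: len(kv[1]), reverse=True)
    for href, pages in ranked[:50]:
        share = len(pages) / total_pages
        print(f"{share:6.1%}  {len(pages):>6}  {href}")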
 
Thanks for the replies, all. I tried several different things this morning, including excluding boilerplate pages in Screaming Frog, and messed around with Custom > Extraction, which is a nice scraping feature I wasn't aware of. Went down a bit of a rabbit hole looking for something suitable. No joy.
 
If it's your own website, I'd just add a data-something attribute to internal links and filter it that way.

If you have a decently built theme, likely your content body will be inside of an <article> tag, and your navigation menus will be inside <nav> tags.
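For example, something along these lines (a sketch that assumes the theme really does use those tags; the data attribute name in the second function is made up):

# For a well-structured theme: keep links inside <article>, drop anything inside <nav>.
from bs4 import BeautifulSoup

def article_links(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    links = []
    for article in soup.find_all("article"):
        for a in article.find_all("a", href=True):
            if a.find_parent("nav") is None:  # skip nested menus inside the article, just in case
                links.append(a["href"])
    return links

def tagged_links(raw_html):
    # If you control the markup, tagging body links with a data attribute
    # ("data-body-link" here is a made-up name) makes filtering trivial.
    soup = BeautifulSoup(raw_html, "html.parser")
    return [a["href"] for a in soup.find_all("a", attrs={"data-body-link": True, "href": True})]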

If it's someone else's website it gets much trickier, because you have absolutely no control over all the crazy ways people code their websites.

Probably a good way to do it if you're not going to parse a ton of websites is to find a library that can extract the main content, and then only parse that - but you will still get a lot of false positives. Not sure if it's worth spending too much time on?
 
If you are familiar with Python, look at readability (https://github.com/buriy/python-readability). It will extract only the contextual text from the page, and you can parse it however you see fit.
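A quick usage sketch (Document(), summary() and title() are that library's documented API; the link extraction on top is just BeautifulSoup):

# Strip the page down to its main content with python-readability,
# then pull links from only that cleaned-up fragment.
import requests
from readability import Document
from bs4 import BeautifulSoup

raw_html = requests.get("https://example.com/some-article/").text
doc = Document(raw_html)

content_html = doc.summary()  # cleaned HTML of the main content block
soup = BeautifulSoup(content_html, "html.parser")
content_links = [a["href"] for a in soup.find_all("a", href=True)]

print(doc.title())
print(content_links)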

It works in 99% of cases, but for some it does not, as it's based on common content CSS wrappers.

I implemented what you are trying to achieve some time ago. I can tell you that it's time wasted; there's a lot of noise and there are a lot of variables. I recently tried to replicate the same internal structure (content, keywords, links...) on small niches, and it does not justify the work. Some sites improve while some don't; it's a "random" factor that you cannot anticipate.

I think the concept of blessed sites discussed on the forum applies not only to the beginning of a website but throughout its entire "life".
 
One of the things I was interested in looking at was cross-page linking to #ids, which further complicates the matter. Like @ianovici mentioned, after looking at the work needed to implement this and then present the data in some way that would be useful, it probably won't be worth it. I'll take a look at what open-source projects are out there to see if there's a shortcut. @BCN, I've put that module on my list of things to look into. Thanks.
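For what it's worth, splitting the #id off a link href is the easy bit with the standard library; deciding what to do with the data is the real work:

# Split the fragment (#id) off a link target so cross-page #id links can be grouped by page.
from urllib.parse import urldefrag

url, fragment = urldefrag("https://example.com/guide/#pricing")
print(url)       # https://example.com/guide/
print(fragment)  # pricing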
 