X Marks The Spot - Using XPath to Scrape Your Way to Success

turbin3
I apologize in advance if this is far too basic for some of you. The other day I was remembering when I first learned the beauty of XPath and being able to use any number of programs to scrape a specific item or set of items from a specific page. I figured I'd put this up to encourage people who may be a bit more novice to expand their horizons.

What is it?

So what is XPath? It is exactly what the thread title says. Think of it as a "pathway" through a file, which leads directly to a specific element within that file. This could be HTML, XML, or any number of other structured file formats. It is a method of calling out specific elements, attributes, and other items to effectively give a program directions to the specific piece of data you're interested in. In certain cases, you can jump right to the "X", skipping most of the "pathway". In other cases, for specific uses and specific pages, it may be necessary to create a detailed XPath that leads the bot from start to finish through the document. The more unique and definable something is, the more easily you'll be able to go directly to it, as opposed to having to use a complex XPath. XPath can be used for many different purposes, but I will be speaking about it in the specific context of scraping data from a webpage, which is an invaluable skill you will want to learn as an SEO, digital marketer, or web developer.
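For instance, here's a trivial, made-up page and two XPaths that both reach the link inside it (the markup and class name are just for illustration):

<html>
  <body>
    <div class="post">
      <a href="https://example.com/page">Read more</a>
    </div>
  </body>
</html>

/html/body/div/a/@href (the full "pathway" from the root)
//div[@class='post']/a/@href (jumping straight to the "X" via the unique class)

Both return https://example.com/page.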

Why you should care

Inevitably, when it comes to web development, database development, lead generation, competitive intelligence, and general analytics, you will find yourself in need of data to accomplish your goals. If you don't already have that data, you must acquire it from somewhere. You can pay for it, but that's not always fun and it's not always affordable. Well, I have news for you. You are quite literally surrounded by more data than human beings have ever had access to in the entirety of history... and it is right at your fingertips, FOR FREE, and can be acquired within a matter of seconds or minutes. It is often a matter of simply thinking a bit outside the box and figuring out a way to creatively acquire it.

Be Prepared

A bit of a caveat to start off with: be prepared to FAIL. Be prepared to be IP banned from websites for being an unforgiving scrapist. Be prepared to become a bit frustrated from time to time while trying to figure out the "recipe". XPath is often confusing, and takes a lot of trial and error to use successfully. You will achieve success by learning to diagnose those failures, reassessing, educating yourself, and persisting through trial and error. Just keep tweaking those XPaths until those empty fields come streaming in with data. The first time you create, troubleshoot, and successfully run an XPath that nets you data, on your own, a whole new horizon will open up for you. You will begin to realize that nearly anything is within your reach, with enough trial and error. It's a very liberating feeling.

So that's what XPath is, and what it can generally do, but how do we use it? I'm going to focus on a simple method that ANYONE reading this will be able to get up and running within minutes. If I can find the focus, I'll probably also be putting up some tutorials on developing your own scrapers in Python, using XPath, among other methods.

Resources & Examples

W3Schools has a few good resources to get you started with making sense of XPath and how to create/find one for a given page element:

https://www.w3schools.com/xml/xpath_intro.asp
https://www.w3schools.com/xml/xpath_syntax.asp
https://www.w3schools.com/xml/xpath_examples.asp

Pay particular attention to the "syntax" link. You'll probably need to refer back to that quite a bit. Also, trial, error, assessment, and self-education are paramount. I highly recommend trying as many things as you can, and Googling what you are generally trying to achieve.

You will often come across StackOverflow threads full of TONS of great info to help you develop the winning XPath combination that will net you that piece of data. For example, "xpath anchor href contains" might be something I would Google if I were trying to develop a more general XPath that would look at all of the hrefs on a page and scrape only the ones that contained certain text or certain attributes.
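A rough sketch of the sort of expression that search tends to lead you to ('blog' here is just a placeholder string):

//a[contains(@href, 'blog')]/@href

That grabs the href from every anchor on the page whose URL contains 'blog', and ignores the rest.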

A further example of this might be, if you wanted to develop a general XPath you could utilize to find any href linking to facebook.com, twitter.com, plus.google.com, youtube.com, etc. With that, you could dump a seed list of domains in your chosen scraping program and have a good degree of success in scraping the social profile links for each domain. Now think of all the paid services out there that offer you that sort of ability, or similar abilities. Are you starting to see where I'm going with this? As you go further down this rabbit hole, you will come to a realization. Where you may have previously paid for certain data collection services (sales leads, for example...) in the past, you can simply spend a few minutes or hours and develop the capabilities to perform many if not most of those services yourself and almost entirely for free.
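Something along these lines would be a starting point (untested against any particular site, so treat it as a sketch to tweak):

//a[contains(@href, 'facebook.com') or contains(@href, 'twitter.com') or contains(@href, 'plus.google.com') or contains(@href, 'youtube.com')]/@href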

As you begin to scale, and as you target certain sites, there can be some minor expense involved. You will eventually want to take steps to protect yourself by anonymizing and randomizing your activity so as to begin operating effectively in the "wilderness of mirrors". This will involve proxies, a remote VPS so that your true traffic origin is masked, a VPN to encrypt your connection and obfuscate any sort of trace, etc. I'll probably touch on those things more in another tutorial.

Nuts & Bolts

What can we use to scrape? One of the quickest and easiest ways to get started is with Screaming Frog SEO Spider. Download it if you don't already have it. They also have some decent pointers on creating XPaths:

http://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#extraction


[Screenshot: Screaming Frog SEO Spider]


Where do we start in creating an XPath? With Firefox or Chrome, there are quick ways to get started via the developer tools. Open the site you want to scrape, right-click the element you want on the page, and click "Inspect". Now hover over that item in the source code in the inspector window, right-click, and copy the XPath. Next, open Screaming Frog and select Configuration > Custom > Extraction. Select XPath from the dropdown, paste yours in, hit OK, then enter your URL to scrape and run it. What did we come up with? In this example, a big, fat NOTHING. So let's go back to the drawing board.


[Screenshot: inspecting the element and copying its XPath in the browser dev tools]


In this example, I'm trying to get a baller thumbnail image from a YouTube video. What you'll find with Chrome/FF developer tools is that the XPath you're given won't always be the one you need to accomplish what you want. There's also the separate issue that Screaming Frog is picky about the exact XPath syntax you use. Sometimes the one that works in Screaming Frog won't work in a custom-coded Python bot, and vice versa. Get used to tweaking things. In the case of Screaming Frog, I often find that it doesn't seem to like "double quotes", but instead prefers 'single quotes' when you are specifying a class, ID, or certain other items.
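To illustrate, here's the exact same expression quoted two ways (the og:image meta tag happens to be the one we're after in this example):

//meta[@property="og:image"]/@content
//meta[@property='og:image']/@content

Identical logic, but per the above, the single-quoted version tends to be the safer bet in SF.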


[Screenshot: Screaming Frog custom extraction settings]


So back to the task at hand. We want that image. This XPath didn't work: /html/head/meta[8]. Was it the XPath, or was it maybe just a setting in Screaming Frog? Well, we have dropdown menus, so let's just try all of those options. First we used "Extract Inner HTML", but that didn't work. The thumbnail URL lives in an attribute of the tag rather than its inner HTML, so let's try "Extract HTML Element" instead. BAM! That worked! Though we got the full HTML string along with it. If you have a list of videos you want to do this with, it could get really tedious doing a Find/Replace in Excel, Sublime, or whatever, trying to delete the extraneous stuff so you just have the thumbnail links. For a few items, no biggie. For thousands, a much bigger deal, especially if that's on top of lots of other XPaths and data fields.
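To show what I mean, "Extract HTML Element" hands you back something like this (VIDEO_ID being a placeholder):

<meta property="og:image" content="https://i.ytimg.com/vi/VIDEO_ID/maxresdefault.jpg">

...when all we actually want is the https://i.ytimg.com/vi/VIDEO_ID/maxresdefault.jpg part.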


[Screenshot: Screaming Frog extraction results]


So let's try to grab JUST the URL and save ourselves a lot of hassle. You'll commonly see a couple main types of XPaths. There's what I usually call the "general type", and the other I call the "specific type". With the general type, you might be calling out common page elements that are present on most webpages. In the first example, we started with /html/head/meta[8], because almost every webpage is going to start with <html>, should have a <head> section, and probably has some <meta> elements. Do you see a potential issue with this general XPath, though? This one is basically saying: take a trip past the <html>, keep going into the <head>, and from the 8th <meta> tag, "Extract the HTML Element" (what we selected in the SF dropdown) and show it to me. What if that image isn't consistently the EIGHTH meta tag on the page, though? You might get nothing, or you might get some random HTML element. Again, not a biggie with a few pages. Absolute hell when you're scraping thousands, tens of thousands, hundreds of thousands or more pages.

So that's the downside of the "general type" of XPath. The plus side is, if you have a simple element with a simple path, and that element remains consistent across many/all pages, a general XPath is often quick and easy to create. For example, /html/head/title will often get you the page title from most webpages. Simple, right? With SF, thankfully a lot of that is built in, and you can just do a standard crawl without having to create any XPaths.
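If you'd rather test that outside of SF, here's a minimal Python sketch of the same general XPath (assumes the requests and lxml packages are installed; example.com is a stand-in URL):

import requests
from lxml import html

# Fetch the page and parse it into an element tree
resp = requests.get('https://example.com')
tree = html.fromstring(resp.content)

# The "general type" XPath: walk root -> head -> title
print(tree.xpath('/html/head/title/text()'))  # e.g. ['Example Domain']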

To protect yourself going forward and ensure the integrity of your scraped data, let's take a look at the more specific XPath type. Here's an example: //meta[starts-with(@property, 'og:image')][1]/@content

This is saying: for the [1st] <meta> tag whose property starts with 'og:image', scrape the content attribute from it. In this case, it even works whether you select HTML Element or Text in SF.

What if it isn't always the [1st] <meta> tag with og:image, though? You might not want to constrain yourself like that. In fact, there are at least 1,001 ways to skin this cat. Here's another: //*[contains(@content, 'maxresdefault.jpg')]/@content

This one is saying: for anything (*) that has a content= attribute with 'maxresdefault.jpg' in the string, scrape that content and return it. In that case, even if there are sometimes multiple og:image tags on a page, it will only return the one with that exact string in the filename. It just so happens that, as of right now, YT consistently names their thumbnails with this filename. That may change in the future, but for the time being, it gives us a consistent footprint to work with, using a specific XPath (specific in terms of EXACTLY the type of element that you want) that is not constrained to a specific position within the page. This also protects the integrity of the data you're scraping, should pages be a bit inconsistent in structure and those exact items end up located in slightly different areas of the page.
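If you want to sanity-check both expressions outside SF, here's a quick Python sketch (VIDEO_ID is a placeholder, and YT's markup may of course drift over time):

import requests
from lxml import html

url = 'https://www.youtube.com/watch?v=VIDEO_ID'
tree = html.fromstring(requests.get(url).content)

# Position-constrained: the 1st og:image meta tag
print(tree.xpath("//meta[starts-with(@property, 'og:image')][1]/@content"))

# Position-independent: anything whose content mentions the filename footprint
print(tree.xpath("//*[contains(@content, 'maxresdefault.jpg')]/@content"))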


[Screenshot: the specific XPath returning just the thumbnail URL]


That's great, but what else can we do with this? Well, first off, the thing you'll want to realize is that SF is limited and only allows up to 10 custom extractions. So if you want to scrape more than 10 specific things from a page or set of pages, you're going to have to move to something like Python, or other languages/programs that are more flexible. Another thing to be aware of is that there are more ways to perform similar functions than just XPath. For example, there are also CSSPaths. As the name denotes, these use CSS selectors instead. From the above example of creating a path to a YT thumbnail image, here's what that CSSPath might look like: head > meta:nth-child(40). I'm sure you can see some of the potential issues there as well. Same principles, however. You can tweak that "Rubik's cube" many different ways, both specific and general.
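Once you're past SF's 10-extraction ceiling, the jump to Python is smaller than it looks. Here's a sketch using the parsel library (the same selector engine Scrapy uses; the URL and field selectors are illustrative, not gospel):

import requests
from parsel import Selector

sel = Selector(text=requests.get('https://www.youtube.com/watch?v=VIDEO_ID').text)

data = {
    'title': sel.xpath('/html/head/title/text()').get(),
    # XPath and CSSPath side by side, as many fields as you want
    'thumb_xpath': sel.xpath("//*[contains(@content, 'maxresdefault.jpg')]/@content").get(),
    'thumb_css': sel.css("meta[property='og:image']::attr(content)").get(),
}
print(data)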

Possibilities

To give you some other ideas, earlier this week I used Scrapebox to scrape a few hundred keywords from YT suggest, then ~300K video links from YT related to those keywords. I then spent a few minutes creating the XPaths and a couple CSSPaths to pull this data for each of those videos: Views, Likes, Dislikes, Comments, Channel Name, Channel Link, Subscribers, Vid Description, and I think that was it. Plugged that into SF and let that run for an hour or two. BAM. A "curated" list of 300K (maybe 50-75% relevant) vids, with "metrics" to help prioritize them, to begin filling a bulk content upload for a site.


[Screenshot: scraped YouTube video data]


Another project involved scraping millions of businesses from a few directories, both to monetize as sales leads and to use as content in multiple ways across multiple sites. That involved Python + Scrapy + the wilderness of mirrors.

Next up, if I can manage to focus a bit in the coming weeks, maybe a crash course on scraping with Python as well as the Scrapy framework.
 
135 views and no replies? Come on now! :wink: Someone has to have questions, corrections, or elaborations. Let's hear them!
 
I'd like more examples of what could be accomplished because I have no idea what I could use this for tbh
 
Here's one. Take your previous site, for example. One way a person could scale their content production is to scrape significant amounts of content from other sources and then add their own spin on it, unique content, etc. For example, YouTube videos. Here's a workflow that might be efficient (and that I've used before):

1) Scrapebox KW Scraper + YouTube Suggest: Scrape relevant keywords for the topics you want.
2) Standard Scrapebox scrape of YouTube
3) Build your XPaths like I did in the example above
4) Plug it into Screaming Frog and crawl all the scraped URLs. You now have all the meta data plus some metrics to prioritize what your best quality videos might be
5) Make a spreadsheet with all of that data. Watch a few vids, write down a few thoughts, and make a column in the spreadsheet for that; there's your unique content (a sketch for prepping that spreadsheet follows this list). Use the video description as a "quote" if you need to add a bit more content (try not to go overboard with what will ultimately be duplicate)
6) WP All Import: Now use WPAI to import it in bulk and schedule out posts for months in advance, or as long as you want.
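To bridge steps 5 and 6, here's a minimal Python sketch of reshaping SF's export into a WPAI-ready CSV (the column names are hypothetical; match them to whatever your actual export uses):

import csv

with open('sf_export.csv', newline='', encoding='utf-8') as f_in, \
     open('wpai_import.csv', 'w', newline='', encoding='utf-8') as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=['video_url', 'title', 'thumbnail', 'my_notes'])
    writer.writeheader()
    for row in reader:
        writer.writerow({
            'video_url': row.get('Address', ''),   # SF's URL column
            'title': row.get('Title 1', ''),       # SF's page title column
            'thumbnail': row.get('Thumb 1', ''),   # your custom extraction (hypothetical name)
            'my_notes': '',                        # your unique commentary from step 5
        })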

Oh, and BTW, if you've also built XPaths to pull the channel names and channel links, sort through your spreadsheet and see if you notice a trend. You might be able to quickly identify some influencers, find some creative ways to engage them to your benefit, and/or even work some traffic-leaking magic so you are hitting things from multiple angles...

Scrapebox will cost you $54 (discount code from BHW). WPAI will cost you $200 (hard to stomach, I know, but it can pay for itself quickly). The end result is, within a few hours of work, you could potentially generate enough content for hundreds if not thousands of posts for your site. In the past few weeks, again with just a few hours of work, I've done exactly that with one site, generating a list of tens of thousands of YT vids. In terms of beginning to build frequency of content, I'm sure you can see how that could turn out awesome. Add a few other things in there, such as scraping for different content types, good quality blog posts to curate (adding your own unique content as well, of course), slideshares, or whatever else makes sense for the subject matter. Get 2-5 content types going to keep things varied and interesting, a backlog of thousands of pieces of content, and you are well on your way, at least with certain types of content that will have certain benefits for certain user groups (not necessarily everyone).

Starting to see where I'm going with this? That's just the tip of the iceberg. Another example: say you're in a particular niche where having a business directory could prove useful. Find an existing directory, build your XPaths, scrape it even with just Screaming Frog, and you've got your own seed data to build yours. Take it a step further: build some general XPaths that stand a good chance of finding common elements on a site (email, phone, other contact info, social profile links, etc.), and you could even take your own directory a level beyond your competitors', building company profiles and enticing "users" (business owners) to generate content for you...
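For the contact-info piece, a couple of general XPaths that tend to travel well across sites (they'll only catch properly marked-up links, mind you):

//a[starts-with(@href, 'mailto:')]/@href
//a[starts-with(@href, 'tel:')]/@href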

Or maybe lead gen, particularly if cold calls aren't a big deal for you. Find a few directories, or pseudo-directories where lots of potential buyers congregate, and go to town.

And if not leads for you, SELL those leads. Plenty of people will still pay money for the convenience of not having to do that work themselves, simply buying "the list", and giving that to their people to work. The money is literally just sitting out there, waiting for you to take it.
 
I know what I'll be doing when I wake up :D I'll give the content production a go.

@turbin3 for this content are you basically just posting YouTube videos with a paragraph or 2 of "your thoughts"?
 
Just an update, and unfortunately not a great one. @gasmonkeygarage just informed me that, with one of the newer free versions of Screaming Frog, they locked down most of the custom features discussed above. So unfortunately the free version of Screaming Frog SEO Spider is a no-go for those looking for a free or cheap option that can handle XPaths and a bit of customization. At this point, I'm not sure if there's another free, easy alternative out there to help people ease into this before going into full-blown programming and scripting bots.
 
Google Sheets' IMPORTXML works for very small runs, and the SeoTools for Excel plugin can handle slightly larger jobs. Nothing quite like what Screaming Frog can do, though.
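For anyone wanting to try the Sheets route, the YT thumbnail example from earlier translates directly (VIDEO_ID being a placeholder):

=IMPORTXML("https://www.youtube.com/watch?v=VIDEO_ID", "//meta[@property='og:image']/@content")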
 