Become An Unforgiving Scrapist - The Scrapy Framework For Python

turbin3


image via Al Sweigart's book "Automate The Boring Stuff With Python"

So you're trying to collect data from the internet, huh? As we learn, we usually start with manual methods, typically variations on copy/pasting our way to success. Then you start looking for tools to remove steps from the process and collect more data in less time. A few websites with tools here, a browser plugin there, and you scale up a small amount. You might pick up some more technical methods, such as using XPaths and different programs to scrape data from multiple pages. Eventually, it just won't be enough and won't be fast enough. The impatience is maddening at times. We live in the age of big data and machine learning, where those things and more are advancing toward auto/semi-automation, such that manually working in Excel is like bringing a SMART car to the 24 Hours of Le Mans. For small niches, manual methods can still work just fine. For big niches/industries, and/or highly competitive ones, it's innovate or DIE. Sooner or later you will want to learn a programming language, and learn how to build your own customized, optimized crawlers to do more work in less time. One such widely-used language, highly usable for many of these efforts, is Python.

Python is often erroneously referred to as a "scripting language", however it is in fact a general purpose programming language. There are websites and web apps out there that run off of it (Django being one such framework). You can do a great many things with Python. One of the great things about the language is it can be highly readable, and the syntax can be very logical, making it an easy language to program in as opposed to much more verbose and complex (low-level) languages such as C-based languages. Python itself is typically interpreted, but there are plenty of implementations, compilers, and libraries that let it interoperate with or compile down to C, Java, JS, and other languages (Cython and Jython, for example). That can be really useful if you ever take on additional languages, as it can help make your code portable and importable to many others. A lot can be said of Python, and not everyone is always a fan of the language, but the main takeaway here is that it is an incredibly flexible language with significant community support and a wide array of open source libraries and frameworks available. The reason this matters to YOU is that many of the problems you might be trying to solve have likely been solved before and already have code, frameworks, or libraries available to meet those needs. Why reinvent the wheel when someone may have already built an extremely high-performance wheel that you can install in 2 seconds for free?! GitHub is a veritable treasure trove of free, open-source Python projects people have released, which can often solve a lot of your problems. Instead of necessarily learning a completely new (to you) language, in its entirety, from the ground up, you can simply take the same copy/pasta concept to the next level.


Python and Scraping

Python can be used for a great many things, but one of its most common uses is building scrapers and bots to perform various functions. For example, with libraries such as Selenium and/or Mechanize, a person could build a bot that mimics a browser and a real user. Imagine your scraper providing the site a real user agent, executing JavaScript, utilizing cookies... There are plenty of ways to build some rather creative bots. The thing I want to highlight here is that you can build bots and scrapers directly in Python, without necessarily having to use many other frameworks or libraries. That being said, there's working hard, and there's working smart. I'm about to show you how you can work smart.
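To make that concrete, here's a minimal sketch of fetching a single page in plain Python 2.7 with nothing but the standard library, setting a custom user agent along the way (the URL and UA string are just placeholders):

Code:
# Minimal sketch: fetch one page with a custom User-Agent using only the
# Python 2.7 standard library. The URL and UA string are placeholders.
import urllib2

url = 'http://www.example.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) ExampleBot/1.0'}

request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
html = response.read()

print response.getcode()       # HTTP status code
print len(html), 'bytes downloaded'

It works, but everything beyond the basics, retries, throttling, parsing, exporting, is on you to hand-roll. That's exactly the gap Scrapy fills.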


The Scrapy Framework

So what's the point to all of this? Scrapy, for most of us, is a great way to start working smart. It's an open-source web-scraping framework for Python. It's quick and easy to install and get up and running. While some people like building custom bots directly in Python, without being restricted to one framework, I would say for MOST people's needs, Scrapy probably has you covered, and might be the most efficient way you can get up and scraping with Python.


Installing Python and Scrapy

In the vein of not reinventing the wheel, I'm going to hit the highlights here, linking to some other concise install tutorials I've come across.

First, download Python. I'd highly recommend Python 2.7, as that's what I'm basing this tutorial on, and that's what a significant percentage of open-source frameworks are written/optimized for. It'll probably still be a few years before a significant percentage of the industry really begins transitioning to Python 3+. Make sure you choose the appropriate version, whether you have a 32bit or 64bit system.

https://www.python.org/downloads/

Next, choose the right tutorial for the system you're using:

http://docs.python-guide.org/en/latest/starting/install/win/
http://docs.python-guide.org/en/latest/starting/install/osx/

**One golden nugget here: when installing Python on Windows from the installer (unless you're doing it through the command line, which will be a bit different), there will be an option to add Python to your PATH. Select it and you can skip the next section.


Adding Python to Your PATH

Ensure you have Python added to your PATH. Both tutorials above cover this. If Python hasn't been added to your PATH, you won't be able to run Python commands easily from the command line. When it is in your PATH, any time you have a command or terminal window open, you can run Python from anywhere, even if you are not in the Python subdirectory.
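A quick sanity check that the PATH piece worked: open a fresh command/terminal window (not one that was already open before the install) and run:

Code:
python --version

If it prints a version number instead of a "not recognized" / "command not found" error, Python is on your PATH.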

Installing Setuptools and Pip

Ensure you have Setuptools and Pip installed. Pip is an installation tool for Python packages, and you're going to need it for Scrapy, as well as potentially some other frameworks and libraries, as you begin to grow your bots. The nice thing about Pip is it lets you quickly install packages that you might be missing. For example, when you attempt to run a Python program or Scrapy bot, you will often see errors and the run will fail. Sometimes these errors are because a component of the program requires a certain library that you don't have installed. Just like with package managers on the Linux CLI (command line interface), you can easily install many of those packages right from your Windows CLI, terminal, etc. Here's a decent tutorial for verifying and/or installing:

http://dubroy.com/blog/so-you-want-to-install-a-python-package/

Run "pip install scrapy" from the command line to install the Scrapy framework. http://scrapy.org/


Creating Your First Scraper

First you need to determine where you want to store all of your Scrapy projects, and make a directory for it. Go into that directory in your CLI (in Windows, open the "Run" dialog and type CMD). From that directory, run "scrapy startproject <yourprojectname>" to create your first Scrapy project (you don't need the greater/less than symbols). This will create the base files and folder structure for you to start from (roughly the layout in the sketch after the list below), which makes it quick and easy to get up and running. In your new project directory, you'll have 3 main files where you'll get started:

-items.py
-pipelines.py
-settings.py
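
For reference, the generated layout looks roughly like this (using "testproject" as a hypothetical project name; the exact files can vary a bit between Scrapy versions):

Code:
testproject/
    scrapy.cfg            # deploy/config file
    testproject/
        __init__.py
        items.py          # item/field definitions
        pipelines.py      # post-processing of scraped items
        settings.py       # project configuration
        spiders/
            __init__.py   # your spider files will live in this folder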

The Major Components of Scrapy

Now I'm probably going to butcher this description, so feel free to correct me if I'm a bit off, as this is just my own working understanding. Basically, items.py is one of the higher-level files where you define the structure of the data your bot collects (think "classes" and "fields"). Think of it as the "what", but not necessarily the "how". There are, I believe, additional ways items.py can be used, but that's what I usually use it for. Here's more info to learn about items.py:

http://doc.scrapy.org/en/latest/topics/items.html
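
As a minimal sketch, an items.py for a simple meta-tag scraper might look something like this (the item and field names are just hypothetical examples):

Code:
# items.py -- define the fields your bot will collect
import scrapy

class PageItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    meta_description = scrapy.Field()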

pipelines.py is basically "post-processing" for the things you've scraped with a spider. When the data is actually being scraped, you may run into many issues, such as duplicated data, unformatted or incorrectly formatted data, or some other condition of that data that is less than desirable. A Scrapy Pipeline is basically a class where you can define how to format, improve, and shape that data into a usable format for your needs. For example, you can build a Pipeline to dedupe data. You can also build a Pipeline to output your data as JSON, CSV, or another file format you prefer. If you're scraping a lot of data, you might choose to develop a Pipeline that will post-process that data for quality control, and then export that data into a database. Feel free to let your minds run wild here, as the more data you scrape, the more you will need to "shape" that output into a usable and efficient format. If you get really in-depth and creative, you might use some Pipelines to do post-processing and file management of your data, while separately building an efficient and attractive front-end to query, manage, and derive insights from that data. For example, Pipelines that refine and export to a database, which a local or web app with a D3.js front-end then turns into insightful visualizations, letting you find the hidden value in that data in the blink of an eye. That certainly sounds a hell of a lot more productive than looking at hundreds of thousands of rows of data in Excel! You can sort of think of Scrapy Pipelines as the "why" behind what you're doing. Here's more on Pipelines:

http://doc.scrapy.org/en/latest/topics/item-pipeline.html
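
For example, here's a bare-bones sketch of a dedupe Pipeline, assuming each item carries a 'url' field like the hypothetical PageItem above. You'd then switch it on via the ITEM_PIPELINES setting in settings.py:

Code:
# pipelines.py -- drop any item whose URL we've already seen
from scrapy.exceptions import DropItem

class DedupePipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen_urls:
            raise DropItem("Duplicate item found: %s" % item['url'])
        self.seen_urls.add(item['url'])
        return item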

settings.py, as the name denotes, is the primary configuration file for that particular Scrapy project. If you choose, you could create a separate project for every single bot. You might even create a separate project for specific functions of a larger overall bot spread across multiple projects, though for most people's uses that would be over-complicating things significantly. In many cases, for general scraper bots, you can probably combine all or most into one single project that will continue to grow over time. Don't overthink things here, and try to focus on one single project to start with. Your ultimate goal, as you learn more Python (and any programming language for that matter), is to write highly reusable code as opposed to one-off projects that only ever get used for small edge-cases. For example, if you build a bot that scrapes the primary meta tags from a web page using the simplest type of XPath, this bot, or at least those specific item classes, will likely work on the vast majority of sites you might crawl. Highly reusable code pays for itself in the long run, because it prevents you from having to constantly reinvent the wheel. The easiest way to think of settings.py is like the .htaccess file on an Apache server, or nginx.conf on an NGINX server. This is where you will configure the major components and modules available to you in Scrapy, or that you decide to build into it with your own custom modules. Here's more on settings.py:

http://doc.scrapy.org/en/latest/topics/settings.html
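
To give you a feel for it, a few common knobs from a settings.py might look like this (values are just illustrative, and the pipeline entry assumes the dedupe sketch above with "testproject" as the project name):

Code:
# settings.py -- a few common settings (illustrative values)
BOT_NAME = 'testproject'

DOWNLOAD_DELAY = 2          # seconds to wait between requests
CONCURRENT_REQUESTS = 16    # how many requests Scrapy keeps in flight

ITEM_PIPELINES = {
    'testproject.pipelines.DedupePipeline': 300,
}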

Another primary component of Scrapy is the "Middleware". There are both Downloader Middlewares and Spider Middlewares, and you can probably guess the difference by the names. Middlewares are what they sound like: components that hook in between the engine and your bots' major functions, letting you process requests and responses on their way through. For example, you can create a custom Downloader Middleware that will allow you to customize some of the major "behaviors" your bot exhibits, as well as how it processes requests and the responses to those requests. Say you want to scrape a list of URLs, except that site blocks all requests that don't accept cookies. Downloader Middlewares will allow you to enable cookies for your bots, as well as define how the bots use those cookies. You might choose to allow a bot to keep a cookie for 5 URL requests, then dump it and pick up the next cookie, all in an effort to obfuscate certain traffic-modeling and blocking methods. Another example might be using the HTTP Cache Middleware to store a log of HTTP response headers. You might choose to strategically crawl certain types of pages on a competitor's site that uses complex cookies, so you can later analyze the response header logs and get a feel for what they're doing and why. Here's more on Downloader Middlewares:

https://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
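
A custom Downloader Middleware is just a class with a couple of hook methods. Here's a bare-bones sketch (the class name is hypothetical) that logs the status code of every response; you'd enable it under DOWNLOADER_MIDDLEWARES in settings.py:

Code:
# A minimal custom downloader middleware sketch
class ResponseLoggerMiddleware(object):
    def process_request(self, request, spider):
        # Inspect or modify the outgoing request here (headers, cookies, proxy, etc.)
        return None  # None means "continue normal processing"

    def process_response(self, request, response, spider):
        # Inspect the response here, e.g. keep an eye on status codes/headers
        spider.logger.info("%s returned %s" % (response.url, response.status))
        return response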

Spider Middlewares allow you to do things such as specifying a depth limit, so that your bot doesn't recursively crawl an insane number of pages on a site when you might only care about a small percentage or certain subdirectories to a particular level. They also allow you to do cool things, such as defining how the bot should handle certain types of HTTP responses. For example, say you have an extremely large site that's a bear to maintain, and you're always dealing with pages being 404'd all over the place. You might choose to create a Scrapy bot, set it to run on a schedule with a cron job, to periodically crawl your site. You could then have it ignore every HTTP 200 page, 301, 302, etc. and instead simply have it output a CSV of every 404 it detects, so you can just let the problems come to you and create your redirects as necessary. Are you starting to see how this Python stuff can seriously improve your life?! Really start thinking about the things you need to do, come up with ways to do them repeatedly, then figure out a way you can do that with Python, then AUTOMATE it and enjoy life. Here's more on Spider Middlewares:

https://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
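
Both of those ideas map straight to settings that drive built-in Spider Middlewares. A sketch of what you might drop into settings.py for the depth-limit and 404-collection examples above:

Code:
# settings.py -- cap crawl depth and let 404 responses reach your spider
DEPTH_LIMIT = 3                   # DepthMiddleware: don't follow links more than 3 levels deep
HTTPERROR_ALLOWED_CODES = [404]   # HttpErrorMiddleware: pass 404 responses to your callbacks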

So to recap, you have Items, Pipelines, Middlewares, and Settings. Those are the bulk of what I'll call the "structural" components to define, activate, and/or deactivate to set up your bots' core behaviors. One thing we haven't talked about yet is the real core of the "how" part of What, Why, and How. While you may define some of the fundamental behaviors in your settings, pipelines, and middlewares, the real meat and potatoes is what you will define in the bot you create in the "spiders" folder of your project directory. To create your first spider, in your CLI and in the folder with your items, pipelines, and settings files, type
Code:
scrapy genspider <yourspidername> <website>
and Scrapy will generate a base spider file for you, based on Scrapy's templates. For example:
Code:
scrapy genspider testspider example.com
will create a ./spiders/testspider.py file for you, with the initial bot template set up to crawl "example.com". One thing to consider for the long term: if you end up creating a huge number of bots, and you've started to have success writing highly reusable code, you might want to build some of that into a custom template for Scrapy to generate from, so future bots start off on the right foot. You can always copy/paste that stuff from your existing bots, but that gets tedious to manage compared to generating a clean new template. Another thing to keep in mind is how you'll manage your bots over time. The standard scrapy genspider command above generates a separate Python file for each bot. In the long run, especially if most of your bots are going to be very similar, you might prefer to simply copy/paste everything into one single bot file. This is how I usually do mine, unless certain bots are so drastically different that I don't want to complicate management of the others. Back to that testspider.py for example.com, here's what that templated code starts out looking like:

Code:
# -*- coding: utf-8 -*-
import scrapy


class TestspiderSpider(scrapy.Spider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        pass


So now you have Python and Scrapy installed, as well as your first Scrapy bot template created. Next up will be Building Bots, and learning how to actually create some of the functions you want to scrape various things. After that will be Learning How To FAIL, which is one of the most useful and fundamental things you can learn when learning to program.
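
As a small preview of the Building Bots part, here's a minimal sketch of that same testspider.py with parse() filled in, assuming the hypothetical PageItem from the items.py example earlier:

Code:
# -*- coding: utf-8 -*-
import scrapy

from testproject.items import PageItem  # hypothetical project/item names


class TestspiderSpider(scrapy.Spider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        item = PageItem()
        item['url'] = response.url
        # xpath() returns a list of matches; grab the first <title> text if there is one
        titles = response.xpath('//title/text()').extract()
        item['title'] = titles[0].strip() if titles else None
        yield item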
 
Proxies + Rotation

Adding in a list of proxies, and setting up Scrapy to randomly rotate through them, is actually surprisingly easy. Here are the highlights of what you'll need to do:

  • Install a proxy middleware
  • Modify settings.py
  • Add your list of proxies to a file

So, I'm going to try this a different way from my usual novel-sized posts, in an effort to be clear and concise. A guy named Aivars Kalvans from Latvia was kind enough to create his own proxy middleware and post it to his GitHub repo. If you followed the instructions from earlier, you'll have Setuptools and PIP installed, so the instructions are pretty simple. You can find the repo here:

https://github.com/aivarsk/scrapy-proxies

Just run this command from your command line:

Code:
pip install scrapy_proxies


Next, mod your settings.py file in your Scrapy project's directory:

Code:
# Retry many times since proxies often fail
RETRY_TIMES = 10

# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/path/to/proxy/list.txt'


Now create a single text file to dump all your proxies in, and just make sure you update the path as necessary, for the PROXY_LIST variable in your settings.py. Here's an example of the proxy format for the file:

Code:
http://host1:port
http://username:password@host2:port
http://host3:port

That's it! You're up and running, and your proxies will be in use based on the configuration you added to settings.py.

Now, HOLD up just a minute. Take a few minutes to think about what you're doing, and look back at your settings.py configuration. The default code above might not be what works best for you, or might even be a bad idea for some uses. Make sure you CUSTOMIZE this to your needs.

For example, if for one project I need my bot fast and light, I'll want to minimize overhead where possible. There is a LOT of potential for overhead when it comes to Scrapy middlewares, as they're effectively the backend for processing pre/during/post activities of your bots. Say I only want to scrape what I can, and I don't care about any broken or inaccessible pages. Well, I'll probably want to set RETRY_TIMES = 0 (or switch retries off entirely with RETRY_ENABLED = False) so failed requests just get dropped instead of retried. Note that simply commenting the RETRY_TIMES line out doesn't disable retries; it just falls back to Scrapy's default retry count.

In other cases, maybe my project absolutely must process each page in a list, but maybe there's potential of server overload or something, so maybe you stand a good chance of getting HTTP 500 errors. If that's the case, you might want to modify RETRY_HTTP_CODES to only include 500, and nothing else in the list. Really try to get a "feel" for the behavior of your bots, as well as your intended purpose, and tweak these settings accordingly.
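
For instance, a trimmed-down sketch of those retry settings for the "only 500s matter, retry once" scenario might look like this:

Code:
# settings.py -- only retry server errors, and only once
RETRY_ENABLED = True
RETRY_TIMES = 1
RETRY_HTTP_CODES = [500]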
 
User Agents + Rotation

Another one, clear and concise. This one is even easier than the last. Here are the highlights:

  • Create a file for the code, in your project directory (wherever your settings.py is)
  • Modify your settings.py

First off, wherever your settings.py is, create another Python file. Name it something like rotate_useragent.py so it's easy to find. Next, add this code into it:

Code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random UA from the list for every outgoing request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # Add as many user agents as you'd like, one quoted string per line
    user_agent_list = [
        "Opera/9.61 (Windows NT 5.1; U; en-GB) Presto/2.1.1",
        "Mozilla/5.0 (X11; U; x86_64 Linux; en_US; rv:1.9.0.5) Gecko/2008120121 Firefox/3.0.5",
        ]


Those 2 UA's are just examples to show you how to format them. Add as many as you'd like into the list, just make sure you separate them the same way, one quoted string per line ending with a comma, and ensure you close the list with the "]" closing bracket. To give you an idea, I have several thousand UA's in one list. As for the long-term potential, you could have multiple projects, say some for desktop, some for mobile, some for specific browsers only, each project serving appropriate UA's for those platforms/browsers. In essence, taking it a step further towards bypassing some of the more obvious methods of identifying and blocking bots. Don't misunderstand, though, as this is not fooling anyone that's technically proficient, aware, and actively blocking.

From what I remember, I think I had issues trying to make the user_agent_list variable function properly with a tuple/list referencing a source file, as opposed to an actual list/array within the code itself. My memory is foggy on that unfortunately, but this version will work just fine for you.

Lastly, you just need to make some changes to settings.py. You'll need to modify your DOWNLOADER_MIDDLEWARES variable to include the Scrapy user agent middleware. Add these entries to it:

Code:
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'yourprojectname.rotate_useragent.RotateUserAgentMiddleware': 400,

Also, make sure to comment out any instances of the standard Scrapy USER_AGENT field that are present in your settings.py. You can comment lines out by adding a hashtag # followed by a space, at the beginning of the line(s) you would like to comment out. This will ensure there's no weird conflict, and/or the USER_AGENT field won't bypass the rotation middleware.
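
For reference, if you're running this alongside the proxy setup from earlier, the merged DOWNLOADER_MIDDLEWARES in your settings.py would look roughly like this ("yourprojectname" is whatever your project is actually called):

Code:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'yourprojectname.rotate_useragent.RotateUserAgentMiddleware': 400,
}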

That should be about it to get you up and running with rotating proxies and user agents. Basically, every HTTP request your bots fire should be rotating both at this point. If I've missed anything, please let me know. It's been a while since I've set up a new Scrapy project from scratch. I have mine configured relatively how I'd like at the moment, so I've taken to mostly copy/pasting entire projects, files, or variables, and I've forgotten a lot of the initial configuration.

When in doubt, the Scrapy project has excellent documentation that I HIGHLY recommend getting used to for diagnosing problems. It's usually pretty easy to look at the CLI, take note of an error that's been thrown, search Scrapy's docs for the specific variable, error message, etc., and you'll usually find a quick answer for why, as well as a solution, or at least a good starting place for Stack Overflow threads to search.
 
Awesome! I was searching for a decent Scrapy tutorial two days ago and wasn't able to find one I could follow. Thanks for this @turbin3
 