Troubleshooting scraping news sites

algospider · Mar 30, 2020

Hello,

I am trying to create an automated newsletter from financial newsfeeds, such as:

Code:

http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml
https://www.theguardian.com/international/rss
http://rss.cnn.com/rss/money_news_international.rss
https://www.cnbc.com/id/100003114/device/rss/rss.html
https://www.cnbc.com/id/15839069/device/rss/rss.html
http://rss.cnn.com/rss/cnn_topstories.rss
https://www.cnbc.com/id/15839135/device/rss/rss.html

I scrape all of these feeds and then use a summarizer to create a short overview of each story, with a source link.

However, for some of these feeds I get no content back. I am currently using the WordPress Automatic plugin to fill my post pipeline.

Do you know if these news sites are actively blocking scrapers?
Would it be better to implement the scraper on my own using proxies?
Are there any other sources of financial news content that can be scraped?

I appreciate your replies!

BCN · Mar 30, 2020

I'd scrape them in Python, then parse them with e.g. BeautifulSoup. I don't know how well WordPress integrates with this type of work, but I'm sure you can do it, as you can parse XML in PHP.

You can 'normalize' them in Python too, e.g. store them in a local database or a JSON object, which would be easier to parse into WordPress, since you don't need site-specific logic.
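
A rough sketch of what that normalizing could look like, assuming plain RSS 2.0 feeds with `item`, `title`, `link` and `pubDate` elements. The `normalize_feed` helper and the sample feed are made up for illustration, and this version uses only the standard library rather than BeautifulSoup:

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample feed, standing in for whatever a real feed returns.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Markets rally</title>
      <link>https://example.com/markets-rally</link>
      <pubDate>Mon, 30 Mar 2020 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Oil slides</title>
      <link>https://example.com/oil-slides</link>
    </item>
  </channel>
</rss>"""


def normalize_feed(source_name, xml_text):
    """Turn one RSS 2.0 document into a list of plain dicts with the same keys."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "source": source_name,
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate"),  # None when the feed omits it
        })
    return items


normalized = normalize_feed("example", SAMPLE_RSS)
print(json.dumps(normalized, indent=2))
```

Every feed ends up with the same keys, so whatever consumes the JSON (WordPress or anything else) does not need to know which site an item came from.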

bernard · Mar 31, 2020

Try to import them with WP All Import instead.

It's very flexible and you can write a bit of code in the plugin function editor instead of having to do an extra step with Python.

Though Python is excellent for scraping and can help you with proxies and the like if there are issues there.

FIREman · Apr 2, 2020

Try rotating your user agents, proxies, and throttling your requests.
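
A minimal sketch of those three ideas together, using only the standard library. The user-agent strings, the proxy list, the delay values and the helper names (`build_request`, `make_opener`, `polite_fetch`) are all placeholders for illustration, not something a particular site is known to require:

```python
import itertools
import random
import time
import urllib.request

# Placeholder values: swap in real browser user-agent strings and your own proxies.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]
PROXIES = [None]  # e.g. "http://proxy.example.com:8080"; None means a direct connection

_agent_cycle = itertools.cycle(USER_AGENTS)
_proxy_cycle = itertools.cycle(PROXIES)


def build_request(url):
    """Create a request carrying the next user agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_agent_cycle)})


def make_opener():
    """Create an opener that uses the next proxy, or connects directly if it is None."""
    proxy = next(_proxy_cycle)
    if proxy is None:
        return urllib.request.build_opener()
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def polite_fetch(urls, min_delay=2.0, max_delay=5.0):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(random.uniform(min_delay, max_delay))  # throttle
        try:
            with make_opener().open(build_request(url), timeout=10) as resp:
                results[url] = resp.read()
        except OSError as e:  # URLError and HTTPError are subclasses of OSError
            print(f"{url}: {e}")
            results[url] = None
    return results
```

Calling `polite_fetch` with a list of feed URLs returns a dict of URL to raw bytes, with `None` for any feed that failed, which makes it easy to see which sources are the ones returning nothing.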

BCN · Apr 2, 2020

If you're making a newsletter, is it really worth doing it in WP? It's just not the right tool IMO. It's for blogs, and the codebase isn't great for adding in logic. Although they have stuff like background jobs, it's primarily a codebase for rendering content from a DB to HTML via PHP.

Here's some skeleton code. It's probably not working as-is, but it shows the sort of structure you can use. I literally just typed it into the forum here.

Python:

import json

import requests
from bs4 import BeautifulSoup


def load_rss_urls(file_name):
    """
    Load RSS URLs from a JSON file that maps a feed name to its URL,
    so you can keep track of the name.
    """
    with open(file_name) as f:
        return json.load(f)


def get_rss_contents_from_url(url):
    """
    Fetch the feed and return the raw XML bytes, or None on failure.
    """
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        if resp.content:
            return resp.content
    except requests.RequestException as e:
        print(e)
    return None


def parse_rss_feed(xml_data):
    """
    Parse XML with BS4 and return whatever you need from the feed.
    Check their docs. You could also use an RSS parsing library instead
    if the feeds are in different formats.
    """
    soup = BeautifulSoup(xml_data, "xml")  # the "xml" parser needs lxml installed
    parsed_items = []
    for item in soup.find_all("item"):
        title = item.find("title")
        parsed_xml_data = dict()
        parsed_xml_data["title"] = title.get_text(strip=True) if title else None
        # parsed_xml_data["img"] = img_selector  (feed-specific, fill in per feed)
        # [..]
        parsed_items.append(parsed_xml_data)
    return parsed_items


if __name__ == "__main__":
    # Runs only when you run the file directly, so you can still import the
    # functions above from other files without triggering the code below.
    xml_urls = load_rss_urls("feeds.json")  # placeholder file name
    xml_objects = dict()

    for name, url in xml_urls.items():
        xml_content = get_rss_contents_from_url(url)
        if xml_content:
            xml_objects[name] = xml_content

    parsed_feeds = dict()
    for name, xml_content in xml_objects.items():
        parsed_feeds[name] = parse_rss_feed(xml_content)

    # Append them all to a JSON file or something, or send them to an API
    with open("parsed_feeds.json", "w") as f:
        json.dump(parsed_feeds, f, indent=2)

You can keep all the feed URLs in a JSON file, scrape them with maybe 50 lines of Python or Node/PHP into a new JSON object, then just push that to your flavor of email software.

On the last part where you write to a file, you can even just append them to an HTML document as a list/table with title/link/img/description if you want. And you have your HTML email ready to rock 'n' roll.
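
Something along these lines would do for the HTML part. It's only a sketch: the `render_email` helper and the sample items are invented for illustration, and real email clients may want more inline styling than this:

```python
from html import escape

# Made-up items in the title/link/img/description shape described above.
items = [
    {"title": "Markets rally & recover", "link": "https://example.com/a?x=1&y=2",
     "img": "https://example.com/a.jpg", "description": "Stocks rose on Monday."},
    {"title": "Oil slides", "link": "https://example.com/b",
     "img": None, "description": "Crude fell <5%>."},
]


def render_email(items):
    """Build a bare-bones HTML table with one row per news item."""
    rows = []
    for item in items:
        # Leave the image cell empty when the feed gave us no image.
        img = f'<img src="{escape(item["img"])}" width="120">' if item.get("img") else ""
        rows.append(
            "<tr>"
            f"<td>{img}</td>"
            f'<td><a href="{escape(item["link"])}">{escape(item["title"])}</a>'
            f'<br>{escape(item["description"])}</td>'
            "</tr>"
        )
    return "<html><body><table>\n" + "\n".join(rows) + "\n</table></body></html>"


html_doc = render_email(items)
print(html_doc)
```

Escaping every field matters here because titles and descriptions come straight from third-party feeds and can contain `&`, `<` or quotes that would otherwise break the markup.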
