What's the best Python option to scrape Javascript generated content?

bernard

As in title, what is the current best method to scrape content that is generated by JavaScript?
 
I have used pyppeteer in the past. There is also another project from Scrapinghub on GitHub.

I had to use a browser-based scraper because of anti-bot mechanisms.
 
I tried pyppeteer via the requests-html render method, but it didn't work; I only got the raw pre-render HTML. Other people say it isn't supported anymore. Not sure.
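For reference, the requests-html route looks roughly like this (the URL and selector are placeholders; render() pulls in Chromium via pyppeteer on first use):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/category/1")  # placeholder URL
r.html.render()  # downloads Chromium on first run, then executes the page's JS
for product in r.html.find(".product"):            # placeholder selector
    print(product.text)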
 
Selenium (there are Python bindings for its WebDriver too) works, but you need to spawn a headless browser for it instead of just scraping with requests.
 
If anyone can do this for me for a small sum, source code included, hit me up. Scrape 5 category pages and grab the usual product data.
 
I've figured it out on my Mac so far using Selenium and BS4. Works fine. Not sure how it works on PythonAnywhere.
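Roughly what that looks like, as a minimal sketch (the URL, CSS selectors, and the Chrome/chromedriver setup are assumptions):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")    # no visible browser window
options.add_argument("--no-sandbox")  # often needed on hosted Linux boxes

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/category/1")    # placeholder category page
    # wait until the JS-rendered products are actually in the DOM
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".product"))
    )
    html = driver.page_source
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
for product in soup.select(".product"):             # placeholder selectors
    name = product.select_one(".product-name")
    price = product.select_one(".product-price")
    print(name.get_text(strip=True) if name else "",
          price.get_text(strip=True) if price else "")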
 
Why use Python for this?


You would make the job far simpler by using Node's Puppeteer for this.

If you're scraping, chances are you have at least a moderate understanding of JavaScript. It's not too hard to learn Node, or at least get to the stage where you can run a Puppeteer client.
 
I've figured it out on my Mac so far using Selenium and BS4. Works fine. Not sure how it works on PythonAnywhere.
Yeah, it can run on any server. I run some Selenium stuff on Linode. Just set up a normal server, install what you need, and it can run there just like on your own computer.
 
Why use Python for this?


You would make the job far simpler by using Node's Puppeteer for this.

If you're scraping, chances are you have at least a moderate understanding of JavaScript. It's not too hard to learn Node, or at least get to the stage where you can run a Puppeteer client.
Yeah, I've recently started using Puppeteer. It's quick and easy to learn. You just need some basic JavaScript knowledge and you're good to go. Did it all in Visual Studio Code.

My first scraping project took like a day, maybe 2 days if you count the optimizing.

Still a lot to learn, but I'm astonished. This is almost too easy.
 
I suggest looking at the underlying source code for JSON data stores, or at the HTTP requests made to the API endpoints. Usually you can skip the entire browser automation stage, which is brittle and has a high maintenance cost.
 
I suggest looking at the underlying source code for JSON data stores, or at the HTTP requests made to the API endpoints. Usually you can skip the entire browser automation stage, which is brittle and has a high maintenance cost.

How?

:smile:
 
To check for JSON data stores on a server-rendered web page:
  1. Right click on the page in your browser and select "View page source".
  2. Find the script tags with JSON-serialized objects that contain the dynamic content. The giveaway is usually the type attribute being set to "application/json", or a "hardcoded" JS object/variable in the script. The new Reddit homepage does both for data loading.
You can then scrape the tag with normal scraping methods and decode the data with a JSON parser in whatever language you prefer.
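As a minimal sketch in Python (the URL is a placeholder, and the assumption is that the page embeds its data in script tags typed as application/json):

import json

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/category/1", timeout=30)  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

# Try every <script type="application/json"> tag and dump whatever decodes cleanly.
for tag in soup.find_all("script", {"type": "application/json"}):
    try:
        data = json.loads(tag.string or "")
    except json.JSONDecodeError:
        continue
    print(json.dumps(data, indent=2)[:500])  # peek at each data store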

To check for API requests made on page load:
  1. Right click on the page in your browser and select "Inspect".
  2. Click on the Network tab. Disable the cache. Reload the page.
  3. Observe all requests being made by the page.
To check for API requests made on JS interaction:
  1. Right click on the page in your browser and select "Inspect".
  2. Click on the Console tab and turn on XHR logging (you want XHR requests to show up in the console; they are usually hidden by default).
  3. Perform the interaction and observe the XHR logs in the console.
The caveat is that this is not for sites with high bot protection; it is best for public content sites. The browser automation option would be better for high-security situations, as there are 101 things a webmaster can do to detect bots, like honeypots or interaction tracking.
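Once you spot an endpoint in the Network tab, you can usually hit it directly. A sketch, where the endpoint URL, query parameters, and response fields are all hypothetical:

import requests

API_URL = "https://example.com/api/products"  # hypothetical endpoint found in the Network tab

resp = requests.get(
    API_URL,
    params={"category": 123, "page": 1},      # hypothetical query parameters
    headers={
        "User-Agent": "Mozilla/5.0",          # some sites reject the default client
        "Accept": "application/json",
    },
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get("products", []):  # hypothetical response structure
    print(item.get("name"), item.get("price"))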
 
I tried to do it but couldn't figure it out. It was some sort of aggregated backend on another domain that runs a bunch of webshops.

Above my paygrade.
 
If you can fetch the API data as mentioned above, it becomes super easy. Sometimes you can use Selenium just to get the login cookie, store it, and inject it when you make the request.
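A rough sketch of that cookie hand-off (the login URL, form field names, and endpoint are placeholders):

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")                      # placeholder login page
driver.find_element(By.NAME, "username").send_keys("user")   # placeholder field names
driver.find_element(By.NAME, "password").send_keys("pass")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Copy the browser's cookies into a requests session, then drop the browser.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
driver.quit()

# Subsequent requests carry the login cookie, no browser needed.
resp = session.get("https://example.com/api/orders")         # placeholder endpoint
print(resp.status_code)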

I've done this to scrape e.g. directories, people-finder type sites, government sites, etc. If you get the JSON, you can even pull it directly into Google Sheets.

If you want it for Google Sheets, I can share a bunch of code.
 
If you want it for Google Sheets, I can share a bunch of code.
There's actually the IMPORTXML function that lets you scrape web pages using XPath, returning the results as arrays. You can then use array functions to clean the data. A super useful tool, really nice for prototyping. Plus, the requests are made by Google, so you don't have to worry about IPs or proxies.
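For example (the URL and XPath here are just placeholders), something like this pulls every matching node on the page into a column:

=IMPORTXML("https://example.com/category/1", "//h2[@class='product-name']")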
 
There's actually the IMPORTXML function that lets you scrape web pages using XPath, returning the results as arrays. You can then use array functions to clean the data. A super useful tool, really nice for prototyping. Plus, the requests are made by Google, so you don't have to worry about IPs or proxies.
Yes, I like to use Google Apps Script though. It allows you to set headers and also manipulate the response.

Especially now that they allow modern JS syntax like classes, async/await, etc., it's very clean to work with.

Will share some code examples today
 