I need to download a "few" pages directly from Google...

Andrewkar

I'm migrating a Joomla K2 site to WordPress. The problem is that one of the Joomla categories is a bit large, around 2000 posts. I've been using a Joomla Export/Import plugin to export a CSV file with the contents, and after making some changes to the CSV file I import it into WordPress. All nice and cool, but I can't export 2000 posts as one CSV file because the server crashes each time I try. Support can't do anything; they made a database copy for me, but that won't work (the site uses the K2 plugin and images could be lost during migration, and I want those images).

So I'm thinking about getting those articles directly from Google. It has indexed around 800 posts from this category, and that is what I want to migrate (for now). Do you have any ideas on how to scrape Google to get those pages? I already have the URLs of those pages, and now I need a tool to grab the contents. Any suggestions?
 
First of all, find out why the server is crashing. Is it a memory issue? Up the memory (or php.ini settings).

Second, you could get this directly out of MySQL without too much pain at all. The DB is probably storing where the images are located on the server, so you can easily pull down just the part you want and kick out a CSV formatted exactly how you want it for the WordPress import.
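
A rough sketch of that second suggestion, in Python. It assumes the rows have already been pulled out of MySQL as dictionaries; the input keys (`title`, `introtext`, `fulltext`, `created`) and the output header names (`post_title`, `post_content`, `post_date`) are guesses for illustration, not something confirmed in this thread, so adjust them to the real K2 table and to whatever your import plugin expects. The function name `write_chunks` is made up as well. The point is only to show small, fully quoted CSV files:

```python
import csv
from itertools import islice


def write_chunks(rows, prefix, chunk_size=200):
    """Write rows (dicts) to prefix_001.csv, prefix_002.csv, ... and return the paths."""
    # Output column names are an assumption; match them to your importer.
    fieldnames = ["post_title", "post_content", "post_date"]
    rows = iter(rows)
    paths = []
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        path = f"{prefix}_{len(paths) + 1:03d}.csv"
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="", encoding="utf-8") as fh:
            # QUOTE_ALL wraps every field in quotes, so commas, quotes and
            # line breaks inside the article HTML stay inside their field.
            writer = csv.DictWriter(fh, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for row in chunk:
                writer.writerow({
                    "post_title": row["title"],
                    "post_content": row["introtext"] + row["fulltext"],
                    "post_date": row["created"],
                })
        paths.append(path)
    return paths
```

Keeping each file to a couple of hundred rows also keeps every single import small, which may help on a host with tight I/O limits.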
 
It's because of disk I/O usage. The problem is that the hosting provider can't lift those limits. The database is another challenge, because this Joomla site uses the K2 extension for content creation, and K2 content is stored in the DB differently from regular Joomla content. I would have to convert the K2 database to Joomla format, and that would be fine, except that it would change the look of the articles on the site, and who knows how they would look after the conversion... It might be a disaster, so I won't go that way. Besides, K2 stores image names as numbers... But I'm now thinking about creating a few more categories, copying portions of the content into them, and then exporting them as smaller files. This should work out fine, I guess. I will update soon. Thanks.
 
I know this isn't what you're asking, but since it's a one-time ordeal, you might consider generating a list of all of the URLs and busting it into chunks, then finding someone from the Philippines or wherever who is good with data entry / collection. Someone would be more than happy to sit at their computer watching TV shows while copying and pasting all day. Or you could bust it into 4 or 5 chunks and get it done faster with that many people going at it. You can give them all log-ins to a WordPress install that has the proper categories set up and let them paste and publish.
 
This is a good idea as well, and I was actually thinking about it. For now I managed to export those articles in chunks (10 files in total) so the server didn't crash. So far so good; however, a new challenge arises :smile: ... Some of the articles in the CSV files are messed up so badly that after importing to WP I have HTML in post titles LOL. It's just a few lines in each file, but I will have to go through around 40 files manually and make the necessary corrections. If that doesn't work the way I want, I will outsource it for sure. Now I know why Joomla to WP migrations can get a bit expensive at times. It's my first job of this kind, and it's for my friend's website. He is kind of a startup without much money... I want to help him, especially now while he is on a very limited budget. He is a solid guy, and I admire him for having the balls to start at all, with no experience in the online world/marketing/SEO whatsoever. We will get there one way or another. Thanks for the input, guys, and I will report back once it's done.
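
Instead of eyeballing every file, a small Python check could point at the broken lines first. This is only a sketch under assumptions: it guesses that the files have a header row with a title column called `post_title` (a placeholder name, use whatever your files have), and that a broken row shows up either as the wrong number of fields or as an HTML tag inside the title. The helper name `suspicious_rows` is made up for this example.

```python
import csv
import re

# Anything that looks like an opening or closing HTML tag.
TAG = re.compile(r"<[a-zA-Z/][^>]*>")


def suspicious_rows(path, title_column="post_title"):
    """Return (line_number, reason) pairs for rows that look broken."""
    problems = []
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        title_idx = header.index(title_column)
        for row in reader:
            # reader.line_num is the file line on which this row ended.
            if len(row) != len(header):
                problems.append(
                    (reader.line_num, f"expected {len(header)} fields, got {len(row)}")
                )
            elif TAG.search(row[title_idx]):
                problems.append((reader.line_num, "HTML in title"))
    return problems
```

Running it over each export and printing the result gives a short list of line numbers to fix by hand, for example `for p in suspicious_rows("export_01.csv"): print(p)` (the file name is a placeholder).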
 
Several possible things:

  1. Wget the whole site, rate limited if necessary, then launch the site locally and export what you need
  2. If Python isn't an issue, this can work really well and is simple to set up: Link
That Python script is a Google/Bing SERP scraper, as well as a cached page scraper. Bing, in particular, has a difficult cached URL scheme that is encoded, so you can't just guess what the URLs are. Set up and run the script to grab the cached page URLs, then plug those URLs into a Scrapy bot or some other bot and grab the entire page of each. Just beware, Google is extremely aggressive with blocking/banning IPs and in some cases entire IP ranges. I would lean towards Bing first, and only resort to Google if I absolutely had to.

Beyond that, there are several Wayback Machine scrapers that might do what you need, if your site is indexed there. If you can make Wget work, I'd go that direction, because that will probably be the easiest. If absolutely necessary, just rate limit the heck out of it (Wget has `--wait` and `--limit-rate` for that), and there are also flags to skip certain things like resource files (for example `--reject` with a list of file extensions), eliminating a lot of the overhead.
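
Since you already have the list of URLs, the same rate-limited idea can also be sketched in plain Python with only the standard library. Treat this as an illustration, not a finished tool: the function names, the `User-Agent` string, the 5 second delay and the `0001.html` naming scheme are all made up for the example, and the `fetcher` and `sleep` parameters exist only so the loop can be tried without touching the network.

```python
import time
import urllib.request
from pathlib import Path


def fetch(url, timeout=30):
    """Download one page and return its raw bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "migration-fetch/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, delay=5.0, fetcher=fetch, sleep=time.sleep):
    """Save each URL as 0001.html, 0002.html, ... and return the failures."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for n, url in enumerate(urls, start=1):
        target = out / f"{n:04d}.html"
        if target.exists():
            continue  # already saved, so an interrupted run can be resumed
        try:
            target.write_bytes(fetcher(url))
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append((url, str(exc)))
        sleep(delay)  # pause after every request to stay gentle on the server
    return failed
```

A run would look something like `download_all(open("urls.txt").read().split(), "pages")`, where `urls.txt` is a hypothetical file with one URL per line. The saved files are numbered in the same order as the list, so each page can be matched back to its URL afterwards.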
 
@turbin3 Thanks! We are almost done :smile: Still a few things to get done, but the finish is almost in sight.
 