Crawl a Website for Content

Don't worry, it's my site!

We are undergoing transitions for 30-plus sites, and I need a way to crawl them to get back the title, metadata, content, and anything else that may be helpful. Any recommendations?

I do use Screaming Frog (the unpaid version) and it is great; however, I also need the actual content on each page (it just tells me how many words it has!)
 
Why not whip up a quick PHP cURL or Python script?

Even Linux wget might do the trick.
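
For example, a quick Python version with requests and BeautifulSoup could look like this. It's just a minimal sketch, assuming `pip install requests beautifulsoup4`, and the URL at the bottom is a placeholder:

```python
# Minimal sketch: fetch one page and pull out its title, meta description,
# and visible body text. Assumes requests and beautifulsoup4 are installed.
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""

    # Drop script/style tags so get_text() returns only visible copy.
    for tag in soup(["script", "style"]):
        tag.decompose()
    body_text = " ".join(soup.get_text(separator=" ").split())

    return {"url": url, "title": title,
            "description": description, "text": body_text}

if __name__ == "__main__":
    print(scrape_page("https://example.com"))  # placeholder URL
```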
 
Hire a Python programmer; it should be a pretty straightforward job. The most important piece will be nailing down your specs first: exactly what info do you need, how do you want to store it, and what do you want to do with it?
 
Although this is only part of the puzzle, I posted a basic Python script on another thread for getting tf-idf scores for content. It's designed more for working from a source file of known URLs, like a sitemap, though.

If someone wanted to do it with minimal customization, they could run a crawl with Xenu Link Sleuth, dump the list of URLs into a file, then use this script to parse them and pull out the content.

There are "easier" or more complete ways, but that would be a relatively simple copy/paste method. This may or may not be useful for you, depending on what exactly you're trying to accomplish.
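
For illustration, a bare-bones version of that parse step could look like the sketch below. This is not the original script from the other thread, just an assumed reconstruction; `urls.txt` (one URL per line, e.g. exported from Xenu) and `content.csv` are placeholder file names, and it assumes `pip install requests beautifulsoup4`:

```python
# Minimal sketch: read a URL dump (one URL per line) and write each page's
# title, meta description, and body text to a CSV. File names are placeholders.
import csv
import time
import requests
from bs4 import BeautifulSoup

def extract(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    # Strip script/style tags so only visible copy remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    return title, description, text

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("content.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "title", "meta_description", "text"])
    for url in urls:
        try:
            writer.writerow([url, *extract(url)])
        except requests.RequestException as e:
            print(f"skipped {url}: {e}")
        time.sleep(1)  # be polite, even to your own servers
```

The one-second sleep is optional since they're your own sites, but it keeps a 30-site batch run from hammering any one server.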
 