I want to get serious about automation

Nat · Joined Nov 18, 2014 · Messages: 555 · Likes: 345
I've known the importance of automation and bots for a long time, over 10 years actually. I was a kid running bots on RuneScape a decade ago. It took me 5 years to find the motivation to build some type of bot myself (not for RuneScape), but I never really learned enough to keep my interest/motivation up. Several years ago, when I read @CCarter 's traffic leaks thread and he mentioned a bot army, I got excited again. I played around with some bots, but it never really went anywhere. I coded a simple Reddit monitoring bot in Perl and Python using the API. That stuff is kiddie play; I want to really start automating and botting platforms that haven't yet become popular (and heavily spammed). I recently got motivated again while reading about Instagram spamming.

I'm not new to coding, but I want to start from scratch with botting/automation, because most of my experience is with Java, programming basic utilities that don't even use the internet. I know @CCarter said
CasperJS/PhantomJS with Perl, Perl or Python standalone
I'm not familiar enough with JS in general to just hop into the Casper quickstart and go from there. If spending a week on general JS tutorials would educate me enough, I'm super willing to do that. I'm also not familiar with using programming languages in combination with each other (i.e., Casper with Perl). Are there any resources/tutorials 'for dummies' out there that will really get me on the right path?
I'm not trying to get called out for being 'lazy'; I just want to make sure I spend my time learning the most useful things as quickly as possible. I also know this stuff can be done a lot of ways. Casper and Phantom are apparently very powerful for not being detected as a bot?

What are my goals? I'm not trying to go after any of the big sites like Pinterest/IG; I want to start finding websites with a lot of traffic and no bots that I can start light-spamming. One of the first bots I'd like to build would just be an auto-follow bot on a new-ish Instagram-spinoff site, or even just a profile-viewer bot. Eventually I want to get to the point where I can code a really sophisticated bot that can do some of that next-level stuff CCarter talked about: 'bot armies.'
 
Not being detected as a bot can be done with any web scraping setup, be it PHP, Python, wget, etc...

Go for Python, IMHO.
 
If you like Python, give http://scrapy.org a try. The only time you tend to run into trouble is when a site has taken extreme measures not to be scraped - you see this a lot with music gig ticket sales, for example.

It's also an incredibly quick and easy basis for an early-warning system built on automated tests. Even something as basic as a bot that loads up the critical pages on your sites, run automatically after every code release, will catch some issues for you sooner or later.
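To give a flavour of it: a minimal spider is only a dozen lines or so. The domain and selectors below are made up - swap in whatever site and fields you're actually after.

```python
# A minimal Scrapy spider - the domain and CSS selectors are made up,
# but the shape is what matters.
import scrapy


class ProfileSpider(scrapy.Spider):
    name = 'profiles'
    start_urls = ['http://example.com/users']

    def parse(self, response):
        # Pull a couple of fields from each profile card on the page
        for card in response.css('div.profile-card'):
            yield {
                'username': card.css('a.username::text').extract_first(),
                'followers': card.css('span.followers::text').extract_first(),
            }
        # Follow pagination links until they run out
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Save it as profiles.py and run it with "scrapy runspider profiles.py -o profiles.json" - no project scaffolding needed.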
 
While I am interested in scraping, I'm mostly interested in botting/automation - i.e., Instagram-style following/liking, etc., though I don't want to go after Instagram itself.
I've heard of Scrapy several times. It's one of the things I'm looking into. Is it mostly just used for scraping? Can it automate accounts, etc.?
 
You can't automate accounts without doing some scraping along the way :smile:

Frankly, if you're bored, you could automate accounts with a bit of command line scripting and curl alone. Hell, you could type raw HTTP requests using nothing more than telnet.
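To make the telnet point concrete, here's what "typing a raw HTTP request" boils down to - a quick Python sketch using nothing but a plain socket (example.com is obviously a placeholder):

```python
# A raw HTTP request "typed by hand" over a plain socket - the same thing
# you'd do interactively with telnet, no libraries or frameworks involved.
import socket

host = 'example.com'
request = ('GET / HTTP/1.1\r\n'
           'Host: ' + host + '\r\n'
           'User-Agent: Mozilla/5.0\r\n'
           'Connection: close\r\n'
           '\r\n')

sock = socket.create_connection((host, 80))
sock.sendall(request.encode())

# Read until the server closes the connection
response = b''
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

print(response.decode(errors='replace'))
```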

All tools like Scrapy, PhantomJS, etc. really do is give you a framework that takes away the pain of doing things by hand. Need to grab something made via an ajax call without forgetting about cookies/referer strings? PhantomJS makes life good. PhantomJS getting detected and things rapidly heading south? Probably a good idea to drop down a level and see what you can force through with Scrapy.

Ultimately, like most things IT, the best way to find out what's best for which situation is to find a couple of use cases and put together a quick and dirty proof of concept, and take things from there.
 
Sorry, I typed the last response on mobile and was brief. I didn't mean to say scraping wasn't relevant, just that my main purpose is to automate liking/following/commenting on some social websites.

So you're saying that Scrapy is 'below' PhantomJS/CasperJS? I'm really trying to learn more about how bots are detected on websites, beyond the obvious things like ridiculously high activity. For some reason I was under the impression that because Casper handles JavaScript-heavy websites and runs as a headless browser, it looked less suspicious than other options for 'botting.'

I want to start with some small projects to start learning the basics of non-api automating/scraping. So, I'd like to build a bot that auto-follows people on angel.co.
 
http://engineering.shapesecurity.com/2015/01/detecting-phantomjs-based-visitors.html gives a fun overview of common detection methods for PhantomJS - and countermeasures ;-)

Casper uses PhantomJS behind the scenes. I believe you can also set it up to run on Firefox's engine (via SlimerJS) instead of Phantom.

A lot of your success in doing this will come from understanding how websites themselves work. How do cookies work? What's an HTTP referer (deliberate misspelling - you'll learn why), and what other typical HTTP headers are in use and why? What other web storage options are available in JS? What part does ajax play for a given site? And so on.
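To make that concrete, here's a rough Python requests sketch of what those pieces look like from the bot's side - the header values are purely illustrative, not some magic unblockable combination:

```python
# The difference between a default bot request and one that at least
# resembles a browser is mostly headers and cookie handling.
import requests

session = requests.Session()  # persists cookies across requests, like a browser

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'http://example.com/',  # note the misspelling baked into the spec
}

# Hit the landing page first so the session picks up cookies,
# then request the inner page with a plausible Referer already set.
session.get('http://example.com/', headers=headers)
resp = session.get('http://example.com/users', headers=headers)
print(resp.status_code, len(resp.text))
```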

A good first step for a lot of this should be doing everything in a browser, by hand, and using a man-in-the-middle proxy like Charles or Fiddler to see if anything obvious jumps out about how stuff works. Some people prefer diving right in with e.g. Chrome's developer tools. I think it comes down to personal preference.

If you have a solid grounding in these already then you're good to go!
 

Thanks for that link! Extremely informative!

I hate to ask you for anything else because you've been so helpful, but are there any websites you can point me towards to get a better grasp of these concepts (like learning to recognize which things should stand out)? Or any good tutorials on this stuff in general?
 
No idea, I'm afraid :wonder:

You know, it might make sense to look at it from a slightly different angle: automated tests.

The tool chain is similar, and you're doing the same job, it's just what you do with the results that differs.

It might be painful to start with Selenium. If so, crack out something with Codeception to learn the basic concepts and go from there.
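Codeception itself is PHP, but the "load critical pages after every release" idea from earlier takes only a few lines in Python - a rough sketch, with placeholder URLs:

```python
# A bare-bones smoke test: load critical pages and fail loudly if any break.
# Run it after each release (cron, CI, whatever) - same tooling as a bot,
# just pointed at your own site. The URLs are placeholders.
import requests

CRITICAL_PAGES = [
    'http://example.com/',
    'http://example.com/signup',
    'http://example.com/checkout',
]


def smoke_test():
    failures = []
    for url in CRITICAL_PAGES:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code != 200:
                failures.append(url + ' returned ' + str(resp.status_code))
        except requests.RequestException as exc:
            failures.append(url + ' raised ' + str(exc))
    return failures


if __name__ == '__main__':
    problems = smoke_test()
    if problems:
        raise SystemExit('\n'.join(problems))
    print('All critical pages OK')
```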
 
@MooFace I built my first working bot (without API use) yesterday! It took me pretty much the entire day, because a lot of it was reading, plus a good amount of debugging weird errors. It's nothing special, but it ran all night without any errors! I wrote it in Python with Selenium. There's still a lot of work I can do to make this bot better, as it isn't very flexible or full of features, but I'm just happy I got something running. The social website it automates isn't really anything too useful to me; I mostly picked it for practice. (Trying to learn without getting myself insta-banned from anywhere important.) I'm just going to keep practicing and coding.
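The skeleton is roughly the pattern below - real selectors and URLs swapped out for placeholders, using Selenium's find_element_by_* style API:

```python
# Skeleton of a simple Selenium follow-bot. Selectors and URLs are
# placeholders; the randomized sleeps keep it from acting on a metronome.
import random
import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/login')

driver.find_element_by_name('username').send_keys('my_user')
driver.find_element_by_name('password').send_keys('my_pass')
driver.find_element_by_css_selector('button[type=submit]').click()
time.sleep(5)

driver.get('http://example.com/suggested-users')
for button in driver.find_elements_by_css_selector('button.follow'):
    button.click()
    # Random delay between actions - perfectly even intervals are a giveaway
    time.sleep(random.uniform(8, 20))

driver.quit()
```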

I also spent a good bit of time reading up on bot detection... WOW. Players like Distil Networks - the big companies that want to stop bots - have some serious software, and there don't seem to be many public posts/resources about getting around it.
 
I know your focus, Nat, is more on the automation side and probably less on scraping, but I thought I'd mention this in case it helps anyone. If anyone is considering getting into making bots, I also recommend Scrapy. It's a nice, detailed framework that's easy to work with. The cool thing is, to do rudimentary things and just get started learning to code, you really don't have to know a lot - a bit about XPaths, maybe some basics like writing for loops, and you can quickly start scraping stuff from anywhere. Even if someone's focus is automation, Scrapy can still be useful educationally, in terms of getting comfortable with a programming language and starting to solve problems.

Also, in the long term, there's always the potential to incorporate other elements into Scrapy, such as pulling in other code to perform non-standard functions. You could use Scrapy as the framework, since its "middleware structure" is really efficient - it makes things like rotating user agents, rotating proxies, etc. easy to solve, where they might be more difficult in other languages and frameworks. There are also tons of code repos on GitHub, specifically for Scrapy, that are pretty much ready to go out of the box. Then you'd just import whatever other library or code base is necessary to run other functions, maybe more on the site-automation side of things. I confess I haven't gotten into that much myself, as I've focused on using Scrapy for only a few of its typical purposes.
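As a small taste, a rotating user-agent middleware in Scrapy is tiny. This is just a sketch - the agent strings are examples, and you'd enable the class via DOWNLOADER_MIDDLEWARES in settings.py:

```python
# A minimal Scrapy downloader middleware that rotates user agents.
# The agent strings are examples; add as many real ones as you like.
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25 '
    '(KHTML, like Gecko) Version/8.0 Safari/600.1.25',
    'Mozilla/5.0 (Windows NT 6.1; rv:34.0) Gecko/20100101 Firefox/34.0',
]


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Called by Scrapy for every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```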
 

Super useful/helpful! Thanks for posting! Automation and scraping really do go hand in hand. All scraping is automation, but not all automation is scraping - though automation almost always needs some scraping along the way.

Proxy integration/rotation is a really important part of upper-level automation... it's probably what I need to start reading up on pretty soon.
 
Well, as far as Scrapy goes, I just added a post on how to get proxy rotation up and running pretty quickly. :-)

For everything else, proxy rotation usually follows a pretty similar concept from language to language. It usually boils down to these things:

  • A "middleware"
    • Basically the code that does the work of accessing and rotating
  • A source list
    • Usually a single line variable that's a list, tuple, etc. referencing the items or where they're located
    • EX: PROXY_LIST = '/path/to/proxy/list.txt'
    • or PROXY_LIST = [1.2.3.4:100, 1.2.3.4:101, 1.2.3.4:102]
  • A control flow statement for iteration
    • FOR loops are a common example for many languages
    • In other words, pretty much: FOR each -> URL in list -> Randomly select a proxy from the list

The control flow statement, whatever type it ends up being, usually lives in the "middleware", which is typically just a separate module from your main file. You could easily include the logic for all three of the above items in your primary file, but it's usually a good idea (at least for scaling) to "modularize" your code where possible to keep things manageable. That, or work from a single 50,000-line file. ;-) A quick sketch of the whole idea is below. Anyways, hope it helps.
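Putting those three pieces together in Python (proxy addresses and URLs are placeholders):

```python
# The three pieces from the list above in one place: a source list, the
# "middleware" that picks a proxy, and the control flow that iterates.
import random

import requests

# 1. The source list
PROXY_LIST = ['1.2.3.4:100', '1.2.3.4:101', '1.2.3.4:102']


# 2. The "middleware": the code that does the work of selecting/rotating
def random_proxy():
    proxy = random.choice(PROXY_LIST)
    return {'http': 'http://' + proxy, 'https': 'http://' + proxy}


# 3. The control flow: for each URL, grab a random proxy and make the request
urls = ['http://example.com/page1', 'http://example.com/page2']
for url in urls:
    resp = requests.get(url, proxies=random_proxy(), timeout=10)
    print(url, resp.status_code)
```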
 