Nathan Wailes
Web crawling / scraping
General info
Data sources
- http://commoncrawl.org/the-data/get-started/
- This looks great; you don't have to do any crawling yourself.
- https://github.com/commoncrawl/cc-mrjob
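- Common Crawl also has a queryable index, so you can look up captures of a specific site without processing the whole dataset. A minimal sketch (the crawl ID and the `index.commoncrawl.org` endpoint layout are assumptions based on Common Crawl's index service, not this page):

```python
from urllib.parse import urlencode

def cc_index_query(crawl_id, url_pattern):
    # Build a query URL for the Common Crawl CDX index API.
    # crawl_id (e.g. "CC-MAIN-2018-05") and the endpoint layout are
    # assumptions based on index.commoncrawl.org.
    base = "http://index.commoncrawl.org/{}-index".format(crawl_id)
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

query = cc_index_query("CC-MAIN-2018-05", "example.com/*")
# Fetch `query` with any HTTP client; the API returns one JSON record per
# line, each pointing at the WARC file/offset holding that capture.
```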
Captchas
Beating captchas
Information
Providers
- DeathByCaptcha
- Here's a copy of their API, since it's behind their paywall: DeathByCaptcha - API.pdf
- On 2018.01.02 I was helping someone figure out how to get DeathByCaptcha working for the "I'm not a robot"-type captcha. The DBC API was confusing, but he eventually figured it out; it looks like he was using the "Token API", which you can see here (PDF back-up).
- The docs said to look for a "sitekey" in the HTML, but I couldn't find one, so I assumed the docs were out-of-date. Instead I used the "Network" tab to track requests and noticed that the request to the reCaptcha server had a "k" URL parameter; I'm guessing that's the sitekey.
- You're then supposed to use DBC's Python API (which you download from their website; it's just a couple of Python files) to send them the sitekey and the URL, and they return the token you're supposed to include in your form submission.
- But because this guy didn't want to use a browser emulator, he instead used the Network tab to see what the POST request looked like when a user submitted the form, and had his Python code send the same kind of POST request.
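- The flow above can be sketched roughly like this. The request URL, sitekey, and form fields are made up for illustration; the one real detail is that reCaptcha forms expect the solved token in a `g-recaptcha-response` field. The actual DBC Token API call is behind their paywalled docs, so it's only described in a comment here:

```python
from urllib.parse import urlparse, parse_qs

def sitekey_from_request(request_url):
    # The reCaptcha requests seen in the Network tab carry the sitekey
    # in their "k" URL parameter (per the notes above).
    return parse_qs(urlparse(request_url).query)["k"][0]

# Hypothetical request URL as observed in the Network tab:
observed = ("https://www.google.com/recaptcha/api2/anchor"
            "?ar=1&k=6LeHypotheticalSiteKey&co=aHR0cHM6")
sitekey = sitekey_from_request(observed)

# After sending (sitekey, page URL) to DBC's Token API and getting a
# token back, the form POST just needs that token in the standard
# "g-recaptcha-response" field alongside the form's normal fields:
form_data = {
    "username": "someuser",                    # made-up form field
    "g-recaptcha-response": "TOKEN_FROM_DBC",  # placeholder token
}
```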
- 2captcha
Testing environments
Captcha types
reCaptcha v2 ("I'm not a robot")
- Reddit - ELI5: How do those checkbox "I'm not a robot" capchas work?
"If you're logged into your Gmail it assumes you're a human. If you're not you get the reCAPTCHA. Try it. We recently implemented the noCAPTHA for our high-volume online sales apps and assumed Google had some proprietary black magic at work. Nope, Gmail login."
"I have done development work on bots that can get around captchas and reCaptchas. We pass the captcha to a third party service via an API, it is solved, and we can continue off the page. Nothing is fool proof, not even a reCaptcha."
- 2captcha - New way of solving ReCaptcha V2
- 2captcha - How to solve “I’m not a robot” captchas
- 2016.06.15 - BlackHatSEO - How to Solve ReCaptcha V2
- 2017.02.28 - East-Ee Security - ReBreakCaptcha: Breaking Google’s ReCaptcha v2 using.. Google
Articles / videos
- DEF CON 23 - Ryan Mitchell - Separating Bots from the Humans
- She says the way to get around the behavioral checks is going to be to spoof those behaviors.
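- One common piece of behavior-spoofing (my example, not from the talk) is simply not firing requests at machine speed — inserting randomized, human-looking pauses between actions:

```python
import random
import time

def human_delay(base=1.0, jitter=2.0):
    # Wait a randomized, human-looking interval between actions instead
    # of hitting the server at a perfectly regular machine pace.
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# In a crawl loop (fetch() is a placeholder for your request code):
# for url in urls:
#     fetch(url)
#     human_delay()
```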
- 2012.08.10 - Michael Nielsen - How to crawl a quarter billion webpages in 40 hours
Python libraries
- Scrapyd is an application for deploying and running Scrapy spiders. It enables you to deploy (upload) your projects and control their spiders using a JSON API.
- Crawlera is a smart downloader designed specifically for web crawling and scraping. It routes requests through a pool of IPs it manages internally, throttling access by introducing delays and discarding IPs that get banned from certain domains or have other problems, so you can crawl quickly and reliably without managing thousands of proxies yourself. Accounts provide a standard HTTP proxy API, so you can configure it in your crawler of choice and start crawling.
    - It was rec'd by someone on HN here.
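- Because it's exposed as a standard HTTP proxy, wiring it up is just proxy configuration. A sketch (the API key is a placeholder, and the host/port are assumptions from Crawlera's docs at the time):

```python
# Crawlera exposes a standard HTTP proxy, so any client that supports
# proxies (requests, Scrapy, curl, ...) can use it. The API key below is
# a placeholder; the host/port are assumptions, so check your account docs.
CRAWLERA_APIKEY = "your-api-key"
proxy_url = "http://{}:@proxy.crawlera.com:8010/".format(CRAWLERA_APIKEY)
proxies = {"http": proxy_url, "https": proxy_url}

# With requests:  requests.get("http://example.com", proxies=proxies)
# With Scrapy, set the same URL on request.meta["proxy"] so the built-in
# HttpProxyMiddleware routes the request through it.
```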
This has a bunch of good info: https://news.ycombinator.com/item?id=7375575
- 2011.09.28 - StackExchange - Amazon EC2 + S3 + Python + Scraping - The cheapest way of doing this?
- 2011.10.31 - StackExchange - Most efficient (time, cost) way to scrape 5 million web pages?
- 2012.08.10 - Michael Nielsen - How to crawl a quarter billion webpages in 40 hours
- 2013.03.26 - Seminar.io - Running Scrapy on Amazon EC2
- 2014.03.09 - Jake Austwick - Python web scraping resource
    - "In this article I'm going to cover a lot of the things that apply to all web scraping projects and how to overcome some common gotchas."