Web crawling / scraping


General info

 

Data sources

 

 

Captchas

Beating captchas

Information

Providers

  • DeathByCaptcha
    • Here's a copy of their API, since it's behind their paywall: DeathByCaptcha - API.pdf
    • On 2018.01.02 I helped someone get DeathByCaptcha working for the "I'm not a robot"-type captcha. The DBC API was confusing, but he eventually figured out how to do it. It looks like he used the "Token API", which you can see here (PDF back-up).
      1. The docs said to look for a "sitekey" in the HTML, but I couldn't find one, and so I assumed the docs were out-of-date. Instead I used the "Network" tab to track requests, and noticed that the request to the reCaptcha server had a "k" URL parameter. I'm guessing that's the sitekey.
      2. You're then supposed to use DBC's Python API (a couple of Python files you download from their website) to send them the sitekey and the page URL, and they return the token you're supposed to use in your form submission.
      3. But because this guy didn't want to use a browser emulator, he instead just used the Network tab to see what the POST request looked like when users would submit the form, and had his Python code send the same kind of POST request.
  • 2captcha
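
Steps 1 to 3 of the DeathByCaptcha notes above can be sketched roughly as follows, using only the standard library. This is a sketch under assumptions, not the code that was actually used: the example URLs and the "email" form field are made up, the token is a placeholder for whatever the solving service returns, and the DBC client call itself is left out because its API isn't reproduced here. The only name taken from reCaptcha v2 itself is the "g-recaptcha-response" form field.

```python
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import Request


def sitekey_from_request_url(recaptcha_url):
    """Return the "k" query parameter (my guess at the sitekey), or None."""
    params = parse_qs(urlparse(recaptcha_url).query)
    values = params.get("k")
    return values[0] if values else None


def build_form_post(form_url, fields, token):
    """Build (but do not send) a POST that mimics the browser's form submit."""
    payload = dict(fields)
    # reCaptcha v2 forms carry the solved token in this field.
    payload["g-recaptcha-response"] = token
    body = urlencode(payload).encode("ascii")
    return Request(
        form_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Illustrative values only: copy the real ones from the browser's Network tab.
anchor_url = "https://www.google.com/recaptcha/api2/anchor?ar=1&k=EXAMPLE_SITEKEY&co=abc"
sitekey = sitekey_from_request_url(anchor_url)

# The token would come from the solving service, given the sitekey and page URL.
request = build_form_post(
    "https://example.com/signup",
    {"email": "me@example.com"},
    "TOKEN_FROM_SOLVER",
)
# urllib.request.urlopen(request) would actually send it.
```

Building the request separately from sending it makes it easy to compare the body against what the Network tab shows before anything goes over the wire.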

Testing environments


Captcha types

reCaptcha v2 ("I'm not a robot")

  • Reddit - ELI5: How do those checkbox "I'm not a robot" capchas work?
    • "If you're logged into your Gmail it assumes you're a human. If you're not you get the reCAPTCHA. Try it. We recently implemented the noCAPTCHA for our high-volume online sales apps and assumed Google had some proprietary black magic at work. Nope, Gmail login."

    • "I have done development work on bots that can get around captchas and reCaptchas. We pass the captcha to a third party service via an API, it is solved, and we can continue off the page. Nothing is fool proof, not even a reCaptcha."

    • A longer comment:

      I crack around 1k of those daily when I am actively cracking them; it does cost me a few bucks though.

      You can use this service https://2captcha.com/, but I have to use the Python Pillow package to join the text and the payload picture together (the instructions with the image), so the human captcha solvers can solve it.

      You could also set up a feed-forward network to solve their images, if you don't want to use a 3rd-party service and pay for it. If you are feeling adventurous you could rebuild the JavaScript virtual machine they are running in your browser in another language and use pure http(s) requests.

      The main difference that makes this new reCaptcha better than the old captcha system is that it is based on bgResponse.

      In simple terms, it's a custom-made little virtual machine written in JavaScript that creates a hash based on your mouse input/clicks, keystrokes, user-agents, browser features etc., all with timestamps included for every action.

      It basically takes all of this information and moves a bunch of bytes around to get a big-ass hash that is submitted in a POST request under the bgResponse variable.

      You can't just replicate that hash, unless you rebuild the little JavaScript virtual machine they are running in your browser in another language. Someone actually tried to rebuild it and posted some code on GitHub, but as you can expect he was courted within a matter of days to "come and visit the office", and that was the last time the OP was heard of, and his project was removed from GitHub.

      I have limited experience as I have tried to break the bgResponse down by using something like http://jsnice.org/ and rebuilding the virtual machine in Python or Rust. Still working on it in my own spare time and I can tell you some parts of it are easier, just a bunch of XORing going on. However, ultimately if you can rebuild bgResponse that was designed by Mike Hearn during his time at Google (yes that is the ex bitcoin developer), then you can do more than solve captchas.

      You can create tons of Gmail accounts without using a browser, you can post YouTube comments without using a browser, and you can log in to any Google service without a browser. And if you have the business kind of mentality, you could make a lot of money with that, think something like $$$$$.

  • 2captcha - New way of solving ReCaptcha V2
  • 2captcha - How to solve “I’m not a robot” captchas
  • 2016.06.15 - BlackHatSEO - How to Solve ReCaptcha V2
  • 2017.02.28 - East-Ee Security - ReBreakCaptcha: Breaking Google’s ReCaptcha v2 using.. Google
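
The "pass the captcha to a third party service via an API" approach quoted above usually comes down to a submit-then-poll loop. Here's a minimal sketch of that pattern, assuming the service is asynchronous (you submit a job, then ask for the answer until it's ready). `submit` and `fetch_result` are hypothetical callables standing in for whatever HTTP calls a given provider needs; nothing here is any provider's real API.

```python
import time


def solve_with_polling(submit, fetch_result, poll_interval=5.0, timeout=120.0,
                       sleep=time.sleep):
    """Submit a captcha job, then poll until an answer arrives or time runs out.

    submit() must return a job id; fetch_result(job_id) must return the
    answer, or None while the job is still being worked on.
    """
    job_id = submit()
    waited = 0.0
    while waited <= timeout:
        answer = fetch_result(job_id)
        if answer is not None:
            return answer
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"captcha job {job_id} not solved within {timeout} seconds")


class FakeSolver:
    """Stand-in for a real service: answers on the third poll."""

    def __init__(self):
        self.calls = 0

    def submit(self):
        return "job-1"

    def fetch_result(self, job_id):
        self.calls += 1
        return "solved-token" if self.calls >= 3 else None


fake = FakeSolver()
# sleep is replaced with a no-op so the demo doesn't actually wait.
token = solve_with_polling(fake.submit, fake.fetch_result, sleep=lambda seconds: None)
```

Passing the two calls in as functions keeps the loop independent of the provider, so swapping DeathByCaptcha for 2captcha would only change the callables.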

 

Articles / videos

 

 

 

Python libraries

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Scrapyd is an application for deploying and running Scrapy spiders. It enables you to deploy (upload) your projects and control their spiders using a JSON API.
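
As a rough illustration of the JSON API, here's a sketch that builds a request to Scrapyd's schedule.json endpoint, which starts a spider run. The base URL assumes a local Scrapyd on its default port 6800, and the project name, spider name and "category" argument are made up.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider, **spider_args):
    """Build (but do not send) a POST to Scrapyd's schedule.json endpoint."""
    params = {"project": project, "spider": spider}
    # Extra parameters are handed to the spider as arguments.
    params.update(spider_args)
    return Request(
        base_url.rstrip("/") + "/schedule.json",
        data=urlencode(params).encode("ascii"),
        method="POST",
    )


# Hypothetical project and spider names.
request = build_schedule_request(
    "http://localhost:6800", "myproject", "myspider", category="books"
)
# urllib.request.urlopen(request) would send it; Scrapyd replies with JSON.
```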

 

Crawlera

Crawlera is a smart downloader designed specifically for web crawling and scraping. It allows you to crawl quickly and reliably, managing thousands of proxies internally, so you don’t have to.

Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains, or have other problems.
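
To make the idea concrete, here's a toy sketch of a proxy pool that throttles per proxy and per domain, and drops proxies that get banned from a domain. It is not Crawlera's implementation, just an illustration of the behaviour described above; the class, its methods and the proxy names are all invented.

```python
class ProxyPool:
    """Toy pool: per-domain throttling plus per-domain bans."""

    def __init__(self, proxies, delay=1.0):
        self._proxies = list(proxies)
        self._delay = delay          # minimum seconds between uses of a proxy on a domain
        self._banned = {}            # domain -> set of banned proxies
        self._ready_at = {}          # (proxy, domain) -> earliest next use

    def acquire(self, domain, now):
        """Return (proxy, seconds_to_wait) for the proxy that is free soonest."""
        banned = self._banned.get(domain, set())
        candidates = [p for p in self._proxies if p not in banned]
        if not candidates:
            raise RuntimeError(f"no usable proxies left for {domain}")
        proxy = min(candidates, key=lambda p: self._ready_at.get((p, domain), 0.0))
        ready_at = self._ready_at.get((proxy, domain), 0.0)
        wait = max(0.0, ready_at - now)
        # Reserve the next slot for this proxy on this domain.
        self._ready_at[(proxy, domain)] = max(now, ready_at) + self._delay
        return proxy, wait

    def ban(self, proxy, domain):
        """Stop using a proxy for a domain (e.g. after it got blocked there)."""
        self._banned.setdefault(domain, set()).add(proxy)


pool = ProxyPool(["proxy-a:8000", "proxy-b:8000"], delay=2.0)
first = pool.acquire("example.com", now=0.0)
second = pool.acquire("example.com", now=0.0)
pool.ban("proxy-b:8000", "example.com")
third = pool.acquire("example.com", now=0.5)  # only proxy-a is left, and it must wait
```

The caller passes in `now` and does the sleeping itself, which keeps the pool easy to test without real clocks.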

Accounts provide a standard HTTP proxy API, so you can configure it in your crawler of choice and start crawling.

  • It was recommended by someone on HN here.
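
Since the service is exposed as a standard HTTP proxy, plugging it into a crawler should look like configuring any other authenticated proxy. A minimal sketch with the standard library follows; the host, port and credential scheme (API key as username, empty password) are assumptions for illustration, so check the account settings for the real values.

```python
from urllib.parse import quote
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(host, port, username, password=""):
    """Return (opener, proxy_url) routing HTTP and HTTPS through one proxy."""
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{credentials}@{host}:{port}"
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler), proxy_url


# Placeholder host and key, not real endpoints or credentials.
opener, proxy_url = make_proxy_opener("proxy.example.com", 8010, "YOUR_API_KEY")
# opener.open("http://example.com/") would fetch the page through the proxy.
```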

 

 

This has a bunch of good info:

https://news.ycombinator.com/item?id=7375575

 

2011.09.28 - StackExchange - Amazon EC2 + S3 + Python + Scraping - The cheapest way of doing this?

 

2011.10.31 - StackExchange - Most efficient (time, cost) way to scrape 5 million web pages?

 

2012.08.10 - Michael Nielsen - How to crawl a quarter billion webpages in 40 hours

 

2013.03.26 - Seminar.io - Running Scrapy on Amazon EC2

 

2014.03.09 - Jake Austwick - Python web scraping resource

In this article I'm going to cover a lot of the things that apply to all web scraping projects and how to overcome some common gotchas.