Nathan Wailes
Web crawling / scraping
General info
Data sources
- http://commoncrawl.org/the-data/get-started/
- This looks great; you don't have to do any crawling yourself.
- https://github.com/commoncrawl/cc-mrjob
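- Common Crawl also has a queryable index, so you can look up captures of a specific site without processing the whole dataset. A minimal sketch (the crawl ID and the `index.commoncrawl.org` endpoint layout are assumptions based on Common Crawl's index service, not this page):

```python
from urllib.parse import urlencode

def cc_index_query(crawl_id, url_pattern):
    # Build a query URL for the Common Crawl CDX index API.
    # crawl_id (e.g. "CC-MAIN-2018-05") and the endpoint layout are
    # assumptions based on index.commoncrawl.org.
    base = "http://index.commoncrawl.org/{}-index".format(crawl_id)
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

query = cc_index_query("CC-MAIN-2018-05", "example.com/*")
# Fetch `query` with any HTTP client; the API returns one JSON record per
# line, each pointing at the WARC file/offset holding that capture.
```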
Captchas
Beating captchas
Information
Providers
- DeathByCaptcha
- Here's a copy of their API, since it's behind their paywall: DeathByCaptcha - API.pdf
- On 2018.01.02 I was helping someone figure out how to get DeathByCaptcha working for the "I'm not a robot"-type captcha. The DBC API was confusing, but he eventually figured it out; it looks like he was using the "Token API", which you can see here (PDF back-up).
- The docs said to look for a "sitekey" in the HTML, but I couldn't find one, so I assumed the docs were out-of-date. Instead I used the "Network" tab to track requests and noticed that the request to the reCaptcha server had a "k" URL parameter; I'm guessing that's the sitekey.
- You're then supposed to use DBC's Python API (which you download from their website; it's just a couple of Python files) to send them the sitekey and the URL, and they return the token you're supposed to include in your form submission.
- But because this guy didn't want to use a browser emulator, he instead used the Network tab to see what the POST request looked like when a user submitted the form, and had his Python code send the same kind of POST request.
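- The flow above can be sketched roughly like this. The request URL, sitekey, and form fields are made up for illustration; the one real detail is that reCaptcha forms expect the solved token in a `g-recaptcha-response` field. The actual DBC Token API call is behind their paywalled docs, so it's only described in a comment here:

```python
from urllib.parse import urlparse, parse_qs

def sitekey_from_request(request_url):
    # The reCaptcha requests seen in the Network tab carry the sitekey
    # in their "k" URL parameter (per the notes above).
    return parse_qs(urlparse(request_url).query)["k"][0]

# Hypothetical request URL as observed in the Network tab:
observed = ("https://www.google.com/recaptcha/api2/anchor"
            "?ar=1&k=6LeHypotheticalSiteKey&co=aHR0cHM6")
sitekey = sitekey_from_request(observed)

# After sending (sitekey, page URL) to DBC's Token API and getting a
# token back, the form POST just needs that token in the standard
# "g-recaptcha-response" field alongside the form's normal fields:
form_data = {
    "username": "someuser",                    # made-up form field
    "g-recaptcha-response": "TOKEN_FROM_DBC",  # placeholder token
}
```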
- 2captcha
Testing environments
Captcha types
reCaptcha v2 ("I'm not a robot")
- Reddit - ELI5: How do those checkbox "I'm not a robot" capchas work?
"If you're logged into your Gmail it assumes you're a human. If you're not you get the reCAPTCHA. Try it. We recently implemented the noCAPTHA for our high-volume online sales apps and assumed Google had some proprietary black magic at work. Nope, Gmail login."
"I have done development work on bots that can get around captchas and reCaptchas. We pass the captcha to a third party service via an API, it is solved, and we can continue off the page. Nothing is fool proof, not even a reCaptcha."
- 2captcha - New way of solving ReCaptcha V2
- 2captcha - How to solve “I’m not a robot” captchas
- 2016.06.15 - BlackHatSEO - How to Solve ReCaptcha V2
- 2017.02.28 - East-Ee Security - ReBreakCaptcha: Breaking Google’s ReCaptcha v2 using.. Google
Articles / videos
- DEF CON 23 - Ryan Mitchell - Separating Bots from the Humans
- She says the way to get around the behavioral checks is going to be to spoof those behaviors.
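- One common piece of behavior-spoofing (my example, not from the talk) is simply not firing requests at machine speed — inserting randomized, human-looking pauses between actions:

```python
import random
import time

def human_delay(base=1.0, jitter=2.0):
    # Wait a randomized, human-looking interval between actions instead
    # of hitting the server at a perfectly regular machine pace.
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# In a crawl loop (fetch() is a placeholder for your request code):
# for url in urls:
#     fetch(url)
#     human_delay()
```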
- 2012.08.10 - Michael Nielsen - How to crawl a quarter billion webpages in 40 hours
Python libraries
- Scrapyd is an application for deploying and running Scrapy spiders. It enables you to deploy (upload) your projects and control their spiders using a JSON API.
- Crawlera is a smart downloader designed specifically for web crawling and scraping. It routes requests through a pool of IPs it manages internally, throttling access by introducing delays and discarding IPs that get banned from certain domains or have other problems, so you can crawl quickly and reliably without managing thousands of proxies yourself. Accounts provide a standard HTTP proxy API, so you can configure it in your crawler of choice and start crawling.
    - It was rec'd by someone on HN here.
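- Because it's exposed as a standard HTTP proxy, wiring it up is just proxy configuration. A sketch (the API key is a placeholder, and the host/port are assumptions from Crawlera's docs at the time):

```python
# Crawlera exposes a standard HTTP proxy, so any client that supports
# proxies (requests, Scrapy, curl, ...) can use it. The API key below is
# a placeholder; the host/port are assumptions, so check your account docs.
CRAWLERA_APIKEY = "your-api-key"
proxy_url = "http://{}:@proxy.crawlera.com:8010/".format(CRAWLERA_APIKEY)
proxies = {"http": proxy_url, "https": proxy_url}

# With requests:  requests.get("http://example.com", proxies=proxies)
# With Scrapy, set the same URL on request.meta["proxy"] so the built-in
# HttpProxyMiddleware routes the request through it.
```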
This has a bunch of good info: https://news.ycombinator.com/item?id=7375575
- 2011.09.28 - StackExchange - Amazon EC2 + S3 + Python + Scraping - The cheapest way of doing this?
- 2011.10.31 - StackExchange - Most efficient (time, cost) way to scrape 5 million web pages?
- 2012.08.10 - Michael Nielsen - How to crawl a quarter billion webpages in 40 hours
- 2013.03.26 - Seminar.io - Running Scrapy on Amazon EC2
- 2014.03.09 - Jake Austwick - Python web scraping resource
    - "In this article I'm going to cover a lot of the things that apply to all web scraping projects and how to overcome some common gotchas."