Bypassing anti-scraping systems

Hello readers! Today we’re going to talk about how to bypass anti-scraping systems.

First, let’s talk about scraping. From Wikipedia:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Scraping is widely used in a variety of industries for various reasons. For example, it’s often used by e-commerce companies to spy on their competitors’ prices, by flight agencies to gather flight route data, by real estate companies to collect rental data, etc.

In many cases, the original author, owner or publisher of the data doesn’t want competitors to take that data for free, or at all, because they want to stay ahead of the competition.

These competing needs created two industries: one of companies specializing in scraping as a service, the other of companies specializing in scraping detection and prevention. The latter use the same bot detection techniques I wrote about in the past in the context of online advertising fraud, namely browser and bot fingerprinting. I also already wrote about cheating browser fingerprinting, so what is the difference this time?

The answer is that we are lazy and don’t want to work harder than necessary. Instead of carefully forging legitimate fingerprints for each anti-scraping vendor used by our targets, we can bypass them all entirely by breaking their security model assumptions. To get to the point: we don’t even need a bot to successfully scrape data off most websites.

But how? The answer is exploiting third-party JavaScript that runs on the target website and has access to the website’s DOM. We could, for example, hack the CDN of the website’s analytics provider and add our scraping code to the tracker, but that would be both difficult and highly illegal. Instead, we can use the easiest, cheapest way to get our code to run on someone’s website: you’ve guessed it, online advertising!

As long as the ads are rendered in a same-origin iframe, which is frequently the case, we can just access top.document, find what we need, extract the data and send it back to our servers. No bot required whatsoever. Modern ad serving systems make our lives easier because we can target specific websites of interest, and once our creative loads in their page, we can use iframes, XHRs and other techniques to extract data from even more pages on the site.
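The whole trick hinges on the "same-origin iframe" condition, so it is worth seeing what that condition means. Below is a minimal sketch, from the publisher's side, of checking whether a frame URL shares an origin (scheme, host, port) with the page that embeds it. The function names and the example URLs are hypothetical, and the check is deliberately simplified: it only looks at URLs, so it says nothing about frames created from about:blank or srcdoc (which inherit the parent's origin), nor about sandbox attributes.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, so that https://host and https://host:443 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def is_same_origin(frame_url: str, page_url: str) -> bool:
    """True if a script in frame_url could reach page_url's DOM via `top`."""
    return origin(frame_url) == origin(page_url)


if __name__ == "__main__":
    page = "https://shop.example/product/1"  # hypothetical publisher page
    for frame in (
        "https://shop.example/ads/slot.html",  # served from the site itself
        "https://ads.example.net/slot.html",   # served from a separate ad domain
    ):
        print(frame, "->", is_same_origin(frame, page))
```

A frame that fails this check is isolated from the page by the browser's same-origin policy; a frame that passes it is effectively first-party code, whatever its content.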

This method is just a special case of malvertising or, even more generally, third-party risk. This category includes many more attacks. Recently I dove deeply into this space, as I’ve seen several interesting attacks (Magecart etc.) and startup companies offering solutions to them. Maybe I’ll write some more about that. Hope you enjoyed the post, and feel free to contact me for further discussion!
