Detecting Privacy Badger’s Canvas FP detection

Hello readers! As promised in the previous blog post, today I’ll write (a bit more technically) about third-party JS security, but from a different angle.

Privacy Badger

Privacy Badger is a privacy-focused browser extension by the EFF that detects and blocks third-party trackers. Unlike other extensions, it does so by analyzing tracking behaviors rather than relying on a domain blacklist.

Canvas fingerprinting

One of these tracking behaviors is canvas fingerprinting, which I briefly mentioned in previous blog posts. Generally speaking, canvas fingerprinting is a method of generating a stateless, consistent, high-entropy identifier from the HTML5 canvas element by drawing several graphics primitives into it and then serializing its pixels. Different browsers and devices produce slightly different pixels due to differences in their graphics rendering stacks. You can read the paper “Pixel Perfect: Fingerprinting Canvas in HTML5” for more info.
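To make the idea concrete, here is a minimal sketch of the draw-then-serialize step. The function names, the drawn text, the colors and the FNV-1a hash are my own illustrative choices, not taken from any specific tracker; real fingerprinting scripts draw far more and hash differently. The function takes the document as a parameter so it can be exercised outside a browser with a stub.

```javascript
// Hypothetical helper: 32-bit FNV-1a hash of a string, returned as 8 hex chars.
function hashString(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

// Hypothetical sketch: draw a few primitives, serialize the pixels, hash them.
// `doc` is a Document (or a Document-like stub).
function canvasFingerprint(doc) {
  const canvas = doc.createElement('canvas');
  canvas.width = 240;
  canvas.height = 60;
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = '16px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillRect(10, 10, 100, 30);           // manipulation: a filled rectangle
  ctx.fillStyle = '#069';
  ctx.fillText('Cwm fjordbank glyphs vext quiz', 2, 15); // manipulation: text
  return hashString(canvas.toDataURL());   // serialization, then hashing
}
```

In a browser you would call canvasFingerprint(document). The identifier is stateless because nothing is stored on the device: the same device simply renders the same pixels again.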

Privacy Badger Canvas fingerprinting detection

From the Privacy Badger website:

Does Privacy Badger prevent fingerprinting?

Browser fingerprinting is an extremely subtle and problematic method of tracking, which we documented with the Panopticlick project. Privacy Badger 1.0 can detect canvas based fingerprinting, and will block third party domains that use it. Detection of other forms of fingerprinting and protections against first-party fingerprinting are ongoing projects. Of course, once a domain is blocked by Privacy Badger, it will no longer be able to fingerprint you.

How Privacy Badger detects canvas fingerprinting

Privacy Badger injects fingerprinting.js, along with several other content scripts, as specified in its manifest.json, into all the frames (all_frames: true) of all the pages ("matches": [ "<all_urls>" ]) visited by the user, before any other script in the page has executed (run_at: document_start).

Content scripts have access to their frame’s DOM, but run in a separate JavaScript context. Because the goal of the script requires monitoring things that happen in the page’s JS context (canvas manipulation and serialization), this content script injects another, self-removing script into the frame’s DOM, which executes in the page’s JS context.

This script hooks into several canvas-related APIs, including fillText (manipulation) and toDataURL (serialization). I wrote about JS hooking before, in the context of spoofing viewability measurements. Whenever one of these APIs gets called, the Privacy Badger hook figures out the caller script’s URL from within the call stack.
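The hooking pattern can be sketched roughly as follows. This is a simplified illustration, not Privacy Badger’s actual code: FakeContext stands in for CanvasRenderingContext2D so the sketch runs outside a browser, hookMethod and the report callback are hypothetical names, and the stack parsing assumes the V8 stack format (where the first line is the error header).

```javascript
// Stand-in for a native API (assumption: lets the sketch run without a DOM).
class FakeContext {
  fillText(text) { return `drew:${text}`; }
}

const reports = [];

// Hypothetical helper: replace proto[propName] with a wrapper that reports
// each call, together with a rough description of the caller, then delegates.
function hookMethod(proto, propName, report) {
  const orig = proto[propName];
  function wrapped(...args) {
    // In V8, line 0 is "Error", line 1 is this wrapper, line 2 is the caller.
    const stack = new Error().stack || '';
    const callerLine = (stack.split('\n')[2] || '').trim();
    report({ prop: propName, caller: callerLine });
    return orig.apply(this, args);
  }
  proto[propName] = wrapped;
  return orig;
}

hookMethod(FakeContext.prototype, 'fillText', (msg) => reports.push(msg));
const result = new FakeContext().fillText('hello world');
```

After the call, reports holds one entry naming fillText, and result is whatever the original method returned, so the page’s own code doesn’t notice a functional difference.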

Threat Model

When designing and implementing fingerprinting countermeasures, there are two significant concerns:

  • Observability: trackers can fingerprint the presence of the fingerprinting countermeasure itself and use it as another data point in the fingerprint.
  • Bypassability: trackers can evade the fingerprinting countermeasure or render it useless, thus getting access to the desired fingerprinted feature.

Vulnerabilities in Privacy Badger canvas fingerprinting detection

  • Observability of the canvas API hooking:

As I wrote previously in depth in “JavaScript tampering – detection and stealth” (my most visited blog post so far!), there are several methods to detect that a native function was tampered with. Privacy Badger recognized this threat and tries to hide the tampering by setting the length, name, and toString properties of the hooked functions to match those of the original. However, it only overrides the toString property of the hook itself and leaves the native Function.prototype.toString untouched, so a tracker can write:

Function.prototype.toString.call(HTMLCanvasElement.prototype.toDataURL);

And get:

"function wrapped() {
          var args = arguments;
...

Of course, it also won’t pass the prototype and hasOwnProperty test (detailed explanation here).
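The checks mentioned above can be combined into one small detector. This is a sketch under my own naming (looksTampered is hypothetical), and it uses Math.max as a stand-in for a native canvas method so it can run anywhere. It relies on two facts: built-in methods print as [native code] through Function.prototype.toString, and built-in non-constructor functions have no own prototype property, unlike an ordinary function declaration.

```javascript
// Hypothetical detector: returns true if `fn` does not look like an
// untouched built-in method.
function looksTampered(fn) {
  const source = Function.prototype.toString.call(fn);
  const nativeSource = /\{\s*\[native code\]\s*\}\s*$/.test(source);
  const ownToString = Object.prototype.hasOwnProperty.call(fn, 'toString');
  const ownPrototype = Object.prototype.hasOwnProperty.call(fn, 'prototype');
  return !nativeSource || ownToString || ownPrototype;
}

// A hook that tries to hide itself by overriding its own toString.
const origMax = Math.max;
function wrappedMax() { return origMax.apply(this, arguments); }
wrappedMax.toString = function () { return 'function max() { [native code] }'; };

const nativeVerdict = looksTampered(Math.max);   // untouched built-in
const hookedVerdict = looksTampered(wrappedMax); // disguised wrapper
```

The disguised wrapper still fools a naive wrappedMax.toString() check, but each of the three checks in looksTampered flags it on its own.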

  • Bypassability of the APIs hooking

Privacy Badger recognized the threat of site code tampering with its own code, and tries to prevent this by copying the objects it uses into its own function scope. However, it still relies on prototype-inherited methods inside the hook code itself, and these methods can be abused to steal the reference to the original API. Let’s look closely at the hook code itself, which gets called whenever a consumer calls one of the hooked canvas APIs:

        function wrapped() {
          var args = arguments;

          if (is_canvas_write) {
            // to avoid false positives,
            // bail if the text being written is too short
            if (!args[0] || args[0].length < 5) {
              return orig.apply(this, args);
            }
          }

          var script_url = (
              V8_STACK_TRACE_API ?
                getOriginatingScriptUrl() :
                getOriginatingScriptUrlFirefox()
            ),
            msg = {
              obj: item.objName,
              prop: item.propName,
              scriptUrl: script_url
            };

          if (item.hasOwnProperty('extra')) {
            msg.extra = item.extra.apply(this, args);
          }

          send(msg);

          if (is_canvas_write) {
            // optimization: one canvas write is enough,
            // restore original write method
            // to this CanvasRenderingContext2D object instance
            this[item.propName] = orig;
          }

          return orig.apply(this, args);
        }

As we can see, there’s an interesting exception: if is_canvas_write is true and the first argument is missing or shorter than 5 characters, the original function gets called using the prototype-inherited apply method, and the hook returns before send(msg) is called. To avoid false positives, Privacy Badger doesn’t consider such a call a fingerprinting attempt.

We can look a few lines up and see that is_canvas_write is computed as:

      var is_canvas_write = (
        item.propName == 'fillText' || item.propName == 'strokeText'
      );

So, our attack will look like this:

    • Hook the apply method
    • Call the hooked fillText or strokeText
    • Steal the reference to the original fillText or strokeText
    • Write text of 5 or more characters to the canvas using the original function

Let’s implement a PoC:

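const _apply = Function.prototype.apply;
let original;
Function.prototype.apply = function (thisArg, args) {
	// `this` is the function that apply was invoked on
	if (this.name === 'fillText' || this.name === 'strokeText') {
		original = this;
		// reference captured: restore the original apply
		Function.prototype.apply = _apply;
	}
	// forward the call so other users of apply keep working
	return Reflect.apply(this, thisArg, args || []);
};

Then, we call the hooked function with a short text, so the hook takes the early-return path:

const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.fillText('a', 0, 0);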

And now we have the original fillText:

original
ƒ fillText() { [native code] }

Voilà!

The same technique can be used to extract the original serialization method, toDataURL. Notice the call to getOriginatingScriptUrl, which also uses prototype-inherited methods that can be tampered with.

Another bypass method is to obtain references to the original APIs by using the iframe sandbox attribute. This attribute allows us to specify permissions for the content inside the iframe. If we specify the allow-same-origin permission and don’t specify the allow-scripts permission, the script injected by the content script won’t execute, according to the sandbox policy[1], but the embedding page will still be able to access the iframe’s contentWindow and obtain an unhooked canvas from it.

That’s it for today! Although this topic could be expanded even more, I’ll save something for next time 🙂
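A rough sketch of the sandbox approach, assuming the behavior described in the footnote (i.e., the injected script really doesn’t run inside the sandboxed frame). The function name getUnhookedCanvasApi is hypothetical, and the document is passed in as a parameter so the logic can be exercised with a stub.

```javascript
// Hypothetical sketch: create a same-origin iframe without allow-scripts
// and read the canvas methods from its (assumed unhooked) window.
function getUnhookedCanvasApi(doc) {
  const frame = doc.createElement('iframe');
  // allow-same-origin without allow-scripts: no script runs inside the frame,
  // but the parent can still reach into its contentWindow.
  frame.setAttribute('sandbox', 'allow-same-origin');
  doc.documentElement.appendChild(frame); // contentWindow exists once attached
  const win = frame.contentWindow;
  return {
    toDataURL: win.HTMLCanvasElement.prototype.toDataURL,
    fillText: win.CanvasRenderingContext2D.prototype.fillText,
  };
}
```

In a browser this would be called as getUnhookedCanvasApi(document), and the returned functions could then be invoked with .call on a canvas or context object.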

Hope you enjoyed, and feel free to contact me to discuss any of it!

[1] This is currently true in Firefox, but not in Chrome. In the past I observed the same behavior in Chrome, but from my tests it seems that a DOM script added by a content script now executes inside sandboxed iframes. I’m not sure if that’s intentional.

Bypassing anti scraping systems

Hello readers! Today we’re going to talk about how to bypass anti scraping systems.

First, let’s talk about scraping. From Wikipedia:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Scraping is widely used in a variety of industries for various reasons. For example, it’s often used by e-commerce companies to spy on their competitors’ prices, by flight agencies to gather flight route data, by real estate companies for rental data, etc.

In many cases, the original author, owner or publisher of the data doesn’t want competitors to take the data for free, or at all, in order to stay ahead of the competition.

These competing needs created two industries: one of companies specializing in scraping as a service, the other of companies specializing in scraping detection and prevention. The latter use the same bot detection techniques I wrote about in the past in the context of online advertising fraud, namely browser and bot fingerprinting. I also already wrote about cheating browser fingerprinting, so what is the difference this time?

The answer is that we are lazy and don’t want to work harder than necessary. Instead of carefully forging legit FPs for each anti-scraping vendor used by our targets, we can just bypass it all entirely by breaking their security model assumptions. To get to the point: we don’t even need a bot to successfully scrape data off most websites.

But how? The answer is exploiting third-party JavaScript that runs on the target website and has access to the website’s DOM. We could, for example, hack the CDN of the website’s analytics provider and add our scraping code to the tracker, but that would be both difficult and highly illegal. Instead, we can use the easiest, cheapest way to get our code to run on someone’s website. You’ve guessed it: online advertising!

As long as the ads are rendered in a same-origin iframe, which is frequently the case, we can just access the top.document, find what we need, extract the data and send it back to our servers. No bot required whatsoever. Modern ad serving systems make our lives easier because we can target specific websites of interest, and once our creative loads in their page, we can use iframes, CORS XHRs and other techniques to extract data from even more pages on the site.

This method is just a special case of malvertising, or even more generally, third-party risk. This category includes many more attacks. Recently I dived deeply into this space, as I’ve seen several interesting attacks (Magecart etc.) and startup companies offering solutions to those. Maybe I’ll write some more about that. Hope you enjoyed the post and feel free to contact me for further discussion!

Secure header bidding architecture

Hello readers! It’s been quite a while. Today, I want to suggest a less insecure bot filtering architecture for online advertisers and publishers.

First, let’s have a quick overview of the current situation and define some terms:

Header bidding

Forget all the nonsense about a “wrapper” that “sits in the header” and “wraps calls to the ad server with other demand partners”. I’ve read and heard explanations along these lines way too many times, and they are a great example of complete adtech (jargon) madness. Like pretty much anything else in the browser, the “wrapper” is just a script that makes some network calls and executes some logic. Whether it’s located in the <head> or <body> is a completely irrelevant implementation detail. </rant>.

In technical terms, a header bidding library defines an interface that its adapters implement, allowing its consumers to use a unified API to communicate with all of their demand partners. It was originally invented as a hack to create fair competition between AdX supply and other SSPs. The de facto standard is Prebid.js (pbjs), which was originally created by AppNexus. Most other “unique proprietary header bidding wrappers” are just clones of pbjs with a changed global variable name, and maybe some other niche features that cost way too much money for their value.

The general flow of pbjs looks like this:

The user loads the page, and pbjs is loaded with it. The page includes definitions of its ad units, sizes, media types, etc., and pbjs sends this info, along with the user info, to the configured demand partners by invoking their adapters. Then pbjs waits for responses from all demand partners or for a specified timeout, and passes all the returned bids to the ad server using key-value targeting. The ad server receives the ad request with the competing bids and runs its ad selection process. It returns its own creative if it has a better-paying line item; if not, it returns a pre-configured pbjs creative that basically tells pbjs to render its highest bid’s ad markup.
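The flow above can be sketched in a few lines. To be clear, this is not Prebid.js code: the adapter shape (requestBid), runAuction, toTargeting and withTimeout are hypothetical names, and formatting the price with two decimals is a simplification of real price bucketing. Only the hb_pb, hb_adid and hb_bidder key names come from actual header bidding requests.

```javascript
// Resolve with the promise's value, or with null once `ms` has passed.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Turn the returned bids into key-value targeting for the ad server.
function toTargeting(bids) {
  if (bids.length === 0) return {};
  const winner = bids.reduce((best, bid) => (bid.cpm > best.cpm ? bid : best));
  return {
    hb_pb: winner.cpm.toFixed(2), // simplification of real price bucketing
    hb_adid: winner.adId,
    hb_bidder: winner.bidder,
  };
}

// Invoke every adapter, wait for all responses or the timeout, keep the bids.
async function runAuction(adapters, adUnit, timeoutMs) {
  const settled = await Promise.all(
    adapters.map((adapter) =>
      withTimeout(adapter.requestBid(adUnit).catch(() => null), timeoutMs)
    )
  );
  return toTargeting(settled.filter(Boolean));
}
```

A slow or failing demand partner simply contributes no bid, which is why the timeout value has such a direct effect on both latency and revenue.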

So far so good.

Bot detection

As I wrote before, current bot detection for advertising has two phases, pre-bid and post-bid. To summarize, pre-bid detection is based on the network properties of the client that makes the bid request, and post-bid is based on browser fingerprinting. The advantage of pre-bid detection is that it happens before money is spent; the disadvantage is that it’s less accurate and reliable than post-bid. Post-bid detection, on the other hand, is more secure, but occurs after the ad is already served and the money is already spent.

 

Server to server

Remember I wrote about pbjs, “Then pbjs waits for response from all demand partners or a specified timeout”? It’s actually a bigger deal than it sounds at first. Latency is a big burden on both user experience and CPM rates (as it negatively affects viewability and the number of served impressions), so a solution emerged in the form of Prebid Server. It works the same as the regular pbjs flow, but instead of invoking bidders from the client’s browser, it invokes them from a dedicated server, returning only the winning bid to the client.

But it also breaks the security model of pre-bid bot detection, since by design the client no longer makes the bid request to the SSP. That’s bad, because now bots are detected only after money is spent.

 

My suggested solution

A bot detection vendor sets up a Prebid Server that sits as a proxy between the publisher and the SSPs. The vendor should then convince publishers to add its code, perhaps as a Prebid module, which will contain its bot fingerprinting code and append the results to the initial request to the Prebid Server. The results should be mandatory: if they are not included, no bid requests will be passed to demand partners.
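The gatekeeping logic on the proxy side could look roughly like this. Everything here is a hypothetical illustration of the proposal: the request shape (botSignals, adUnits), the verifySignals and forwardToBidders callbacks, and the status codes are my own assumptions, not part of Prebid Server.

```javascript
// Hypothetical proxy handler: bid requests reach demand partners only if
// the client-side fingerprinting results are present and look human.
function handleAuctionRequest(request, verifySignals, forwardToBidders) {
  if (!request.botSignals) {
    // Results are mandatory: without them nothing is passed downstream.
    return { status: 400, bids: [], reason: 'missing bot detection results' };
  }
  const verdict = verifySignals(request.botSignals);
  if (!verdict.human) {
    return { status: 403, bids: [], reason: 'classified as bot' };
  }
  return { status: 200, bids: forwardToBidders(request.adUnits) };
}
```

The important property is that forwardToBidders is never called for missing or failing results, so the detection happens before any money is spent.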

With this architecture, advertisers get the best of both worlds: the security of post-bid detection solutions with the money saving effect of pre-bid detection solutions.

For publishers, the incentive to implement this module will be higher CPMs and viewability rates, since all of their traffic is now validated. It’s also technically possible to create a module that reads the client-side pbjs config and automatically redirects it through the Prebid Server, which saves the publisher the technical overhead of moving to Prebid Server while gaining the latency improvements.

The challenge of this solution is the cost of the required network and computational resources, but maybe it can be done economically.

Let me know what you think! I’m waiting to see more JIRA referrers to the blog now 😉

How Google Ad Manager counts impressions

Hello readers! It’s been a while. Today I’ll write about how Google Ad Manager, perhaps the most widely used ad server in the world, technically counts impressions. This topic is slightly off the usual adtech-madness / security / fraud I usually write about, but I think this information can be useful for both adops people and adtech developers out there who read my blog (yes, I do see all the JIRA URLs in the referrers).

First, let’s define two different terms:

  • Served impression: An impression counted when ad is sent, or “served” to a publisher by Ad Manager. The ad creative may or may not be rendered, downloaded in the user’s device, or viewed by the user.
  • Downloaded impression: An impression counted only after the initiation of ad retrieval by the publisher. The ad creative is downloaded in the user’s device and has begun to load. A separate viewability measurement will determine if the user has viewed the creative.

As documented by Google, since October 2017, Ad Manager counts downloaded impressions rather than counting the previously used served impressions.

Downloaded impressions are counted in Ad Manager’s report metric “Total impressions”, as defined in this documentation: “Total impressions from the Google Ad Manager server, AdSense, Ad Exchange, and yield group partners.” 

Now, let’s take a look at the mechanisms that enable the counting of served and downloaded impressions:

  • Served impressions: counted when ad response is generated by Ad Manager, following an ad request sent from a client (GPT, Mobile SDK, etc), minus empty responses. A typical ad request is an HTTP GET request of the following structure:
https://securepubads.g.doubleclick.net/gampad/ads?gdfp_req=1&pvsid=1695018879342553&correlator=4175412589612631&output=ldjh&callback=googletag.impl.pubads.callbackProxy1&impl=fifs&adsid=NT&json_a=1&eid=21064552%2C21063146%2C21063669%2C21063818%2C21064521%2C370204053&vrg=2019090501&guci=2.2.0.0.2.2.0.0&plat=1%3A32776%2C2%3A32776%2C8%3A134250504&sc=1&sfv=1-0-35&ecs=20190912&iu_parts=123456789%2CAdunit&enc_prev_ius=%2F0%2F1%2C%2F0%2F1&prev_iu_szs=970x90%7C728x90%2C970x90%7C728x90&prev_scp=pos%3Dheader%26loc%3Datf%7Cpos%3Dbottom%26loc%3Dbtf%26hb_pb_appnexus%3D0.07%26hb_adid_appnexus%3D15d03fa1555417%26hb_bidder_appnexus%3Dappnexus%26hb_pb%3D0.07%26hb_adid%3D15d03fa1555417%26hb_bidder%3Dappnexus&eri=1&cust_params=url%3D%25&cookie_enabled=1&bc=31&abxe=1&lmt=1568307941&dt=1568307941545&dlt=1568307939347&idt=1736&frm=20&biw=1286&bih=150&oid=3&adxs=158%2C63&adys=20%2C1281&adks=3320811415%2C3303837207&ucis=1%7C2&ifi=1&u_tz=180&u_his=2&u_h=768&u_w=1366&u_ah=744&u_aw=1301&u_cd=24&u_nplug=2&u_nmime=2&u_sd=1&flash=0&url=https%3A%2F%2Fwww.example.com%2F&dssz=51&icsg=149533715834928&std=3&vis=1&dmc=8&scr_x=0&scr_y=0&psz=1286x3369%7C1160x130&msz=1286x130%7C1160x90&ga_vid=1107594786.1568307942&ga_sid=1568307942&ga_hid=1496006620&fws=4%2C4&ohw=1286%2C1286

It contains URL parameters with data about the user, device, publisher, page view, ad unit, key-value targeting (in this case, including a competing header bidding bid from AppNexus) and more. Ad Manager then runs its ad selection process using this data and increments the served impressions count, unless no appropriate ad is found for the ad request.
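If you want to inspect such a request yourself, the parameters are easy to pull apart. Here is a small sketch using a shortened version of the URL above. The interpretation is my own reading of the example and not documented by Google: I assume prev_iu_szs lists slots separated by commas with sizes separated by |, and prev_scp lists per-slot key values separated by |. parseSlots is a hypothetical helper name.

```javascript
// Shortened version of the ad request above (only the parameters we parse).
const adRequest = new URL(
  'https://securepubads.g.doubleclick.net/gampad/ads?gdfp_req=1' +
  '&iu_parts=123456789%2CAdunit' +
  '&prev_iu_szs=970x90%7C728x90%2C970x90%7C728x90' +
  '&prev_scp=pos%3Dheader%26loc%3Datf%7Cpos%3Dbottom%26loc%3Dbtf' +
  '%26hb_pb%3D0.07%26hb_adid%3D15d03fa1555417%26hb_bidder%3Dappnexus'
);

// Hypothetical helper: one entry per slot, with its sizes and key values.
function parseSlots(url) {
  const query = url.searchParams; // values are already percent-decoded
  const sizesPerSlot = (query.get('prev_iu_szs') || '').split(',');
  const targetingPerSlot = (query.get('prev_scp') || '').split('|');
  return sizesPerSlot.map((sizeList, i) => ({
    sizes: sizeList.split('|'),
    targeting: Object.fromEntries(new URLSearchParams(targetingPerSlot[i] || '')),
  }));
}

const slots = parseSlots(adRequest);
```

Under these assumptions, the second slot carries the header bidding keys (hb_bidder, hb_pb, hb_adid), which is how the competing bid reaches Ad Manager’s ad selection.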

  • Downloaded impressions: A typical ad response is contained within the HTTP response body from Ad Manager, and looks like this:

    {
        "/123456789/Adunit": ["html", 0, 0, null, 0, 90, 728, 0, 0, null, null, null, 1, [
                ["ID=77g96419f2b322c6:T=1568489951:S=ALMI_NZ7qGoFC9bbi5KjU33fgAMgJ48XZw", 16313278841, "/", "example.com"]
            ],
            [138236601521],
            [126004284],
            [57387324],
            [452672344], null, null, null, null, null, null, null, 0, null, null, null, null, null, null, "CiYIsOaKPOgB4sms_IIEggIMoMW2YZjGtmHw1b5h0QJdRAkCBGXItQ", "CNi3iq7iy-QCFQqxewodrpcMlw", null, null, null, null, null, null, null, null, ["011908231648370"], null, null, null, null, null, "1"
        ]
    }

    <!doctype html>… <!-- Ad HTML in here -->

    As you can see, it contains:

    • The ad unit which this ad response targets (i.e., the ad request made on behalf of this ad unit)
    • Data about the ad type, size (728×90 in this case), and whether the ad response is empty
      • In this case it is 0, which means false. Otherwise, the parameters described below this line would be null and the impression would not be counted, even as a served impression.
    • Additional response information such as line item, advertiser, creative and unique impression IDs.
    • The actual ad HTML

    After the ad response is received by the requesting client, it renders the impression by writing the actual ad HTML into the appropriate ad slot. The very first lines of the standard Ad Manager creative boilerplate (i.e., HTML/JS code that’s automatically added into every creative served by Ad Manager) look like this:

    <!doctype html>
    <html>
       <head>
          var inDapIF=true,inGptIF=true;
       </head>
       <body leftMargin="0" topMargin="0" marginwidth="0" marginheight="0">
          window.dicnf = {};
    data-jc="42">(function(){window.viewReq=[];function b(a){var c=new Image;c.src=a.replace("&","&");viewReq.push(c)}function d(a){fetch(a,{keepalive:!0,credentials:"include",redirect:"follow",method:"get",mode:"no-cors"}).catch(function(){b(a)})}window.vu=function(a){window.fetch?d(a):b(a)};}).call(this);
    vu("https://securepubads.g.doubleclick.net/pcs/view?xai\\x3dAKAOjsvlNB-1GEn9_F-ujVRLeZRY2uFIKCTz0iCUIZnYnMe3FedtBHuK2tHy2YcAyHovATfoX1xQ1upqHi5dW7Yaqo32mOaGOtIWdk41n8qMPlFXQmw8xr8jgkamv0xy9RorCUG_EefC97MYvCy33W1Yz3uCsIxj_XfYNNj_RiFFa_f3bm2Fje0M_C0EGOb8vwp7ID8jkuVJWrP106UlilVdPjj9zS7J6QM0fKIksO_Z3ArqyR8l7vjAovsFyhe8Uq0pKLHXryUF9LXWjHxLm2Ho29vf\\x26sig\\x3dCg0ArKJSzKN18dIe4SN8EAE\\x26urlfix\\x3d1\\x26adurl\\x3d")

    As you can see, the last two scripts that appear in this excerpt do the following:

    • Define a function named vu that sends a GET request to a given URL
    • Invoke the vu function with a URL specific to this ad impression

These scripts are executed synchronously (since they are parser-inserted inline scripts), which means that before anything else, the browser will fire the view HTTP request, and once it hits Ad Manager, it will be counted as a downloaded impression.
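For readability, here is my own de-minified restatement of what the vu function does. It is not Google’s code: the names are mine, the browser globals (fetch, Image) are passed in through a hypothetical env object so the sketch can run outside a browser, and I left out the string replace that the original applies to the URL before using it as an image source.

```javascript
// Hypothetical restatement of the view-ping logic.
// env.fetch and env.Image stand in for window.fetch and window.Image.
function makeViewPinger(env) {
  const pending = []; // plays the role of window.viewReq

  // Fallback: request the URL by loading it as an image.
  function pingWithImage(url) {
    const img = new env.Image();
    img.src = url;
    pending.push(img); // keep a reference so the request isn't dropped
  }

  // Preferred: a keepalive fetch, falling back to the image on failure.
  function pingWithFetch(url) {
    env.fetch(url, {
      keepalive: true,
      credentials: 'include',
      redirect: 'follow',
      method: 'get',
      mode: 'no-cors',
    }).catch(() => pingWithImage(url));
  }

  function vu(url) {
    if (env.fetch) pingWithFetch(url);
    else pingWithImage(url);
  }

  return { vu, pending };
}
```

The keepalive option is the interesting design choice: it lets the view request complete even if the frame is torn down right after the creative starts loading.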

I hope you find this information useful!

Ad injection 101 – History and technical overview

Hello readers! It’s been a while. I went through a rather busy period, but today I’m ready to present you with yet another adtech madness: ad injection.

Ad injection is the practice of modifying web pages on the client side, by a third party application, in order to present the user with its own ads. The result is that the third party app monetizes the user browsing sessions, instead of the publishers.

In cyber-security speak, ad injection is a man-in-the-browser (MitB) attack that targets ad serving and revenue.

Ad injection is a sub niche of “adware”, software that’s designed to generate ad impressions.

History of ad injection

While no one knows for sure when the first ad injection software was created, one thing is clear: a boom in this industry began around late 2012. Ad injections were certainly around before, as shown in the 2008 study “Detecting In-Flight Page Changes with Web Tripwires“.

However, things changed around late ’12. Modern browsers let users install extensions from all over the web, not only web store approved ones, without asking any questions.

Ad injection companies realized they can spend $1 on user acquisition and monetize the same user at $2 (or even more), so it wasn’t long until VC money started pouring into the industry. I’ve personally witnessed some companies within this space grow to >$10M in revenues and >$100M valuations in less than two years.

These companies were wild and greedy. They didn’t give a shit about the users who got infected with their adware. But how did they get users to install it?

In the beginning, ad injection companies were doing both distribution (getting users to install) and monetization (connections to the advertising industry and adops work to improve revenues). The two main distribution channels were malvertising and installers.

The malvertising side used deceptive ads that convinced the users they need to install an update, some tool to fix PC errors, or player to watch a video. Here’s an example:

The “outdated download manager” ad was actually injected into the page by an ad injector, but it leads the user to a landing page that tries to make them install another ad injector. These situations happened because on the monetization side, ad injection companies started “ad networks” in order to connect their supply to advertising demand.

Since they didn’t really have quality standards, ad injection and malvertising were strongly connected, with some of the ad injection ad networks delivering almost exclusively malvertising, often for other ad injection companies. This was studied in depth in the paper “Understanding Malvertising Through Ad-Injecting Browser Extensions“.

The other distribution channel, installers, offered developers a custom installer that “bundles” more “offers” to install and generates revenue for the developer for each install: pay per install (PPI). The PPI industry is a long-time, well-known malware distribution vector, as shown in “Measuring Pay-per-Install: The Commoditization of Malware Distribution” and “The New Malware Distribution Network“. They often use dark patterns in order to trick users into installing malware.

In later years, the industry split and specialized: some companies focused on distribution, some on monetization. The former were extensively studied in the papers “Measuring PUP Prevalence and PUP Distribution through Pay-Per-Install Services” and “Investigating Commercial Pay-Per-Install and the Distribution of Unwanted Software“, and the latter in “Ad Injection at Scale: Assessing Deceptive Advertisement Modifications“.

Others, such as Komodia, provided the industry with the technology required to reliably intercept and modify traffic across different operating systems.

Technical overview

Ad injectors were most commonly implemented as browser extensions, which were easy to develop, maintain and distribute. After Google started to ban ad-injecting extensions, implementation shifted towards applications that used questionable techniques, from changing DNS and/or proxy settings in order to modify ad traffic, to injecting a DLL into the browser in order to achieve MitB and modify ads. These apps were horrible for security, as they routed traffic through untrusted servers, compromised the integrity of the browser process and installed bogus certificates. One famous case is the Lenovo / Superfish scandal, where Lenovo sold laptops with the Superfish adware and its self-signed certificate pre-installed.

Ad inventory characteristics

The interesting thing about all the ad inventory supply that was created by ad injectors is that it was never marked as invalid traffic. Remember, the ads were injected into real browsers used by real humans on legitimate websites. Today, injected inventory is considered “domain spoofing” at best, if the ad injector injects into an ads.txt-enabled website and does not sell the inventory through an authorized “reseller”.

Ad injections today

Ad injection is probably not as big as it used to be, but it still exists as a dark corner of the software and advertising industries. There’s even a startup called “Namogoo” that sells publishers a solution to prevent ad injections. Former companies in this space such as eDakan and Cabara are now defunct. The only exit of such a company so far belongs to ClarityRay, which was acquired by Yahoo! in 2014.