In the previous blog post of this series I discussed how browser fingerprinting is used for bot detection. In this post, I’ll discuss several approaches I’ve seen taken by bad guys (and by good guys trying to catch cloaking) to bypass these techniques.
In the context of browser fingerprinting used for user tracking, there are a few options:
- Blend in with the crowd: this approach is taken by the Tor Browser project. If all users produce the same fingerprint, they are all indistinguishable from each other and therefore un-trackable. It’s a pretty good strategy; however, it requires the user to run a custom browser and is burdensome to maintain, since every time the browser vendor ships a new feature that allows fingerprinting, it must either be normalized (think supported fonts) or, in case normalization is not feasible (think canvas fingerprinting), disabled entirely, which can break web pages that rely on this functionality for legitimate purposes.
- Rotating fingerprints: this approach is explored in depth in a paper from Microsoft Research. The idea is to slightly mutate some of the fingerprintable attributes of the browser on a per-session basis. The resulting fingerprint may still be unique, but only for the current browsing session, so there is no linkability that allows tracking the user across multiple sessions. This is similar to what incognito (a.k.a. private browsing) does for cookies, and complements it. The mutations must appear plausible, because an obvious lie can itself become a tracking point. For example, removing one specific text-to-speech voice is reasonable, but reporting a Firefox user agent when you’re clearly using a WebKit-based browser is a bad idea.
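A minimal sketch of the rotation idea (the attribute, voice names, and seeding scheme are all illustrative assumptions, not taken from the paper): mutate one plausible attribute, such as the reported text-to-speech voice list, differently in each session.

```javascript
// Illustrative sketch of per-session fingerprint rotation. The voice
// list and seeding scheme are made up for illustration. The mutation
// stays plausible: one voice is simply missing, rather than an obvious
// lie like a Firefox user agent on a WebKit-based browser.
const realVoices = ["Alex", "Daniel", "Samantha", "Victoria"];

function voicesForSession(voices, sessionSeed) {
  // Drop a single voice chosen by the session seed; each session
  // reports a slightly different, but still realistic, voice list.
  const drop = sessionSeed % voices.length;
  return voices.filter((_, i) => i !== drop);
}

// Two sessions produce different values for this attribute, so the
// fingerprints cannot be linked across sessions:
console.log(voicesForSession(realVoices, 0)); // [ 'Daniel', 'Samantha', 'Victoria' ]
console.log(voicesForSession(realVoices, 1)); // [ 'Alex', 'Samantha', 'Victoria' ]
```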
In the context of browser fingerprinting used for bot detection, these methods are limited:
- Blending in with the crowd, i.e. making all of your bots present the same fingerprint, might prevent tracking them individually, but it won’t prevent anyone from detecting them as bots. However, some bots do try to blend in with the crowd in a different way: they adjust their “browser profile” to one consistent with their declared user agent. For example, the Methbot operation took this approach. The code that defines the “window” object of the Meth browser contains logic that says: “if we pretend to be Google Chrome, expose a window.chrome object with properties similar to those of a real Chrome browser.” However, this approach carries very high overhead if the goal is full coverage, i.e. working universally across all fingerprinting methods, since every component of every browser exhibits its own idiosyncratic set of features, bugs, and behaviors. For example, think of vendor prefixes in DOM APIs and CSS rules. The bot developer would have to maintain a huge mapping of user agents to their corresponding entire WebIDLs, including undocumented APIs. Another challenge is emulating behaviors: while it’s easy to spoof the window.chrome object, or to set any other property really, it is much harder to correctly emulate the behavior of some APIs, for example the canvas addHitRegion method, and this needs to be done for each and every API exposed to scripts, for each browser type and version, which is impractical. That said, getting this to work for the specific data points collected by specific bot detection vendors is possible and has been seen in the wild.
- Rotating the fingerprint: same limitation as above; it may prevent linking your bots’ sessions to each other, but it won’t hide the fact that they are bots.
- Blocking the tracker: this might actually work in some cases, but if the bot detection vendor also has visibility into the bid requests, it can notice the discrepancy in the data: a high volume of bid requests without corresponding fingerprints raises suspicion. Still, this method can work well for blending a limited amount of bot traffic into legitimate traffic, as a 5% discrepancy in data is nothing unusual in adtech.
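To make the Methbot-style “browser profile” logic above concrete, here is a hedged reconstruction of the idea, not the actual Methbot source; the property names mirror real Chrome, but the values are placeholders.

```javascript
// Hedged reconstruction of the Methbot-style idea (not the actual
// Methbot code): expose a window.chrome object only when the declared
// user agent claims to be Google Chrome.
function buildFakeWindow(userAgent) {
  const win = { navigator: { userAgent } };
  if (/Chrome\//.test(userAgent)) {
    // Property names mirror a real Chrome browser; the values here
    // are placeholders for illustration.
    win.chrome = {
      app: { isInstalled: false },
      runtime: {},
      loadTimes: function () { return {}; },
      csi: function () { return {}; }
    };
  }
  return win;
}

const chromeWin = buildFakeWindow("Mozilla/5.0 ... Chrome/58.0.3029.110 Safari/537.36");
const firefoxWin = buildFakeWindow("Mozilla/5.0 ... Gecko/20100101 Firefox/53.0");
console.log("chrome" in chromeWin);  // true
console.log("chrome" in firefoxWin); // false
```

Spoofing the static shape of window.chrome like this is the easy part; as noted above, emulating the behavior of every API behind it is where the approach becomes impractical.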
However, there is another approach which is effective against specific bot fingerprinting libraries: the replay attack. Generally speaking, this attack describes a scenario where the attacker captures a valid token that authorizes access to a resource from its legitimate owner, and then replays the token in order to gain access to that resource.
In the ad fraud scenario, the attacker launches the attack in three phases:
- Bootstrap phase: capture the fingerprint produced by the attacked bot detection library on a variety of device, OS, browser, and version combinations, and build a DB mapping user agents to their fingerprints. One challenge is getting trusted non-bot traffic, but that’s not too hard for anyone with access to high-quality traffic, or with the willingness and ability to pay for it. Another challenge is omitting timestamps and replacing them with placeholders, to be dynamically replaced later, in real time, with appropriate timestamps (in a timezone corresponding to the declared geo).
- Setup phase: configure the bot to retrieve the appropriate fingerprint from the DB according to the user agent in use, replace the timestamps, intercept the fingerprinting library’s result, and swap in the retrieved fingerprint before it is submitted to the server.
- Money phase: make ’em dollars! $
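The first two phases can be sketched as follows; the DB shape, field names, and values are all assumptions for illustration, not taken from any real operation.

```javascript
// Illustrative sketch of the replay attack. The DB shape, field names
// and values are made up for illustration.

// Bootstrap phase: fingerprints captured from real devices, keyed by
// user agent, with timestamps swapped for placeholders.
const fingerprintDB = new Map([
  ["Mozilla/5.0 ... Chrome/58.0",
   '{"canvas":"a1b2c3","fonts":214,"ts":"{{TS}}","tz":"{{TZ}}"}']
]);

// Setup phase: before the fingerprint is submitted to the server,
// replace it with the captured one, filling in a fresh timestamp and
// a timezone matching the declared geo.
function replayFingerprint(userAgent, geoTimezone, now) {
  const template = fingerprintDB.get(userAgent);
  if (!template) return null; // no captured fingerprint for this UA
  return template
    .replace("{{TS}}", String(now))
    .replace("{{TZ}}", geoTimezone);
}

const fp = replayFingerprint("Mozilla/5.0 ... Chrome/58.0", "America/New_York", Date.now());
console.log(JSON.parse(fp).tz); // "America/New_York"
```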
One of the reasons it’s easy for the bad guys to develop such evasion techniques is that they can pretty easily get a trial account with the bot detection vendor’s service, which gives them access to the UI and to the service’s decisions. This closes the feedback loop and lets them test different techniques, or just different traffic sources, until they find one that passes as legit. To see this in action, I recommend reading Shailin Dhar’s great “Mystery Shopping Inside the Ad-Verification Bubble“, where he did exactly that.
That’s it for today. Let me know if you have any questions or topics you’re interested in, so I can decide what to write about next. Hope you enjoyed!