Bot detection 101 #2 – Entering browser fingerprinting

In the previous blog post of this series, I discussed pre-bid bot detection technology and its strengths and weaknesses. Today I will focus on post-bid bot detection, as defined in the previous post:

Post-bid detection is done after an impression has been served. The advantage is that many more data points are available here, so the detection rate is better. The disadvantage is that the money has already been spent at this point, so it’s more of a “damage report” than actual protection.

Personally, I like to think of bot detection as a reverse Turing test. In the traditional Turing test, a human asks an unknown entity questions in order to decide, based on its answers, whether it is a human or an algorithm, i.e. a bot. The reverse test is an algorithm asking the unknown entity questions in order to decide whether it is human.

Of course, solutions like CAPTCHA already exist, but prompting the user to solve a CAPTCHA is too much of a hassle in the advertising scenario. The user experience must be seamless, since no one would put in the effort of solving a CAPTCHA [1] just to view or click ads, so other means of asking questions are needed.

So instead of questioning the user directly, it’s possible to ask the user agent (the browser) a lot of different questions. The additional data points mentioned above are used to answer these questions, and they are gathered via browser fingerprinting.

The general idea is that, just like network stacks, browser implementations vary in ways that make it possible to differentiate between devices, operating systems, browser types and versions, and sometimes even between individual browser instances.

Once again, in the context of ad tech, there’s a distinction to be made between two different goals of browser fingerprinting:

  • Tracking users, e.g. “cookie-less tracking”, by extracting as much information (entropy) as possible about the specific browser instance in order to generate a pseudo-unique identifier that can be used to track the user across the web. This enables, among other things [2], personalized ad targeting.
  • Bot detection, e.g. “bot-printing”, by identifying differences between human users and bots: differences in the browser implementation itself, signals of known automation tools, or differences in how the browser is being used.

As promised, I will focus on the second goal of browser fingerprinting: bot detection.

But first, let’s understand what type of data we are actually talking about. The question that naturally follows is: what is a browser? Fundamentally, the browser is the user agent, which means it acts on behalf of the user, letting them view and navigate hypermedia resources over the WWW.

The next question is: what is the underlying technology that composes the browser? The full answer is awfully complex, but here is a high-level overview:

  • User interface. Things like the address bar, back button, bookmarks, etc. Everything the user uses to tell the browser what they want it to do.
  • Network stack. Used to fetch web pages, either via OS-supplied libraries or its own implementation: HTTP, FTP, SPDY, HTTP/2, etc.
  • Rendering engine. Parses HTML and XML into a DOM tree, parses CSS and applies it to the DOM to create the render tree, and paints the result to the screen.
  • JavaScript engine. Parses and executes JS code.
  • UI backend. Uses the underlying OS to draw widgets like input fields and buttons.
  • Plugins and extensions. Additional third-party code that adds functionality to the browser: PDF readers, Flash (RIP), dictionaries, etc.

This is just a partial list, so as you can see, there are many layers and components under the hood [3].

Technically speaking, the questioning is mostly done via JavaScript, the lingua franca of the web. Browsers expose rich information about each and every one of the components listed above to the JavaScript engine, either by intentionally exposing APIs to scripts or by leaking information through side channels.

If I get even more pedantic, bot detection vendors don’t actually ask all the questions in JavaScript. Whenever possible, they just use it to collect data and ask the questions on the backend, because, as I said before, they don’t want bot developers to know what they are looking for. Remember that everything transmitted to the client side can be seen by anyone.

Let’s look at some of the data points commonly collected by bot detection fingerprints (a minimal collection sketch follows the list):

  • List of plugins, fonts, text-to-speech voices, supported media types for audio and video
  • Canvas and Audio values
  • Support and behaviors of HTML5 APIs
  • Support and behaviors of JavaScript features
  • Support and behaviors of CSS rules
  • Support of non-standard browser features
  • Presence of bot-specific signatures
  • Presence of common automation tools
  • Presence of different UI components
  • Sizes of different UI components
  • Signals of different automation tools
  • Timing of different operations, such as timers, animations, and the execution of the fingerprinting script itself
  • Monitoring calls to different functions and capturing stack traces when they are called
  • Events such as mouse movements, clicks, and tab changes
  • Load/error tests of various resources with different schemes (protocols) and paths

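To make this concrete, here is a minimal sketch of what a client-side collection script could look like, covering a few of the data points above. It is illustrative only: the /fp/collect endpoint and the exact fields are my own hypothetical choices, and real vendors obfuscate both the payload and the collection logic.

```javascript
// Illustrative fingerprint collection sketch (not any vendor's actual code).
function collectFingerprint() {
  const fp = {};

  // Plugin list
  fp.plugins = Array.from(navigator.plugins || []).map(p => p.name);

  // Canvas value: render some text and capture the resulting pixels
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = '14px Arial';
  ctx.fillText('fingerprint test \u00e9\u00e8\u00ea', 2, 2);
  fp.canvas = canvas.toDataURL();

  // Basic environment and hardware hints
  fp.userAgent = navigator.userAgent;
  fp.languages = navigator.languages;
  fp.screen = [screen.width, screen.height, screen.colorDepth];
  fp.cores = navigator.hardwareConcurrency;
  fp.timezoneOffset = new Date().getTimezoneOffset();

  // How long the collection itself took is a data point as well
  fp.collectionTime = performance.now();

  return fp;
}

// Ship the raw data points to the backend; the actual "questions" are asked there.
navigator.sendBeacon('/fp/collect', JSON.stringify(collectFingerprint()));
```
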
And many, many more. All of these data points are posted back to the backend, where the questions are asked and the decisions are made. Let’s go over some frequently asked questions and the methods used to answer them (a few rough JavaScript sketches follow the list):

  • Does this browser spoof its user agent string?
    • Fingerprinting the rendering and JavaScript engines and comparing them to the reported user agent string (sketched below). For example, only Gecko (Firefox’s engine) supports the afterscriptexecute event, and only V8 (Chrome’s engine) supports the Error.captureStackTrace API.
    • Fingerprinting the underlying OS based on supported fonts, media codecs, plugins, or the sizes of UI components. For example, only macOS will have the “Apple Garamond” font, and input fields have different default dimensions across operating systems.
    • Fingerprinting the device type based on hardware characteristics (sketched below), for example taking the reported screen size and comparing it to the size from CSS media queries to detect spoofing. WebGL can be used to find the underlying GPU driver, which is a small closed set for iPhones, etc. The canvas and audio values can also be used to identify specific hardware.
  • Does this browser have a UI? Humans tend to use the UI, bots don’t.
    • Check the reported dimensions of scrollbars, the address bar, the status bar, etc. (which can also be used to fingerprint the OS)
    • Try to load specific XUL resources that are part of the UI in Firefox
    • Check for visual error cues, such as the default error image
  • Is this browser a specific bot implementation? Humans only use regular browsers.
    • PhantomJS exposes window._phantom and window.callPhantom (sketched below); some versions include the non-standard stackArray in exception objects, and stack traces of functions called from page.evaluate will contain “phantomjs://”.
    • JSDOM-based bots will expose window.__stopAllTimers
  • Does this browser run inside a virtual machine? Many bot farms run in the cloud.
    • Different timings of things like web workers, due to a different number of cores
    • Specific virtualization-related files exist in the file system (some bot detection vendors have used, or are still using, browser bugs that allow them to determine that)
    • Lack of, or partial support for, WebGL shaders and extensions, and specific GPU drivers
  • Are there any signals of common automation tools?
    • Fingerprinting different WebDriver clients. In Firefox, the extension adds a “webdriver” attribute to the documentElement (<html>); in IE, it exposes document.__webdriver_script_fn upon script execution.
    • The Selenium server listens by default on localhost:4444/wd/hub/status
  • Does this browser support fundamental web security features?
    • Bots usually turn off security features like Safe Browsing and maybe even TLS certificate validation. It’s possible to try to load dangerous resources, for example an https:// resource with an invalid certificate, and listen for the onload event, which shouldn’t fire in a regular browser.
    • Some bots go as far as turning off the Same Origin Policy, which is trivially checked by trying to access objects from another origin, which should always throw an exception (sketched below).
  • Does this browser actually paint to the screen? For bots, painting is unnecessary overhead (sketched below).
    • Checking the FPS using the timing of animations
    • The timing of reflow operations differs between real and headless browsers
    • Browser-specific attributes such as window.mozPaintCount in Firefox
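
To illustrate a few of these checks, here are some rough sketches. All of them are simplified, the function names are mine, and the feature tests are examples that change between browser versions, so treat them as sketches rather than a working detector. First, the user-agent consistency check based on engine-specific features:

```javascript
// Sketch: does the JavaScript engine match the claimed user agent?
function checkEngineConsistency() {
  const ua = navigator.userAgent;
  const claimsFirefox = /Firefox\//.test(ua);
  const claimsChrome = /Chrome\//.test(ua);

  // Gecko-only: the non-standard afterscriptexecute event handler on Document
  const hasGeckoFeature = 'onafterscriptexecute' in document;

  // V8-only: the non-standard Error.captureStackTrace API
  const hasV8Feature = typeof Error.captureStackTrace === 'function';

  return {
    claimsFirefox,
    claimsChrome,
    hasGeckoFeature,
    hasV8Feature,
    // A Chrome UA without V8 features, or a Firefox UA without Gecko features,
    // hints that the user agent string is spoofed.
    suspicious: (claimsChrome && !hasV8Feature) || (claimsFirefox && !hasGeckoFeature)
  };
}
```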
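
A device and OS consistency sketch: compare the reported screen against CSS media queries, and read the GPU vendor/renderer via the WEBGL_debug_renderer_info extension. Which combinations count as “suspicious” is exactly the kind of backend knowledge vendors keep to themselves.

```javascript
// Sketch: cross-check reported hardware characteristics.
function checkDeviceConsistency() {
  const result = {
    reportedScreen: [screen.width, screen.height],
    // A spoofed screen size usually doesn't survive a media-query cross-check
    mediaQueryMatches: window.matchMedia(
      '(device-width: ' + screen.width + 'px) and (device-height: ' + screen.height + 'px)'
    ).matches
  };

  // The GPU vendor/renderer string is a small closed set on devices like iPhones
  const gl = document.createElement('canvas').getContext('webgl');
  if (gl) {
    const info = gl.getExtension('WEBGL_debug_renderer_info');
    if (info) {
      result.gpuVendor = gl.getParameter(info.UNMASKED_VENDOR_WEBGL);
      result.gpuRenderer = gl.getParameter(info.UNMASKED_RENDERER_WEBGL);
    }
  }
  return result;
}
```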
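
A sketch that looks for the bot and automation artifacts mentioned above (PhantomJS, JSDOM, WebDriver). The property names come straight from the list; real detectors check many more, and modern bots actively hide them.

```javascript
// Sketch: known artifacts of specific bots and automation tools.
function checkAutomationArtifacts() {
  return {
    // PhantomJS leaks
    phantom: '_phantom' in window || 'callPhantom' in window,
    // JSDOM-based bots
    jsdom: '__stopAllTimers' in window,
    // WebDriver / Selenium traces (including the standardized navigator.webdriver flag)
    webdriverFlag: navigator.webdriver === true,
    webdriverAttr: document.documentElement.getAttribute('webdriver') !== null,
    ieWebdriver: '__webdriver_script_fn' in document
  };
}
```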
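
The Same Origin Policy check really is trivial: accessing a cross-origin document must throw in any real browser. example.org here is just a stand-in for some cross-origin page.

```javascript
// Sketch: a browser with the Same Origin Policy disabled is almost certainly a bot.
function checkSameOriginPolicy(callback) {
  const frame = document.createElement('iframe');
  frame.style.display = 'none';
  frame.src = 'https://example.org/';
  frame.onload = function () {
    let sopDisabled;
    try {
      // This cross-origin access must throw a SecurityError
      void frame.contentWindow.document.body;
      sopDisabled = true;
    } catch (e) {
      sopDisabled = false;
    }
    frame.remove();
    callback({ sopDisabled });
  };
  document.body.appendChild(frame);
}
```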
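
And finally, a rendering and UI sketch: scrollbar and window-chrome dimensions, Firefox’s paint counter, and a rough FPS estimate from requestAnimationFrame. The 500 ms sampling window is an arbitrary choice for illustration.

```javascript
// Sketch: is anything actually being drawn on screen?
function checkRendering(callback) {
  const result = {
    // Headless browsers often report zero-width scrollbars and no window chrome
    scrollbarWidth: window.innerWidth - document.documentElement.clientWidth,
    chromeWidth: window.outerWidth - window.innerWidth,
    chromeHeight: window.outerHeight - window.innerHeight,
    // Firefox-only counter of paint operations
    mozPaintCount: window.mozPaintCount
  };

  // Rough FPS estimate: a browser that never paints may fire
  // requestAnimationFrame callbacks abnormally fast, slowly, or not at all.
  let frames = 0;
  const start = performance.now();
  (function tick() {
    frames++;
    if (performance.now() - start < 500) {
      requestAnimationFrame(tick);
    } else {
      result.approxFps = frames * 2; // frames counted over half a second
      callback(result);
    }
  })();
}

// Usage: checkRendering(r => navigator.sendBeacon('/fp/render', JSON.stringify(r)));
```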

OK, you get the idea by now, but this is just the tip of the iceberg. There are many, many more questions and methods to answer them; the list goes on and on, and I could continue writing all night long.

As I said before, each bot detection vendor has its own set of questions and methods to answer them, which makes up their “secret sauce”. Since they won’t discuss them transparently, both because of bot developers and because of competitors, their clients are asked to blindly trust them based on reputation (or hype) alone.

In the next post of this series, #3, I will focus on the security assessment, namely the strengths and weaknesses of browser fingerprinting, and explore an alternative approach: behavioral analysis.

Stay tuned!

1. Or training Google’s AI classifiers for free. Click those damn fire hydrants! Our self-driving car just crashed right into them!

2. Such as surveillance…

3. For those of you who want to learn more, I recommend starting with this great article.
