Bot detection 101 #1 – Preface

Bots are on fire. VCs are pouring hundreds of millions of dollars into bot detection companies, because bots cost multiple industries billions of dollars each year.

As this blog's name suggests, I will focus on bots and detection in the adtech world, where bots are used to generate “invalid traffic”, which means advertiser dollars are wasted on impressions no human will ever see.

The stakes are high for both sides, so bot development and detection technology are locked in an arms race that never stops evolving.

When a company audits its traffic using multiple bot detection vendors, it’s not uncommon to get back results with little to no overlap between impressions marked as “SIVT” [1] by the different vendors.

This situation happens because each company has its own “secret sauce” for detecting bots, which they won’t disclose, because they don’t want bot developers to know what they are looking for. For this reason, many bot detection companies like to emphasize how much “cybersecurity talent” or “military grade AI” they have instead of transparently discussing their methodologies.

In this series of posts, I will openly discuss bot detection techniques, methodologies and tricks that are common across the industry, along with their respective strengths and weaknesses.

An important distinction in the adtech context is between pre-bid and post-bid bot detection:

  • Pre-bid detection takes data from the bid request and classifies the user as bot or not, letting the advertiser decide whether to participate in the auction before spending a dime. The advantage is clear: no money is wasted on bots. The disadvantage is that fewer data points are available at this phase, so many bot-initiated requests go undetected.
  • Post-bid detection is done after an impression has been served. The advantage is that many more data points are available here, so the detection rate is better. The disadvantage is that the money has already been spent at this point, so it’s more of a “damage report” than actual protection.

 

Part 1: Pre-bid detection

In this scenario, the available information is that of a typical bid request:

  • Client’s IP
  • IDFA or AAID
  • User-Agent string
  • Referrer
  • Publisher domain or full URL
  • App ID
  • Cookies
  • Seller ID
  • Media type

The first step is checking whether the client’s IP is actually a residential IP or a data center, VPN, Tor or proxy IP. This can be done with a service like ThreatMetrix or by comparing the IP against publicly available lists such as the AWS IP ranges. The IP is also checked against a short-lived, constantly updated list of bot-associated IPs, as detected by the post-bid solution. The lookup is sometimes keyed on the IP combined with the user agent string or IDFA / AAID, since bots can compromise home networks and originate from the same IP as legitimate humans.
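As a rough illustration of the IP checks, here is a minimal Python sketch. The ranges and the blocklist entry are made-up placeholders (documentation-only IP blocks), and the names `DATACENTER_RANGES`, `BOT_KEYS`, `is_datacenter_ip` and `is_flagged` are hypothetical; a real system would load published cloud ranges or a commercial feed and refresh the blocklist continuously.

```python
import ipaddress

# Placeholder ranges standing in for data center / cloud provider blocks.
# In practice these would be loaded from published lists or a vendor feed.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Placeholder short-lived blocklist fed by post-bid detection.
# Keyed by (IP, user agent) so humans sharing the same IP are not blocked.
BOT_KEYS = {("203.0.113.7", "BadBot/1.0")}


def is_datacenter_ip(ip: str) -> bool:
    """True if the address falls inside any known data center range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)


def is_flagged(ip: str, user_agent: str) -> bool:
    """Flag data center traffic and recently seen (IP, user agent) bot pairs."""
    return is_datacenter_ip(ip) or (ip, user_agent) in BOT_KEYS
```

Keying the blocklist on the pair rather than on the IP alone is what lets a legitimate browser behind the same home router pass through.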

Historical data such as the number of bid requests associated with this IP, its activity intervals (humans tend to sleep sometimes), its number of clicks, etc. is also taken into account, and outliers are marked as bots.
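One simple way to picture the "humans sleep" heuristic is to count how many distinct hours of the day each IP was active. The sketch below assumes the bid log has already been reduced to `(ip, hour_of_day)` pairs; the function names and the 20-hour threshold are arbitrary choices for illustration, not an industry standard.

```python
from collections import defaultdict


def active_hours_by_ip(events):
    """Group events by IP. `events` is an iterable of (ip, hour) with hour in 0..23."""
    hours = defaultdict(set)
    for ip, hour in events:
        hours[ip].add(hour)
    return hours


def flag_sleepless_ips(events, max_active_hours=20):
    """Return IPs seen in more distinct hours of the day than the threshold."""
    return {
        ip
        for ip, seen in active_hours_by_ip(events).items()
        if len(seen) > max_active_hours
    }
```

For example, an IP that sends bid requests in all 24 hours of a day would be flagged, while one active only between 08:00 and 22:00 would not.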

The same goes for the IDFA or AAID: it is checked against blacklists of known IDs associated with bots, as detected by the post-bid solution or other offline analysis methods [2].

The next step is to check whether the seller ID is listed in the ads.txt or app-ads.txt of the declared publisher domain / app ID, in order to detect domain spoofing. Domain spoofing is the practice of faking the publisher domain / URL within the bid request in order to generate higher revenue for the spoofer. Other “brand safety” checks can be run against the domain / URL, app ID and referrer, but those are not necessarily related to bot detection.
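The ads.txt check can be sketched in a few lines of Python. This assumes the file has already been fetched from the declared publisher domain and that the bid request tells us which ad system and seller account it came through; the function names and the sample domains are hypothetical. Each data record in ads.txt is a comma-separated line of ad system domain, seller account ID, relationship (DIRECT or RESELLER) and an optional certification authority ID.

```python
def parse_ads_txt(text):
    """Return the set of (ad_system_domain, seller_account_id) pairs in the file."""
    authorized = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" in line.split(",")[0]:
            continue  # skip blank lines and variable lines such as contact=...
        fields = [field.strip() for field in line.split(",")]
        if len(fields) >= 3 and fields[2].upper() in ("DIRECT", "RESELLER"):
            authorized.add((fields[0].lower(), fields[1]))
    return authorized


def is_authorized_seller(ads_txt, ad_system, seller_id):
    """True if the publisher's ads.txt lists this seller for this ad system."""
    return (ad_system.lower(), seller_id) in parse_ads_txt(ads_txt)


# Made-up file contents for illustration only.
SAMPLE_ADS_TXT = """# ads.txt for a hypothetical publisher
exchange.example, 12345, DIRECT, abc123
reseller.example, 98765, RESELLER
contact=adops@publisher.example
"""
```

A bid request claiming this publisher's domain but carrying a seller ID that is not listed for its ad system is a candidate for domain spoofing.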

So far we have covered the low-hanging fruit. If the bid request is initiated directly from the client device, which is not always the case since some publishers and platforms use server-to-server connections, we have additional layers of data that can be used to detect bots:

  • TCP/IP
  • TLS
  • HTTP

Each layer exposes a rich set of data that allows passive client fingerprinting at device-class granularity: accurately identifying the device, OS and user agent (app / browser) type and version that initiated the bid request. The general idea is that different network stack implementations vary in subtle ways, and these variations can be compiled into “signatures” that allow the identification.

  • TCP/IP fingerprint: Values in the SYN / ACK packet headers, such as the TTL, the window size and other fields, vary across different OSes and versions.
  • TLS fingerprint: In order to initiate a secure connection, the client and the server must perform a “handshake” to agree on the protocol version, algorithms, ciphers, compression config and extensions they both support and can use to establish the connection. The client sends its side of this information in the ClientHello message, which can be used as a signature for a specific user agent (app / browser) type and version.
  • HTTP fingerprint: The order, presence or absence, and values of the HTTP headers can be used to fingerprint the specific user agent (app / browser) type and version.

The widely known open source tool p0f championed the idea of TCP/IP and HTTP fingerprinting, and other tools such as JA3 exist for TLS fingerprinting.
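To make the TLS case concrete, here is a minimal Python sketch of how a JA3-style fingerprint is assembled. It assumes the ClientHello has already been parsed into lists of decimal values (parsing the raw packet is out of scope here) and that GREASE values have been filtered out beforehand; the function names and the sample values are illustrative. JA3 joins five fields with commas (TLS version, ciphers, extensions, elliptic curves, elliptic curve point formats), joins the values inside each field with dashes, and takes the MD5 of the resulting string.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from already-parsed ClientHello fields."""
    lists = (ciphers, extensions, curves, point_formats)
    parts = [str(version)] + ["-".join(str(v) for v in values) for values in lists]
    return ",".join(parts)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string, used as the lookup key."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only, not taken from a real client.
example = ja3_string(771, [4865, 4866], [0, 11, 10], [29, 23], [0])
```

Here `example` is the string `771,4865-4866,0-11-10,29-23,0`. Because the order of ciphers and extensions is preserved, two clients that support the same features but list them differently end up with different hashes.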

After generating the TLS fingerprint, it is checked against a blacklist of known bot fingerprints. The next step is identifying the client OS and user agent based on the HTTP and TCP/IP fingerprints, and comparing them to the tokens in the user agent string as declared in the bid request. A mismatch marks the bid request as initiated by a bot.
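The mismatch check can be sketched as follows, using only HTTP header order for brevity. Everything here is a simplifying assumption: the signature table is invented for illustration and does not reflect the header order of any real browser, and `declared_family` is a deliberately naive user agent parser. A production system would use a maintained signature database and combine the HTTP result with the TCP/IP one.

```python
# Invented signatures for illustration; real browsers differ.
KNOWN_HTTP_SIGNATURES = {
    "host,connection,user-agent,accept,accept-encoding,accept-language": "chrome",
    "host,user-agent,accept,accept-language,accept-encoding,connection": "firefox",
}


def header_order_signature(headers):
    """`headers` is a list of (name, value) tuples in the order received."""
    return ",".join(name.lower() for name, _ in headers)


def declared_family(user_agent):
    """Naive browser family lookup from the declared user agent string."""
    ua = user_agent.lower()
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return "unknown"


def is_mismatch(headers, declared_user_agent):
    """True only when both sides are recognized and they disagree."""
    observed = KNOWN_HTTP_SIGNATURES.get(header_order_signature(headers))
    declared = declared_family(declared_user_agent)
    return observed is not None and declared != "unknown" and observed != declared
```

Returning a mismatch only when both the observed and the declared family are recognized keeps unknown clients from being flagged by default, which trades some detection for fewer false positives.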

The methods described above are effective against the type of bot that:

  • Uses the same technology stack for all of its fake traffic, because every time it rotates the user agent string, the mismatch with the fingerprint can be detected.
  • Uses an adtech stack in which the entities receiving the bid requests assume they are sent directly from the client’s device.

Under these assumptions, the bot can evade pre-bid detection in two ways:

  • Correctly spoof the fingerprint values, or use a stack that matches the reported user agent string, either by compromising real machines with real browsers or by setting up a bot farm that automates real browsers on the proper devices in the cloud.
  • Intentionally use a server-to-server connection to the demand partner, which is possible with Prebid Server and is also very common in mobile ad exchanges. In this case, the TCP/IP and / or TLS connections are initiated from the server, and the client’s HTTP headers aren’t passed on, so all the bot needs to do is declare a non-blacklisted residential IP in its bid requests.

In the next post of this series we will dive into post-bid techniques, featuring the inevitable browser fingerprinting. Stay tuned!

1. Sophisticated invalid traffic, as defined by MRC’s Invalid Traffic Detection and Filtration Guideline

2. For example, a typical threat research lab would harvest binaries from the web / app stores and execute them inside isolated sandboxes, flagging each binary that initiates a high volume of ad-related traffic for further investigation.
