Abstract

The massive growth of online advertising has created a need for commensurate amounts of user tracking. Advertising companies track online users extensively to serve targeted advertisements. On the surface, this seems like a simple process: a tracker places a unique cookie in the user's browser, repeatedly observes the same cookie as the user surfs the web, and finally uses the accrued data to select targeted ads. However, the reality is much more complex. The rise of Real Time Bidding (RTB) has forced Advertising and Analytics (A&A) companies to collaborate more closely with one another and to exchange data about users to facilitate bidding in RTB auctions. The amount of information-sharing is further increased by how real-time auctions are implemented. During an auction, several A&A companies observe user impressions as they receive bid requests, even though only one of them eventually wins the auction and serves the advertisement. This significantly increases the privacy digital footprint of the user. Because of RTB, tracking data is not just observed by trackers embedded directly into web pages; it is also funneled through the advertising ecosystem via complex networks of exchanges and auctions. Numerous surveys have shown that web users are not fully aware of the amount of data sharing that occurs between A&A companies, and thus underestimate the privacy risks associated with online tracking. To accurately quantify users' privacy digital footprint, we need to take into account the information-sharing that happens either to facilitate RTB auctions or as a consequence of them. However, measuring these flows of tracking information is challenging. Although there is prior work on detecting information-sharing (cookie matching) between A&A companies, these studies are based on brittle heuristics that cannot detect all forms of information-sharing (e.g., server-side matching), especially under adversarial conditions (e.g., obfuscation).
This limits our view of the privacy landscape and hinders the development of effective privacy tools. The overall goal of my thesis is to understand the privacy implications of Real Time Bidding and to bridge the divide between the actual privacy landscape and our understanding of it. To that end, I propose methods and tools to accurately map information-sharing among A&A domains in the modern ad ecosystem under RTB. First, I propose a content-agnostic methodology that can detect client- and server-side information flows between arbitrary A&A domains using retargeted ads. Intuitively, this methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms. Using crawled data on 35,448 ad impressions, I show that this methodology can successfully categorize four different kinds of information-sharing behaviors between A&A domains, including cases where existing heuristic methods fail. Next, in order to capture the effects of ad exchanges during RTB auctions accurately, I isolate a list of A&A domains that act as ad exchanges during the bidding process. Identifying such A&A domains is crucial, since they can disperse user impressions to multiple other A&A domains to solicit bids. I achieve this by analyzing a transparency standard called ads.txt, which was introduced to combat ad fraud by helping ad buyers verify authorized digital ad sellers. In particular, I conduct a 15-month longitudinal study of the standard to gather a list of A&A domains that are labeled as ad exchanges (authorized sellers) by publishers in their ads.txt files. Through my analysis of the Alexa Top-100K, I observed that over 60% of the publishers who run RTB ads have adopted the ads.txt standard. This widespread adoption allowed me to explicitly identify over 1,000 A&A domains belonging to ad exchanges.
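As a rough illustration of how seller domains can be extracted from ads.txt files, the following Python sketch parses records in the comma-separated ads.txt layout (seller domain, publisher account ID, DIRECT or RESELLER relationship, optional certification authority ID) and counts how many publishers list each seller. The function names, the sample publishers, and the sample file contents are hypothetical and serve only as an example; this is not the crawler or analysis pipeline used in the thesis.

```python
from collections import Counter

VALID_RELATIONSHIPS = {"DIRECT", "RESELLER"}


def parse_ads_txt(text):
    """Return (seller_domain, account_id, relationship) tuples from ads.txt content."""
    records = []
    for raw in text.splitlines():
        # Everything after '#' is a comment in ads.txt.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            # Variable declarations (e.g. contact=...) and malformed lines.
            continue
        domain = fields[0].lower()
        account_id = fields[1]
        relationship = fields[2].upper()
        if relationship not in VALID_RELATIONSHIPS:
            continue
        records.append((domain, account_id, relationship))
    return records


def count_sellers(ads_txt_by_publisher):
    """Count how many publishers authorize each seller domain."""
    counts = Counter()
    for text in ads_txt_by_publisher.values():
        # Use a set so a seller listed twice by one publisher counts once.
        sellers = {domain for domain, _, _ in parse_ads_txt(text)}
        counts.update(sellers)
    return counts


# Hypothetical sample data: two publishers and their ads.txt contents.
sample = {
    "news.example": (
        "# authorized sellers\n"
        "google.com, pub-123, DIRECT, f08c47fec0942fa0\n"
        "exchange.example, 42, reseller\n"
        "contact=ads@news.example\n"
    ),
    "blog.example": (
        "exchange.example, 77, DIRECT # inline comment\n"
        "not a valid record\n"
    ),
}

seller_counts = count_sellers(sample)
```

Counting each seller once per publisher keeps a long ads.txt file with many account IDs for the same exchange from inflating that exchange's apparent adoption.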
Finally, I use the list of ad exchanges from ads.txt along with the information flows between A&A companies collected using my generic methodology to build an accurate model of the privacy digital footprint of web users. In particular, I use these data sources to model the advertising ecosystem in the form of a graph called an Inclusion graph. Through simulations on the Inclusion graph, I provide upper and lower estimates on the tracking information observed by A&A companies. I show that the top 10% of A&A domains observe at least 91% of an average user's browsing history under reasonable assumptions about information-sharing within RTB auctions. I also evaluate the effectiveness of blocking strategies (e.g., AdBlock Plus) and find that major A&A domains still observe 40-90% of user impressions, depending on the blocking strategy. Overall, in this dissertation, I propose new methodologies to understand the privacy implications of Real Time Bidding. The proposed methods can be used to shed light on the opaque ecosystem of programmatic advertising and enable users to gain a more accurate view of their digital footprint. Furthermore, the results of this thesis can be used to build better privacy-preserving tools or to enhance existing ones.
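To make the idea of simulating impressions over a graph concrete, here is a minimal Python sketch under strong simplifying assumptions. It treats the graph as a plain adjacency map from each domain to the domains it includes, and contrasts two hypothetical sharing assumptions: impressions are seen only by directly included A&A domains (a lower estimate), or every ad exchange also forwards the impression to all of its neighbors (an upper estimate). The graph, the domain names, and both propagation rules are illustrative assumptions and are not the simulation models defined in the thesis.

```python
from collections import deque


def observers(graph, publisher, exchanges, forward_via_exchanges=True):
    """Return the set of A&A domains that see one impression on `publisher`."""
    seen = set()
    queue = deque(graph.get(publisher, ()))
    while queue:
        node = queue.popleft()
        if node in seen or node == publisher:
            continue
        seen.add(node)
        # Assumption: only ad exchanges pass the impression on (bid requests).
        if forward_via_exchanges and node in exchanges:
            queue.extend(graph.get(node, ()))
    return seen


def observed_fraction(graph, history, exchanges, forward_via_exchanges=True):
    """Fraction of the visits in `history` observed by each A&A domain."""
    counts = {}
    for publisher in history:
        for domain in observers(graph, publisher, exchanges, forward_via_exchanges):
            counts[domain] = counts.get(domain, 0) + 1
    return {domain: count / len(history) for domain, count in counts.items()}


# Hypothetical toy graph: publishers include trackers and one exchange,
# and the exchange solicits bids from two further A&A domains.
toy_graph = {
    "pub1.example": ["exchange.example", "tracker.example"],
    "pub2.example": ["exchange.example"],
    "exchange.example": ["dsp1.example", "dsp2.example"],
}
toy_exchanges = {"exchange.example"}
toy_history = ["pub1.example", "pub2.example"]

upper = observed_fraction(toy_graph, toy_history, toy_exchanges, True)
lower = observed_fraction(toy_graph, toy_history, toy_exchanges, False)
```

In this toy setup the two bidder domains appear only in `upper`, which shows how the choice of sharing assumption changes the estimated visibility of domains that are never embedded directly in a page.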
