Abstract
Many online services, such as search engines, social media platforms, and digital marketplaces, are advertised as being available to any user, regardless of their age, gender, or other demographic factors. However, there are growing concerns that these services may systematically underserve some groups of users. In this paper, we present a framework for internally auditing such services for differences in user satisfaction across demographic groups, using search engines as a case study. We first explain the pitfalls of naïvely comparing the behavioral metrics that are commonly used to evaluate search engines. We then propose three methods for measuring latent differences in user satisfaction from observed differences in evaluation metrics. To develop these methods, we drew on ideas from the causal inference literature and the multilevel modeling literature. Our framework is broadly applicable to other online services, and provides general insight into interpreting their evaluation metrics.
Highlights
Modern search engines are complex, relying heavily on machine learning methods to optimize search results for user satisfaction
Search engines are often evaluated using metrics based on behavioral signals, but several studies have suggested that these metrics are sensitive to a variety of factors: Hassan and White [26] demonstrated that evaluation metric values vary dramatically by user; Carterette et al. [10] made a similar observation and incorporated user variability into evaluation metrics; and Borisov et al. [8] studied the degree to which metrics are sensitive to a user’s search context
Auditing search engines for equal access is much more complicated than comparing evaluation metrics for demographically binned search impressions. We addressed this challenge by proposing three methods for measuring latent differences in user satisfaction from observed differences in evaluation metrics
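To make the simplest of these ideas concrete, here is a minimal sketch of a matched comparison: impressions are grouped by search context (here, the query and an inferred intent), and metric values are compared only within contexts observed for both demographic groups, so contextual differences cannot drive the measured gap. The data, column names, and metric values below are illustrative assumptions, not taken from the paper.

```python
import pandas as pd

# Hypothetical impression log; groups, queries, intents, and metric
# values are made up for illustration.
df = pd.DataFrame({
    "query":  ["q1", "q1", "q2", "q2", "q3", "q3", "q3"],
    "intent": ["nav", "nav", "info", "info", "info", "info", "info"],
    "group":  ["A", "B", "A", "B", "A", "A", "B"],
    "metric": [0.9, 0.7, 0.5, 0.5, 0.6, 0.8, 0.6],
})

# Average the metric within each (query, intent, group) cell, then keep
# only contexts observed for both groups before comparing.
per_ctx = (df.groupby(["query", "intent", "group"])["metric"]
             .mean().unstack("group").dropna())
per_ctx["gap"] = per_ctx["A"] - per_ctx["B"]
print(per_ctx["gap"].mean())
```

Because the comparison is restricted to shared contexts, rare queries issued by only one group drop out, which mirrors the restricted applicability noted in the summary.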
Summary
Modern search engines are complex, relying heavily on machine learning methods to optimize search results for user satisfaction. One way to assess whether a search engine provides equal access is to look for differences in user satisfaction across demographic groups. Naïve comparisons can mislead, however: for example, if retirement planning queries are issued mostly by older users, then the average value of a metric across all users will underemphasize the effectiveness of the search engine on retirement planning queries. Our first method, context matching, controls for two confounding contextual differences: the query itself and the intent of the user (section 5). Because this method attempts to match users’ search contexts as closely as possible, it can only be applied to a restricted set of queries. Our second method is a multilevel model for the effect of query difficulty on evaluation metrics (section 6). This method controls for fewer confounding factors, but is more generalizable. For comparison, we used our third method to conduct an external audit of a leading competitor to Bing using publicly available data from comScore (section 8)
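The multilevel-model idea can be sketched as a mixed-effects regression: a random intercept per query absorbs query difficulty, so the fixed demographic-group coefficient estimates the latent satisfaction gap net of which queries each group tends to issue. The simulated data, column names, and the use of statsmodels below are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
queries = rng.integers(0, 20, size=n)           # 20 distinct queries
group = rng.integers(0, 2, size=n)              # demographic group 0/1
difficulty = rng.normal(0, 0.5, size=20)        # latent per-query effect

# Observed metric = baseline + true group effect (0.05)
#                 + query difficulty + noise
metric = 0.6 + 0.05 * group + difficulty[queries] + rng.normal(0, 0.1, n)
df = pd.DataFrame({"metric": metric, "group": group, "query": queries})

# Random intercept per query absorbs difficulty; the fixed "group"
# coefficient estimates the latent satisfaction gap.
model = smf.mixedlm("metric ~ group", df, groups=df["query"]).fit()
print(model.params["group"])
```

A plain difference of group means on the same data would be distorted by whichever hard or easy queries each group happened to issue; the random intercepts remove that per-query component.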