Abstract

Offline evaluation of recommender systems (RSs) mostly relies on historical data, which is often biased. The bias is a result of many confounders that affect the data collection process. In such biased data, user-item interactions are Missing Not At Random (MNAR). Measures of recommender system performance on MNAR test data are unlikely to be reliable indicators of real-world performance unless something is done to mitigate the bias. One widespread way that researchers try to obtain less biased offline evaluation is by designing new, supposedly unbiased performance metrics for use on MNAR test data. We investigate an alternative solution, a sampling approach. The general idea is to use a sampling strategy on MNAR data to generate an intervened test set with less bias — one in which interactions are Missing At Random (MAR) or, at least, one that is more MAR-like. An existing example of this approach is SKEW, a sampling strategy that aims to adjust for the confounding effect that an item’s popularity has on its likelihood of being observed. In this paper, after extensively surveying the literature on the bias problem in the offline evaluation of RSs, we propose and formulate a novel sampling approach, which we call WTD; we also propose a more practical variant, which we call WTD_H. We compare our methods to SKEW and to two baselines which perform a random intervention on MNAR data. We empirically validate for the first time the effectiveness of SKEW and we show our approach to be a better estimator of the performance that one would obtain on (unbiased) MAR test data. Our strategy benefits from high generality (e.g. it can also be employed for training a recommender) and low overheads (e.g. it does not require any learning).

Highlights

  • Offline evaluation of a recommender system is done using an observed dataset, which records interactions that occur between users and items during a given period in the operation of the recommender system

  • To analyse the difference between the various sampling strategies, we plot the distribution of the rating values of each of the intervened test sets and we compare them with that of the unbiased test set Dgt, similarly to the analysis in Marlin et al (2007) (see the sketch after these highlights)

  • Users tend to rate items that they like (Marlin et al, 2007). This difference in the rating distributions is less evident in COAT than in Webscope R3 (WBR3), and we argue that this is due to the more artificial conditions under which COAT’s Missing Not At Random (MNAR) portion was collected (Schnabel et al, 2016) compared with the MNAR portion of WBR3
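
The comparison described in the second highlight can be sketched roughly as follows. This is only an illustrative sketch, not code from the paper: it assumes each test set is held as a pandas DataFrame with a "rating" column on a 1-5 star scale, and the variable names (intervened, d_gt) are hypothetical.

    import pandas as pd
    import matplotlib.pyplot as plt

    def rating_distribution(df: pd.DataFrame) -> pd.Series:
        """Proportion of interactions carrying each rating value."""
        return df["rating"].value_counts(normalize=True).sort_index()

    def plot_rating_distributions(intervened: pd.DataFrame, d_gt: pd.DataFrame) -> None:
        """Compare an intervened test set with the unbiased test set Dgt."""
        dist = pd.DataFrame({
            "intervened test set": rating_distribution(intervened),
            "unbiased test set (Dgt)": rating_distribution(d_gt),
        })
        dist.plot.bar(rot=0)
        plt.xlabel("rating value")
        plt.ylabel("proportion of interactions")
        plt.show()

Plotting the two distributions side by side makes it easy to see how far an intervened test set still departs from the unbiased one.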


Summary

Introduction

Offline evaluation of a recommender system is done using an observed dataset, which records interactions (e.g. clicks, purchases, ratings) that occur between users and items during a given period in the operation of the recommender system. Using MNAR data in an evaluation as if it were Missing Completely At Random (MCAR) or MAR results in biased estimates of a recommender’s performance (Marlin et al, 2007): for example, such experiments tend to incorrectly reward recommenders that recommend popular items or that make recommendations to the more active users (Pradel et al, 2012; Cremonesi et al, 2010). One way to mitigate this bias is to intervene on the MNAR data by sampling from it; the sampling strategy is chosen so that the intervened test set which results from the sampling is less biased (more MAR-like) and more suitable for evaluating the recommender’s performance. One such sampling strategy is SKEW (Liang et al, 2016a): it samples user-item interactions so as to adjust for the confounding effect that an item’s popularity has on its likelihood of being observed.
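
As a concrete illustration, a popularity-adjusted intervention in the spirit of SKEW could look like the sketch below. This is a hedged sketch, not the paper's implementation: it assumes observed interactions are resampled with weights inversely proportional to the popularity of the item involved, and the DataFrame layout, column names and sample size are illustrative.

    import numpy as np
    import pandas as pd

    def skew_like_intervention(mnar_log: pd.DataFrame, size: int, seed: int = 0) -> pd.DataFrame:
        """Resample MNAR interactions into an intervened test set, down-weighting popular items."""
        # popularity of the item in each observed interaction
        item_popularity = mnar_log["item"].map(mnar_log["item"].value_counts())
        weights = 1.0 / item_popularity            # inverse item popularity
        probs = (weights / weights.sum()).to_numpy()
        rng = np.random.default_rng(seed)
        chosen = rng.choice(mnar_log.index.to_numpy(), size=size, replace=False, p=probs)
        return mnar_log.loc[chosen]

Down-weighting popular items in this way counteracts their over-representation in the observed MNAR data.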

Related work
Offline evaluation of recommender systems
The bias problem
Collection of unbiased datasets
Unbiased metrics
Intervened datasets
Our approach to debiased offline evaluation of recommender systems
Properties of a MAR dataset
Properties of an MNAR dataset
Intervened test sets
The intervention approach
WTD: weights for the sampling
WTD_H: hypothesized distributions for the weights
Experiments
Datasets
Methodology
Recommender systems
Results
Conclusions
Findings
Limitations of our study and future works