Abstract

Offline evaluation of recommender systems (RSs) mostly relies on historical data, which is often biased. The bias is a result of many confounders that affect the data collection process. In such biased data, user-item interactions are Missing Not At Random (MNAR). Measures of recommender system performance on MNAR test data are unlikely to be reliable indicators of real-world performance unless something is done to mitigate the bias. One widespread way that researchers try to obtain less biased offline evaluation is by designing new, supposedly unbiased performance metrics for use on MNAR test data. We investigate an alternative solution, a sampling approach. The general idea is to use a sampling strategy on MNAR data to generate an intervened test set with less bias — one in which interactions are Missing At Random (MAR) or, at least, one that is more MAR-like. An existing example of this approach is SKEW, a sampling strategy that aims to adjust for the confounding effect that an item’s popularity has on its likelihood of being observed. In this paper, after extensively surveying the literature on the bias problem in the offline evaluation of RSs, we propose and formulate a novel sampling approach, which we call WTD; we also propose a more practical variant, which we call WTD_H. We compare our methods to SKEW and to two baselines which perform a random intervention on MNAR data. We empirically validate for the first time the effectiveness of SKEW and we show our approach to be a better estimator of the performance that one would obtain on (unbiased) MAR test data. Our strategy benefits from high generality (e.g. it can also be employed for training a recommender) and low overheads (e.g. it does not require any learning).

Highlights

  • Offline evaluation of a recommender system is done using an observed dataset, which records interactions that occur between users and items during a given period in the operation of the recommender system

  • To analyse the difference between the various sampling strategies, we plot the distribution of the rating values of each of the intervened test sets and we compare them with that of the unbiased test set Dgt, similarly to the analysis in Marlin et al (2007) (see the sketch after these highlights)

  • Users tend to rate items that they like (Marlin et al, 2007). This difference in the rating distributions is less evident in COAT than in Webscope R3 (WBR3), and we argue that this is due to the more artificial conditions under which COAT’s Missing Not At Random (MNAR) portion was collected (Schnabel et al, 2016) compared with the MNAR portion of WBR3
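
The comparison described in the second highlight can be sketched roughly as follows. This is only an illustrative sketch, not code from the paper: it assumes each test set is held as a pandas DataFrame with a "rating" column on a 1-5 star scale, and the variable names (intervened, d_gt) are hypothetical.

    import pandas as pd
    import matplotlib.pyplot as plt

    def rating_distribution(df: pd.DataFrame) -> pd.Series:
        """Proportion of interactions carrying each rating value."""
        return df["rating"].value_counts(normalize=True).sort_index()

    def plot_rating_distributions(intervened: pd.DataFrame, d_gt: pd.DataFrame) -> None:
        """Compare an intervened test set with the unbiased test set Dgt."""
        dist = pd.DataFrame({
            "intervened test set": rating_distribution(intervened),
            "unbiased test set (Dgt)": rating_distribution(d_gt),
        })
        dist.plot.bar(rot=0)
        plt.xlabel("rating value")
        plt.ylabel("proportion of interactions")
        plt.show()

Plotting the two distributions side by side makes it easy to see how far an intervened test set still departs from the unbiased one.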


Summary

Introduction

Offline evaluation of a recommender system is done using an observed dataset, which records interactions (e.g. clicks, purchases, ratings) that occur between users and items during a given period in the operation of the recommender system. Using MNAR data in an evaluation as if it were Missing Completely At Random (MCAR) or MAR results in biased estimates of a recommender’s performance (Marlin et al, 2007): for example, such experiments tend to incorrectly reward recommenders that recommend popular items or that make recommendations to the more active users (Pradel et al, 2012; Cremonesi et al, 2010). One way to mitigate this bias is to intervene on the MNAR data by sampling from it; the sampling strategy is chosen so that the intervened test set which results from the sampling is less biased (more MAR-like) and more suitable for evaluating the recommender’s performance. One such sampling strategy is SKEW (Liang et al, 2016a): it samples user-item interactions so as to adjust for the confounding effect that an item’s popularity has on its likelihood of being observed.
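
As a concrete illustration, a popularity-adjusted intervention in the spirit of SKEW could look like the sketch below. This is a hedged sketch, not the paper's implementation: it assumes observed interactions are resampled with weights inversely proportional to the popularity of the item involved, and the DataFrame layout, column names and sample size are illustrative.

    import numpy as np
    import pandas as pd

    def skew_like_intervention(mnar_log: pd.DataFrame, size: int, seed: int = 0) -> pd.DataFrame:
        """Resample MNAR interactions into an intervened test set, down-weighting popular items."""
        # popularity of the item in each observed interaction
        item_popularity = mnar_log["item"].map(mnar_log["item"].value_counts())
        weights = 1.0 / item_popularity            # inverse item popularity
        probs = (weights / weights.sum()).to_numpy()
        rng = np.random.default_rng(seed)
        chosen = rng.choice(mnar_log.index.to_numpy(), size=size, replace=False, p=probs)
        return mnar_log.loc[chosen]

Down-weighting popular items in this way counteracts their over-representation in the observed MNAR data.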

Related work
Offline evaluation of recommender systems
The bias problem
Collection of unbiased datasets
Unbiased metrics
Intervened datasets
Our approach to debiased offline evaluation of recommender systems
Properties of a MAR dataset
Properties of an MNAR dataset
Intervened test sets
The intervention approach
WTD: weights for the sampling
WTD_H: hypothesized distributions for the weights
Experiments
Datasets
Methodology
Recommender systems
Results
Conclusions
Findings
Limitations of our study and future works