Abstract

Crowdsourcing (CS) has evolved into a mature assessment methodology for subjective experiments in diverse scientific fields, and in particular for QoE assessment. However, results acquired through CS on absolute category rating (ACR) scales are often not fully comparable to QoE assessments conducted in laboratory environments. A possible reason for such differences is the scale usage heterogeneity problem caused by deviant scale usage among crowd workers. In this paper, we study different implementations of (quality) rating scales, varying in design and number of answer categories, to identify whether certain scales can help to overcome scale usage problems in crowdsourcing. Additionally, since training of subjects is well known to enhance result quality in laboratory ACR evaluations, we analyze the appropriateness of training conditions for overcoming scale usage problems across different samples in crowdsourcing. As major results, we find that filtering of user ratings and different scale designs are not sufficient to overcome scale usage heterogeneity, whereas training sessions, despite their additional cost, enhance result quality in CS and properly counteract the identified scale usage heterogeneity problems.
