Abstract

Our work is concerned with the subjective perception of music similarity in the context of music recommendation. We present two user studies to explore inter- and intra-rater agreement in the quantification of general similarity between pieces of recommended music. In contrast to previous efforts, our test participants are of more uniform age and share a comparable musical background, which reduces variation within the participant group. The first study uses carefully curated song material from five distinct genres, while the second uses songs from a single genre only; almost all songs in both studies were previously unknown to the test participants. Repeating the listening tests with a two-week lag shows that intra-rater agreement is higher than inter-rater agreement in both studies. Agreement in the single-genre study is lower, since the genre of songs seems to be a major factor in judging similarity between songs. The mood of raters at test time is found to influence intra-rater agreement. We discuss the impact of our results on the evaluation of music recommenders and question the validity of experiments on general music similarity.

Highlights

  • The automatic recommendation of music or creation of playlists is one of the successful applications of Music Information Retrieval (MIR) and is commonplace in music streaming services like Spotify, Deezer, Pandora and Tidal

  • Study B is closer to a real-life music recommendation scenario with all songs belonging to a single genre

  • The 15 correlations between the six graders range from 0.59 to 0.86, with an average of 0.73 at t1 and 0.75 at t2. This is considerably higher than ρAMS = 0.40 which was reported for the MIREX AMS task 2006 (Flexer and Grill, 2016)

Summary

Introduction

The automatic recommendation of music or creation of playlists is one of the successful applications of Music Information Retrieval (MIR) and is commonplace in music streaming services like Spotify, Deezer, Pandora and Tidal. These services recommend music which is in some way similar to what users have been listening to previously. Previous research (Jones et al, 2007; Ni et al, 2013; Schedl et al, 2013; Flexer and Grill, 2016; Koops et al, 2019) has made it clear that human perception and experience of music similarity is highly subjective, with low inter-rater agreement. This is true for the perception of general music similarity, i.e. when listeners are asked to evaluate similarity between songs without any more specific explanation of which aspects of the music they should focus on. Since it is not meaningful to have computational models that go beyond the level of human agreement, such levels of inter-rater agreement present a natural upper bound for any algorithmic approach (Smith and Chew, 2013; Nieto et al, 2014; Serra et al, 2014; Flexer and Grill, 2016; Koops et al, 2019).
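To make the notions of inter- and intra-rater agreement concrete, the following is a minimal sketch of one way they could be computed from similarity ratings. It assumes Pearson correlation as the agreement measure and averages it over all rater pairs (inter-rater) or over each rater's two sessions (intra-rater). The function names and the dictionary layout are hypothetical and chosen only for illustration; the excerpt above does not specify the exact procedure used in the studies.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def inter_rater_agreement(ratings):
    """Mean pairwise correlation between raters within one session.

    ratings: dict mapping a rater id to that rater's list of similarity
    scores, given for the same song pairs in the same order (assumed layout).
    Returns the mean correlation and the number of rater pairs.
    """
    pairs = list(combinations(sorted(ratings), 2))
    corrs = [pearson(ratings[a], ratings[b]) for a, b in pairs]
    return sum(corrs) / len(corrs), len(corrs)


def intra_rater_agreement(ratings_t1, ratings_t2):
    """Mean correlation of each rater with themselves across two sessions."""
    corrs = [pearson(ratings_t1[r], ratings_t2[r]) for r in ratings_t1]
    return sum(corrs) / len(corrs)
```

With six raters, `inter_rater_agreement` averages over 6 choose 2 = 15 pairs, which matches the number of correlations quoted in the highlights. Under this reading, the mean inter-rater value is the figure that would serve as the upper bound for an algorithmic approach.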
