Automated music emotion recognition (MER) is a challenging task in Music Information Retrieval with wide-ranging applications. Some recent studies pose MER as a continuous regression problem in the Arousal-Valence (AV) plane. These studies are variations on a common architecture: a universal model of emotional response, a shared repertoire of low-level audio features, a bag-of-frames approach to audio analysis, and relatively small data sets. These approaches achieve some success at MER and suggest that further improvements are possible with current technology. Our contribution to the state of the art is to examine just how far one can go within this framework and to identify its limitations. We present the results of a systematic study conducted in an attempt to maximize the prediction performance of an automated MER system using the architecture described. We begin with a carefully constructed data set, emphasizing quality over quantity. We address affect induction rather than affect attribution. We consider a variety of algorithms at each stage of the training process, from preprocessing to feature selection and model selection, and we report the results of extensive testing. We found that: (1) none of the variations we considered leads to a substantial improvement in performance, which we present as evidence of a limit on what is achievable under this architecture, and (2) the small data sets commonly used in the MER literature limit the possibility of improving the set of features used in MER, due to the phenomenon of Subset Selection Bias. We conclude with some proposals for advancing the state of the art.
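The Subset Selection Bias mentioned in finding (2) can be illustrated with a small simulation (a sketch, not taken from the paper): when candidate features are selected on the same small sample used to score them, even pure-noise features appear predictive, while an evaluation on fresh data reveals chance-level performance. All names, sizes, and data below are hypothetical.

```python
# Hypothetical illustration of Subset Selection Bias.
# With many candidate features and few samples, the feature that best
# "predicts" the labels on the selection set looks good even though
# every feature here is an independent coin flip (pure noise).
import random

random.seed(0)

N_TRAIN, N_TEST, N_FEATURES = 20, 1000, 200  # small set, many candidates

def coin(n):
    """n independent fair coin flips (0/1)."""
    return [random.randint(0, 1) for _ in range(n)]

# Labels and all candidate features are mutually independent noise.
train_y = coin(N_TRAIN)
train_X = [coin(N_TRAIN) for _ in range(N_FEATURES)]

def match_rate(feature, labels):
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

# Select the single noise feature (possibly inverted) that best matches
# the labels on the small selection set.
best_acc, best_flip = 0.0, False
for feat in train_X:
    a = match_rate(feat, train_y)
    for flip, acc in ((False, a), (True, 1 - a)):
        if acc > best_acc:
            best_acc, best_flip = acc, flip

train_acc = best_acc  # optimistic: selection and evaluation share data

# Honest evaluation: fresh samples of the same (noise) feature and labels.
test_y = coin(N_TEST)
test_feat = coin(N_TEST)
pred = [1 - f if best_flip else f for f in test_feat]
test_acc = sum(p == y for p, y in zip(pred, test_y)) / N_TEST

print(f"accuracy on the selection set: {train_acc:.2f}")  # well above 0.5
print(f"accuracy on held-out data:     {test_acc:.2f}")   # near chance
```

The gap between the two numbers is the bias itself: with 200 candidates and only 20 samples, the best match on the selection set is almost certainly well above chance, so feature rankings computed on a small data set cannot be trusted to generalize.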