Abstract

Music emotion analysis has been an ever-growing field of research in music information retrieval. To solve the cold-start problem of content-based recommendation systems, a method for automatic music labeling is needed. Thanks to recent advances, neural networks can be used to extract audio features for a wide variety of tasks. When humans listen to a song, it is the music or the lyrics that touch the heart most. This study therefore predicts the type of music emotion from both the audio signal and the lyrics. For model building, convolutional neural networks (CNNs) are applied to the audio signals and natural language processing (NLP) models to the lyrics. A new dataset, ABP, is compiled from three datasets of Western pop music in which each song carries human-judged valence and arousal values. The type of music emotion is categorized by the four quadrants formed by the valence and arousal axes. The experiments confirm that using audio and lyrics information together to classify the emotions of songs yields better classification performance than the audio-only learning methods of previous studies. Compared with a related work, this study improves the accuracy of the audio model and the lyrics model by 8–16%.
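The quadrant categorization mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the assumption that valence and arousal annotations are centered on zero are hypothetical.

```python
# Hypothetical sketch: mapping (valence, arousal) annotations to the four
# emotion quadrants formed by the valence and arousal axes. The zero-centered
# value range and the quadrant numbering are assumptions for illustration.

def emotion_quadrant(valence: float, arousal: float) -> int:
    """Return the quadrant (1-4) of the valence-arousal plane.

    Q1: positive valence, high arousal (e.g. happy/excited)
    Q2: negative valence, high arousal (e.g. angry/tense)
    Q3: negative valence, low arousal  (e.g. sad/depressed)
    Q4: positive valence, low arousal  (e.g. calm/relaxed)
    """
    if valence >= 0 and arousal >= 0:
        return 1
    if valence < 0 and arousal >= 0:
        return 2
    if valence < 0:
        return 3
    return 4

print(emotion_quadrant(0.8, 0.6))    # a happy, energetic song
print(emotion_quadrant(-0.5, -0.4))  # a sad, low-energy song
```

In a classification setting, these quadrant indices would serve as the four target labels predicted from the fused audio and lyrics features.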
