Abstract

Automatic music recommendation systems based on human emotions are becoming increasingly popular. Since audio and lyrics each provide rich information about a song, this paper explores a fusion model that combines both modalities to enhance classification accuracy. The main objective is to address music mood classification by applying several attention mechanisms, namely self-attention (SA), channel attention (CA), and a hierarchical attention network (HAN), within a multi-modal classification system. Experimental results show that multi-modal architectures with attention achieve higher accuracy than both multi-modal architectures without attention and uni-modal architectures. Motivated by the performance of these attention mechanisms, a new HAN-CA-SA based multi-modal classification architecture is proposed, which achieves an accuracy of 82.35%. ROC and Kappa scores are also computed to assess the efficacy of the proposed model, and the model is further evaluated using K-fold cross-validation. Its performance is compared with that of XLNet and CNN-BERT systems. In addition, McNemar's statistical hypothesis test is conducted to reaffirm the significance of the proposed approach.
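The abstract does not give implementation details of the HAN-CA-SA architecture, so the following is only a minimal sketch of the general idea it describes: an audio branch with self-attention and a lyrics branch with HAN-style word-level attention, fused for mood classification. All layer sizes, the four mood classes, and the module names are assumptions, and channel attention is omitted for brevity; this is not the authors' implementation.

```python
# Illustrative sketch (assumed architecture, not the paper's code): fuse an
# attention-equipped audio branch with an attention-equipped lyrics branch.
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Audio branch: 1-D conv over spectrogram frames + self-attention (SA)."""
    def __init__(self, n_mels=64, d_model=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, spec):                                   # spec: (batch, frames, n_mels)
        h = self.conv(spec.transpose(1, 2)).transpose(1, 2)    # (batch, frames, d_model)
        h, _ = self.self_attn(h, h, h)                         # self-attention over time frames
        return h.mean(dim=1)                                   # pooled audio embedding

class LyricsBranch(nn.Module):
    """Lyrics branch: BiGRU over word embeddings + additive word-level attention (HAN-style)."""
    def __init__(self, vocab_size=10000, d_emb=100, d_hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.gru = nn.GRU(d_emb, d_hidden, batch_first=True, bidirectional=True)
        self.attn_score = nn.Linear(2 * d_hidden, 1)

    def forward(self, tokens):                                 # tokens: (batch, words)
        h, _ = self.gru(self.emb(tokens))                      # (batch, words, 2*d_hidden)
        alpha = torch.softmax(self.attn_score(h), dim=1)       # word-level attention weights
        return (alpha * h).sum(dim=1)                          # attention-weighted lyrics embedding

class MultiModalMoodClassifier(nn.Module):
    """Late fusion of audio and lyrics embeddings followed by a mood classifier."""
    def __init__(self, n_classes=4):                           # 4 mood classes is an assumption
        super().__init__()
        self.audio = AudioBranch()
        self.lyrics = LyricsBranch()
        self.classifier = nn.Sequential(
            nn.Linear(128 + 128, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, spec, tokens):
        fused = torch.cat([self.audio(spec), self.lyrics(tokens)], dim=1)
        return self.classifier(fused)

# Example forward pass with random data: 8 clips, 200 spectrogram frames, 50 lyric tokens.
model = MultiModalMoodClassifier()
logits = model(torch.randn(8, 200, 64), torch.randint(0, 10000, (8, 50)))
print(logits.shape)  # torch.Size([8, 4])
```

The design choice illustrated here is late fusion: each modality is reduced to a fixed-size embedding by its own attention mechanism before concatenation, which is one common way to combine audio and lyrics in mood classification.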
