Abstract

A folk music segment classification system is proposed that uses multimodal fusion of acoustic features, textual information, and a duration-based feature on a Thiruvathirakali music corpus. Acoustic features are learned from musical texture features (MTF) using a long short-term memory (LSTM) model. A term frequency-inverse document frequency (TF-IDF) model is employed to derive text-based features from transcription data. For multimodal fusion, the LSTM-derived features, TF-IDF features, and the duration feature are integrated early, at the feature level. The LSTM model is further optimised through frame fusion in the temporal domain, which is seen to increase classification efficiency by 13 percent and to reduce computational expense tenfold. With frame fusion, the system reports an overall precision, recall, and F1 measure of 0.53, 0.52, and 0.51 respectively, outperforming a baseline SVM classifier. Classification efficiency improves by 15 percentage points (absolute) with the addition of each multimodal component; with complete multimodal fusion, the metrics rise to 0.83, 0.78, and 0.80 respectively.
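The pipeline described above can be illustrated in code. The sketch below is a minimal, hypothetical rendering of the three branches and the early-fusion step, using placeholder shapes and dummy data (random MTF frames, toy transcripts, a tenfold frame-fusion factor); it is not the authors' implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed sizes: 4 segments, 50 MTF frames per segment, 20-dim MTF,
# 32-dim LSTM state, frame-fusion factor 10 (all hypothetical).
n_segments, n_frames, mtf_dim, lstm_dim, fuse = 4, 50, 20, 32, 10
mtf_frames = torch.randn(n_segments, n_frames, mtf_dim)  # dummy MTF input

# Frame fusion in the temporal domain: average each group of `fuse`
# adjacent frames, shrinking the sequence tenfold before the LSTM.
mtf_fused = mtf_frames.reshape(
    n_segments, n_frames // fuse, fuse, mtf_dim
).mean(dim=2)

# Acoustic branch: LSTM over the frame-fused MTF sequence; the final
# hidden state serves as the learned acoustic feature per segment.
lstm = nn.LSTM(input_size=mtf_dim, hidden_size=lstm_dim, batch_first=True)
_, (h_n, _) = lstm(mtf_fused)
acoustic = h_n[-1].detach().numpy()          # shape: (n_segments, lstm_dim)

# Text branch: TF-IDF over segment transcriptions (toy examples here).
transcripts = ["vocal refrain", "instrumental bridge",
               "chorus line", "closing verse"]
tfidf = TfidfVectorizer().fit_transform(transcripts).toarray()

# Duration branch: one scalar per segment (dummy values).
durations = np.random.rand(n_segments, 1)

# Early (feature-level) fusion: concatenate all modalities per segment.
fused = np.hstack([acoustic, tfidf, durations])
print(fused.shape)  # each row is one segment's fused feature vector
```

Each row of `fused` would then feed the downstream segment classifier; the tenfold sequence reduction from frame fusion is what accounts for the reported drop in computational expense.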
