Abstract

Using pose estimation on video recordings, we apply an action recognition machine learning algorithm to demonstrate how movement information can be used to classify singers and the ragas (melodic modes) they perform. Movement information is derived from a specially recorded video dataset of solo Hindustani (North Indian) raga recordings by three professional singers, each performing the same nine ragas, a smaller duo dataset (one singer with tabla accompaniment), and recordings of concert performances by the same singers. Data are extracted using pose estimation algorithms, both 2D (OpenPose) and 3D. A two-pathway convolutional neural network architecture for skeleton action recognition is proposed and trained to classify 12-second clips by singer and raga. The model is capable of distinguishing the three singers on the basis of movement information alone. For each singer, it distinguishes between the nine ragas with a mean accuracy of 38.2% (for the most successful model). The model trained on solo recordings also proved effective at classifying duo and concert recordings. These findings are consistent with the view that while the gesturing of Indian singers is idiosyncratic, it remains tightly linked to patterns of melodic movement: indeed, we show that in some cases different ragas are distinguishable on the basis of movement information alone. A series of technical challenges is identified and addressed, and code is shared alongside the audiovisual data accompanying the paper.
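
The abstract does not specify the network in detail. As a rough illustration only, the following is a minimal sketch, assuming PyTorch, of one plausible two-pathway skeleton classifier: one pathway consumes raw keypoint coordinates and a second consumes frame-to-frame coordinate differences, with the two feature streams concatenated before classification. The class name, layer sizes, 25-joint input (matching OpenPose's BODY_25 layout), and 300-frame clip length are illustrative assumptions, not the paper's actual model.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoPathwaySkeletonCNN(nn.Module):
    """Illustrative two-pathway CNN over pose keypoint sequences (hypothetical)."""

    def __init__(self, n_coords=2, n_joints=25, n_classes=9):
        super().__init__()

        def pathway():
            # Treat each clip as a (coords, frames, joints) "image".
            return nn.Sequential(
                nn.Conv2d(n_coords, 32, kernel_size=(9, 3), padding=(4, 1)),
                nn.ReLU(),
                nn.MaxPool2d((2, 1)),          # downsample in time only
                nn.Conv2d(32, 64, kernel_size=(9, 3), padding=(4, 1)),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),       # global pool over time and joints
            )

        self.joint_path = pathway()    # pathway 1: raw keypoint coordinates
        self.motion_path = pathway()   # pathway 2: frame-to-frame differences
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, x):
        # x: (batch, coords, frames, joints), e.g. one 12-second clip per row.
        motion = x[:, :, 1:] - x[:, :, :-1]    # temporal differences
        motion = F.pad(motion, (0, 0, 1, 0))   # pad one frame to restore length
        feats = torch.cat(
            [self.joint_path(x).flatten(1), self.motion_path(motion).flatten(1)],
            dim=1,
        )
        return self.classifier(feats)


# Example usage: classify a batch of 12-second clips (assumed 25 fps, 25 joints).
model = TwoPathwaySkeletonCNN()
clips = torch.randn(4, 2, 300, 25)   # 4 clips, 2D keypoints, 300 frames, 25 joints
logits = model(clips)                # shape (4, 9): scores over the nine ragas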
