Abstract

This study focuses on identifying primary instruments in musical audio using an adapted Wav2Vec 2.0 model, originally designed for extracting speech features from raw audio. The model's convolutional layers and transformer component were modified to support the recognition of instruments in complex audio mixes. Instrument recognition is framed as a multi-label classification problem. The model's effectiveness is measured through accuracy, precision, recall, F1-score, and analysis via a confusion matrix. Key findings reveal the model's differential efficiency in recognising various instruments, with notable success in detecting violins, pianos, saxophones, and human voices. However, the model struggles with instruments that have a narrower dynamic range or lower volume, such as the organ, which often provides harmonic support, and with instruments that are scarcely represented in the data, such as the cello and clarinet. The research also indicates that while pre-separation of certain instruments, such as guitars, may enhance recognition, it may not be necessary for others.
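The abstract frames instrument recognition as a multi-label classification task evaluated with precision, recall, and F1-score. As a minimal illustrative sketch (not the paper's code, and the paper's exact averaging scheme is not stated in the abstract), micro-averaged metrics over per-clip binary label vectors can be computed like this:

```python
# Hypothetical sketch of multi-label evaluation; label layout and data are invented.
# y_true / y_pred: lists of binary vectors, one per audio clip,
# where each position marks whether an instrument is present (1) or absent (0).

def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (clip, label) pairs."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        for t, p in zip(truth, pred):
            tp += int(t == 1 and p == 1)  # instrument present and detected
            fp += int(t == 0 and p == 1)  # falsely detected
            fn += int(t == 1 and p == 0)  # present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Example with an assumed label order [violin, piano, voice]:
p, r, f = micro_metrics(
    y_true=[[1, 1, 0], [0, 1, 1]],
    y_pred=[[1, 0, 0], [0, 1, 1]],
)
```

Micro-averaging pools counts across all labels, so frequent instruments dominate the score; a macro average (per-label metrics averaged equally) would instead highlight the weak performance on rare classes such as the cello and clarinet mentioned above.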
