Abstract

In this paper, music genre classification with deep neural networks is explored. A Convolutional Neural Network (CNN) was trained to identify 10 different music genres. Using the GTZAN dataset, 1000 audio files were split into training and test sets. Accuracy, loss, and run time were monitored for each epoch during training and testing. By comparing the accuracy and speed of an optimized CNN with the results of a human baseline, we test the hypothesis that the deep neural network will outperform the baseline. We transformed each audio sample into an image by combining the Short-Time Fourier Transform (STFT) with the Mel-frequency spectrum, which improves the efficiency and accuracy of our model. Our results suggest that the CNN model is far more efficient than the human baseline. The human baseline correctly classified 43.3% of the samples, while the CNN achieved an accuracy of 98% in 37 s on the training set and 68.7% in 36 s on the test set. This study shows that deep learning is an efficient tool for classifying music genres, since the network's accuracy exceeds that of the human baseline. By using image recognition to search for similar patterns within each audio sample, a neural network can more quickly and accurately sort music samples into the correct genre. There is still room for improvement in the model itself, but the experimental data shows the potential of CNNs for music genre classification and recommendation.
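The audio-to-image preprocessing described above (an STFT-based Mel spectrogram computed per clip) can be sketched as follows. The library choice (librosa), the parameter values, and the example file path are assumptions for illustration, not the paper's exact setup:

```python
# Minimal sketch of converting a GTZAN clip to a log-scaled Mel spectrogram
# "image" for CNN input. Parameters below are common defaults, not the
# paper's reported configuration.
import numpy as np
import librosa

def clip_to_mel_image(path, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    """Load an audio clip and return a log-scaled Mel spectrogram (2-D array)."""
    y, sr = librosa.load(path, sr=sr)                # decode audio at a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=n_fft,                                 # STFT window size
        hop_length=hop_length,                       # STFT hop between frames
        n_mels=n_mels)                               # number of Mel filterbank bands
    return librosa.power_to_db(mel, ref=np.max)      # log scale for CNN input

# Hypothetical usage on one 30-second GTZAN clip:
# image = clip_to_mel_image("genres/blues/blues.00000.wav")
# image.shape -> (128, ~1293) at the defaults above
```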
