Abstract

Supervised machine learning relies on the availability of large annotated datasets. This is essential since small datasets generally lead to overfitting when training high-dimensional machine-learning models. Since the manual annotation of such large datasets is a long, tedious and expensive process, another possibility is to artificially increase the size of the dataset. This is known as data augmentation. In this paper we provide an in-depth analysis of two data augmentation methods: sound transformations and sound segmentation. The first transforms a music track into a set of new music tracks by applying processes such as pitch-shifting, time-stretching or filtering. The second splits a long sound signal into a set of shorter time segments. We study the effect of these two techniques (and of their parameters) on a genre classification task using public datasets. The main contribution of this work is to detail by experimentation the benefit of these methods, used alone or together, during training and/or testing. We also demonstrate their use in improving robustness to potentially unknown sound degradations. Based on these results, we provide good-practice recommendations.
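The two augmentation methods can be sketched in a few lines of NumPy. The following is a minimal, illustrative sketch, not the authors' implementation: `segment` splits a signal into fixed-length excerpts, and `naive_pitch_shift` approximates pitch-shifting by resampling (which also changes duration; production systems would use a phase-vocoder approach instead). All function names and parameter choices here are hypothetical.

```python
import numpy as np

def segment(signal, segment_len, hop=None):
    """Split a 1-D signal into fixed-length segments.

    hop defaults to segment_len, i.e. non-overlapping segments.
    (Hypothetical helper, for illustration only.)
    """
    hop = hop or segment_len
    n = 1 + max(0, len(signal) - segment_len) // hop
    return [signal[i * hop : i * hop + segment_len] for i in range(n)]

def naive_pitch_shift(signal, semitones):
    """Crude pitch shift by linear-interpolation resampling.

    Note: this also stretches/compresses duration; real augmentation
    pipelines use dedicated transforms (e.g. a phase vocoder).
    """
    rate = 2.0 ** (semitones / 12.0)
    idx = np.arange(0, len(signal), rate)
    return np.interp(idx, np.arange(len(signal)), signal)

# Example: a 4-second 440 Hz sine "track" at 8 kHz
sr = 8000
x = np.sin(2 * np.pi * 440 * np.arange(4 * sr) / sr)

segments = segment(x, segment_len=sr)   # four 1-second excerpts
shifted = naive_pitch_shift(x, +2)      # two semitones up (shorter signal)
```

Each segment and each transformed track inherits the label of the original track, which is how the training set is enlarged without new annotation.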

Highlights

  • A common task in Music Information Retrieval (MIR) is the prediction of metadata based on the music signal content itself, e.g. in audio classification, musical structure segmentation, tempo prediction, fundamental frequency estimation

  • When only a few examples can be manually annotated, we investigate the use of data augmentation, which artificially increases the size of the training dataset

  • The purpose of this paper is to demonstrate the benefit of data augmentation and not to achieve the highest possible recognition rate for musical genre classification


Introduction

A common task in Music Information Retrieval (MIR) is the prediction of metadata from the music signal content itself, e.g. in audio classification, musical structure segmentation, tempo prediction, or fundamental frequency estimation. Whereas some methods rely on known signal properties that can be evaluated directly with dedicated algorithms, other techniques require a number of annotated examples to learn automatically the discriminant characteristics that solve the given problem: this is called supervised training. Using too small or unrepresentative datasets usually leads to overfitting when the prediction methods have a high level of complexity. In this case, the trained models may focus on sound properties which discriminate the few given examples but which are irrelevant in general.
