Abstract

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. Recent SSL methods have in common that they rely strongly on the augmentation of unannotated data, an aspect that remains largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks covering music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNNs) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only for the most challenging dataset, from acoustic scene classification, showing that there is still room for improvement.

Highlights

  • Recent advances in deep learning have resulted in improved performance for many classification tasks

  • We propose a novel method to select the augmentation techniques used during training, as this choice was shown to be critical in a previous study [3] as well as in our experiments

  • Our results showed that the selection of augmentation methods is critical for FixMatch (FM)


Summary

Introduction

Recent advances in deep learning have resulted in improved performance for many classification tasks. Such improvements often come at the cost of requiring large annotated datasets and increasingly larger models. While datasets with the amount of annotated data required to train these models are not always available, unlabeled data can often be obtained. In the field of Acoustic Scene Classification (ASC), for example, edge devices can record large quantities of data at low additional cost. The same holds true for Industrial Sound Analysis (ISA) applications, where acoustic quality control systems can record the observed production process for long periods of time. In the field of Music Information Retrieval (MIR), vast amounts of music recordings can be collected for a given classification task from existing music collections.

