Abstract

In training a deep learning system to perform audio transcription, two practical problems may arise. Firstly, most datasets are weakly labelled, providing only a list of the events present in each recording, without any temporal information, for training. Secondly, deep neural networks need a very large amount of labelled training data to perform well, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose factorising the final task of audio transcription into multiple intermediate tasks in order to improve training performance when dealing with low-resource datasets of this kind. We evaluate three data-efficient approaches to training a stacked convolutional and recurrent neural network for the intermediate tasks. Our results show that the different training methods have different advantages and disadvantages.
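
As a rough illustration of the model family the abstract refers to, the sketch below shows a stacked convolutional and recurrent tagger that outputs frame-level event probabilities and pools them over time into a clip-level tag, so it can be trained from weak labels. The framework (PyTorch) and all layer sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    """Illustrative stacked convolutional-recurrent tagger (sizes are placeholders)."""

    def __init__(self, n_mels=40, n_classes=1):
        super().__init__()
        # Convolutional front end: pool only along frequency so the time axis
        # keeps its resolution for frame-level event detection.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        # Recurrent layer models temporal context across frames.
        self.gru = nn.GRU(64 * (n_mels // 4), 64, batch_first=True, bidirectional=True)
        self.frame_out = nn.Linear(128, n_classes)

    def forward(self, x):                          # x: (batch, 1, time, n_mels) log-mel input
        h = self.conv(x)                           # (batch, 64, time, n_mels // 4)
        h = h.permute(0, 2, 1, 3).flatten(2)       # (batch, time, features)
        h, _ = self.gru(h)
        frame_probs = torch.sigmoid(self.frame_out(h))    # frame-level detections
        clip_probs = frame_probs.max(dim=1).values         # pool over time -> clip-level tag
        return frame_probs, clip_probs
```

Max pooling over time is only one way to map frame predictions to a clip-level tag; mean or attention pooling are common alternatives in weakly-labelled event detection.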

Highlights

  • Machine learning has experienced strong growth in recent years, due to increased dataset sizes and computational power, and to advances in deep learning methods that can learn to make predictions in extremely nonlinear problem settings [1]

  • We propose a factorisation of the final full transcription task into multiple simpler intermediate tasks of audio event detection and audio tagging in order to predict an intermediate transcription that can be used to boost the performance of the full transcription task

  • Many low-resource datasets are used for discriminating subclasses of a general class, e.g., songs of different bird species, sounds of different car engines, barking of different dog breeds, or notes produced by an instrument. These subclasses usually share common features and characteristics, so in order to achieve good performance in the audio event detection task, we propose considering all subclasses as one general class and training a single WHEN network to perform single-class event detection (see the sketch after this list)
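
A minimal sketch of the label collapsing described in the last highlight, assuming weak labels are stored as per-recording sets of tags (the function name, tag names, and recording ids below are hypothetical):

```python
def collapse_to_general_class(weak_labels, subclasses):
    """Map per-recording subclass tags to one binary 'event present' target.

    weak_labels: dict mapping recording id -> set of tags present in that clip.
    subclasses:  set of subclass tags that make up the general class.
    Returns a dict mapping recording id -> 0/1 target for a single-class
    WHEN (event detection) network.
    """
    return {rec_id: int(bool(tags & subclasses))
            for rec_id, tags in weak_labels.items()}


weak_labels = {
    "rec_001": {"species_a", "rain"},
    "rec_002": {"wind"},
    "rec_003": {"species_b"},
}
bird_species = {"species_a", "species_b", "species_c"}

print(collapse_to_general_class(weak_labels, bird_species))
# {'rec_001': 1, 'rec_002': 0, 'rec_003': 1}
```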


Introduction

Machine learning has experienced strong growth in recent years, due to increased dataset sizes and computational power, and to advances in deep learning methods that can learn to make predictions in extremely nonlinear problem settings [1]. With the growing number of publicly available audio datasets, the number of tagging labels available for them has also increased. We refer to these tagging labels, which only indicate the presence or absence of a type of event in a recording and lack any temporal information about it, as weak labels. In [3], the authors proposed a shrinking deep neural network incorporating unsupervised feature learning to handle multi-label audio tagging. In [4,5], the authors use stacked convolutional recurrent networks to perform environmental audio tagging and to tag the presence of birdsong, respectively. In [6], the authors explore two different models for end-to-end music audio tagging when a large amount of training data is available.
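
To make the distinction concrete, the snippet below contrasts a weak (clip-level) label with a strong (temporally annotated) label for a single recording; the file name, tags, and times are invented for illustration only.

```python
# Weak label: only which event types occur somewhere in the clip.
weak_label = {"recording": "clip_017.wav", "tags": ["birdsong", "rain"]}

# Strong label: the same events annotated with onset/offset times in seconds,
# which is what a full transcription system ultimately has to produce.
strong_label = {
    "recording": "clip_017.wav",
    "events": [
        {"tag": "birdsong", "onset": 1.2, "offset": 3.8},
        {"tag": "rain",     "onset": 0.0, "offset": 10.0},
    ],
}
```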
