Abstract

Audio event detection (AED) is the task of recognizing the types of audio events in an audio stream and estimating their temporal positions. AED has typically been based on fully supervised approaches that require strong labels, which specify both the presence and the temporal position of each audio event. However, strongly labeled datasets are not easily available due to the heavy cost of human annotation. Recently, weakly supervised approaches to AED have been proposed that utilize large-scale datasets with weak labels, which indicate only the occurrence of events in each recording. In this work, we introduce a deep convolutional neural network (CNN) model called DSNet, based on densely connected convolutional networks (DenseNets) and squeeze-and-excitation networks (SENets), for weakly supervised training of AED. DSNet alleviates the vanishing-gradient problem, strengthens feature propagation, and models interdependencies between channels. We also propose a structured prediction method for weakly supervised AED: we apply a recurrent neural network (RNN) based framework and a prediction smoothness cost function to capture long-term contextual information with reduced error propagation. In post-processing, conditional random fields (CRFs) are applied to account for dependencies between segments and to delineate the boundaries of audio events precisely. We evaluated the proposed models on the DCASE 2017 Task 4 dataset and obtained state-of-the-art results on both the audio tagging and event detection tasks.
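The two ingredients DSNet combines can be sketched compactly. The snippet below is a minimal, hypothetical PyTorch illustration: a squeeze-and-excitation block that models interdependencies between channels, and a densely connected layer that concatenates its output onto its input to strengthen feature propagation. The class names, growth rate, and reduction ratio are assumptions made for illustration, not the paper's exact DSNet configuration.

```python
# Illustrative sketch only; layer sizes and hyperparameters are assumed,
# not taken from the DSNet paper.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pool -> (b, c)
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # rescale feature maps channel-wise


class DenseSELayer(nn.Module):
    """One dense layer: convolve all prior features, then SE recalibration."""

    def __init__(self, in_channels: int, growth_rate: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
        )
        self.se = SEBlock(growth_rate)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.se(self.conv(x))
        # Dense connectivity: concatenate the new features onto the input,
        # so later layers see all earlier feature maps directly.
        return torch.cat([x, out], dim=1)
```

Stacking several such layers, with transition layers between blocks, would yield a DenseNet-style backbone in which every layer's output is channel-recalibrated before being propagated forward.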

Highlights

  • People experience a variety of audio events that carry meaningful information useful for human activities

  • We propose a deep convolutional network, DSNet, based on densely connected convolutional networks (DenseNets) and squeeze-and-excitation networks (SENets) for weakly supervised audio event detection (AED)

  • The results show that DSNet achieved an absolute improvement of 0.0347 in F1 score over the baseline convolutional neural network (CNN)

Introduction

People experience a variety of audio events that carry meaningful information useful for human activities. Early studies on AED proposed several approaches based on signal processing and machine learning techniques, and deep learning based methods have recently been widely developed. Most of these studies rely on fully supervised learning methods that require strongly labeled data, in which either audio event examples are directly provided or the exact time of each audio event is given. Building a large strongly labeled database is time-consuming and challenging. For these reasons, only a few large-scale audio event datasets with strong labels are publicly available.

