Abstract

Audio event detection (AED) is the task of recognizing the types of audio events in an audio stream and estimating their temporal positions. AED has typically been based on fully supervised approaches that require strong labels, which specify both the presence and the temporal position of each audio event. However, strongly labeled datasets are not easily available due to the heavy cost of human annotation. Recently, weakly supervised approaches to AED have been proposed that utilize large-scale datasets with weak labels, which indicate only the occurrence of events in each recording. In this work, we introduce a deep convolutional neural network (CNN) model called DSNet, based on densely connected convolutional networks (DenseNets) and squeeze-and-excitation networks (SENets), for weakly supervised training of AED. DSNet alleviates the vanishing-gradient problem, strengthens feature propagation, and models interdependencies between channels. We also propose a structured prediction method for weakly supervised AED: we apply a recurrent neural network (RNN) based framework and a prediction smoothness cost function to capture long-term contextual information with reduced error propagation. In post-processing, conditional random fields (CRFs) are applied to account for dependencies between segments and to delineate the boundaries of audio events precisely. We evaluated the proposed models on the DCASE 2017 Task 4 dataset and obtained state-of-the-art results on both the audio tagging and event detection tasks.
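The two ingredients DSNet combines can be sketched compactly. The snippet below is a minimal, hypothetical PyTorch illustration: a squeeze-and-excitation block that models interdependencies between channels, and a densely connected layer that concatenates its output onto its input to strengthen feature propagation. The class names, growth rate, and reduction ratio are assumptions made for illustration, not the paper's exact DSNet configuration.

```python
# Illustrative sketch only; layer sizes and hyperparameters are assumed,
# not taken from the DSNet paper.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pool -> (b, c)
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # rescale feature maps channel-wise


class DenseSELayer(nn.Module):
    """One dense layer: convolve all prior features, then SE recalibration."""

    def __init__(self, in_channels: int, growth_rate: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
        )
        self.se = SEBlock(growth_rate)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.se(self.conv(x))
        # Dense connectivity: concatenate the new features onto the input,
        # so later layers see all earlier feature maps directly.
        return torch.cat([x, out], dim=1)
```

Stacking several such layers, with transition layers between blocks, would yield a DenseNet-style backbone in which every layer's output is channel-recalibrated before being propagated forward.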

Highlights

  • People experience a variety of audio events that carry meaningful information useful for human activities

  • We propose a deep convolutional network, DSNet, based on densely connected convolutional networks (DenseNets) and squeeze-and-excitation networks (SENets) for weakly supervised audio event detection (AED)

  • The results show that DSNet achieved an absolute improvement of 0.0347 in F1 score over the baseline convolutional neural network (CNN)

Introduction

People experience a variety of audio events that carry meaningful information useful for human activities. Early studies on AED proposed several approaches based on signal processing and machine learning techniques, and deep learning based methods have recently been widely developed. Most of these studies rely on fully supervised learning methods that require strongly labeled data, in which either audio event examples are directly provided or the exact time of each audio event is given. Building a large strongly labeled database is time-consuming and challenging. For these reasons, only a few large-scale audio event datasets with strong labels are publicly available.

