Abstract

Common acoustic properties shared by different classes degrade the performance of acoustic scene classification (ASC) systems. As a result, a few confusing pairs of acoustic scenes account for a significant proportion of all misclassified audio segments. In this article, we propose adopting a knowledge distillation framework that trains deep neural networks using soft labels. Soft labels, extracted from another pre-trained deep neural network, reflect the similarity between classes that share acoustic properties. We also propose utilizing specialist models to provide additional soft labels. Each specialist model in this study is a deep neural network that concentrates on discriminating a single pair of acoustic scenes that are frequently misclassified. Self multi-head attention is explored for training specialist deep neural networks to concentrate further on the target pair of classes. The goal of this article is to train a single deep neural network that performs as well as, or better than, an ensemble of multiple models by distilling the knowledge of those models. Diverse experiments conducted on the detection and classification of acoustic scenes and events (DCASE) 2019 task 1-a dataset demonstrate that the knowledge distillation framework is effective for acoustic scene classification. Specialist models successfully decrease the number of misclassified audio segments for the target classes. The final single model, trained by the proposed knowledge distillation from several models including specialists trained with an attention mechanism, achieves a classification accuracy of 77.63%, higher than an ensemble of the baseline and multiple specialists.
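The soft-label training described above follows the widely used temperature-scaled distillation formulation. Below is a minimal PyTorch sketch, assuming the student learns from temperature-softened teacher outputs blended with the usual hard-label cross-entropy; the temperature and weighting values are illustrative, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Temperature-scaled soft-label distillation loss (a sketch;
    temperature and alpha are illustrative, not the paper's values)."""
    # Soft targets from the pre-trained teacher, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction='batchmean') * temperature ** 2
    # Standard cross-entropy against the hard scene labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

One way to fold in the specialist models' additional soft labels would be an analogous KL term over each specialist's pair-specific outputs, though the paper's exact combination is not reproduced here.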

Highlights

  • Acoustic scene classification (ASC) is a multi-class classification task that classifies an input audio segment into one of the pre-defined acoustic scenes

  • We present an overall adaptation of the knowledge distillation (KD) framework to the ASC task, together with the hypothesis and the problematic phenomenon that motivate it [24], [27]

  • In this research, we show that adapting the KD framework to the ASC task is effective, improving overall classification accuracy and lowering the number of misclassified audio segments in pairs of acoustic scenes that are frequently misclassified (see the sketch below)
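As referenced in the final highlight, the following is a minimal sketch of how a specialist model could attend to a single confusing pair. It assumes frame-level embeddings from some shared encoder; the class name `SpecialistAttentionHead`, the layer sizes, and the mean pooling over time are hypothetical, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpecialistAttentionHead(nn.Module):
    """Hypothetical specialist head: self multi-head attention over
    frame-level embeddings, pooled into two-way logits for a single
    frequently confused scene pair (e.g. airport vs. shopping_mall)."""

    def __init__(self, embed_dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(embed_dim, 2)  # one pair of scenes

    def forward(self, frames):             # frames: (batch, time, embed_dim)
        # Self-attention: queries, keys, and values are the same sequence.
        attended, _ = self.attn(frames, frames, frames)
        pooled = attended.mean(dim=1)      # average over time frames
        return self.classifier(pooled)     # 2-class logits for the pair
```

In the distillation stage, such a specialist could then supply the additional pair-specific soft labels mentioned in the abstract.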



Introduction

Acoustic scene classification (ASC) is a multi-class classification task that classifies an input audio segment into one of a set of pre-defined acoustic scenes. For studies on the ASC task, the detection and classification of acoustic scenes and events (DCASE) community provides a common platform, including annually released datasets, for researchers to study and report results [5]–[8]. Common acoustic properties that reside among different acoustic scenes degrade ASC performance [24]. For example, audio segments that contain the same babbling sound are labelled as airport or shopping_mall depending on the location of the recording. One phenomenon that these common acoustic properties evoke is that a few pairs of acoustic scenes account for a significant proportion of all misclassified audio segments.
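To make the confusing-pairs phenomenon concrete, one straightforward way to surface the dominant pairs is to rank class pairs by their off-diagonal confusion counts. The helper `top_confused_pairs` and the example counts below are hypothetical, intended only to show how the target pairs for specialist models could be identified.

```python
import numpy as np

def top_confused_pairs(conf_matrix, class_names, k=3):
    """Rank unordered scene pairs by how often they are mistaken
    for each other, i.e. by summed off-diagonal confusion counts."""
    errors = conf_matrix + conf_matrix.T   # symmetrize the confusions
    np.fill_diagonal(errors, 0)            # ignore correct predictions
    pairs = [(errors[i, j], class_names[i], class_names[j])
             for i in range(len(class_names))
             for j in range(i + 1, len(class_names))]
    return sorted(pairs, reverse=True)[:k]

# Hypothetical counts: rows are true labels, columns are predictions.
cm = np.array([[80, 15,  5],
               [12, 85,  3],
               [ 4,  2, 94]])
print(top_confused_pairs(cm, ['airport', 'shopping_mall', 'park']))
# -> [(27, 'airport', 'shopping_mall'), (9, 'airport', 'park'), ...]
```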

