Abstract

The unprecedented growth of noise pollution over the last decades has created an ever-increasing need for efficient audio enhancement technologies. Yet, the many difficulties of processing audio sources in the wild, such as handling unseen noises or suppressing specific interferences, keep audio enhancement an open challenge. In this regard, we present N-HANS (the Neuro-Holistic Audio-eNhancement System), a Python toolkit for in-the-wild audio enhancement that includes functionalities for audio denoising, source separation, and, for the first time in such a toolkit, selective noise suppression. The N-HANS architecture is designed to adapt automatically to different environmental backgrounds and speakers. This is achieved by two identical neural networks, each comprising a stack of residual blocks and each conditioned on additional speech and noise recordings through auxiliary sub-networks. Alongside a Python API, a command line interface is provided to researchers and developers, both carefully documented. Experimental results indicate that N-HANS achieves strong performance compared with existing methods while keeping audio quality high, thus ensuring reliable use in real-life applications, e.g., in-the-wild speech processing, and encouraging the development of speech-based intelligent technology.
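The conditioning scheme described above can be pictured with a short, hedged sketch. The following PyTorch code is not the actual N-HANS implementation; all class names, layer sizes, and the FiLM-style additive conditioning are illustrative assumptions. It shows a stack of residual blocks modulated by embeddings produced from auxiliary reference recordings, one encoder for a speech reference and one for a noise reference.

    # Minimal sketch (NOT the actual N-HANS code): residual blocks conditioned
    # on embeddings from auxiliary reference encoders. All names hypothetical.
    import torch
    import torch.nn as nn

    class AuxiliaryEncoder(nn.Module):
        """Encodes a reference recording (extra speech or noise) into a
        fixed-size conditioning embedding."""
        def __init__(self, n_mels: int = 64, emb_dim: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, emb_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool over time: one vector per clip
            )

        def forward(self, ref: torch.Tensor) -> torch.Tensor:
            # ref: (batch, n_mels, time) -> (batch, emb_dim)
            return self.net(ref).squeeze(-1)

    class ConditionedResBlock(nn.Module):
        """Residual block whose activations are shifted by a projection of
        the conditioning embedding (FiLM-style additive conditioning)."""
        def __init__(self, channels: int, emb_dim: int):
            super().__init__()
            self.conv1 = nn.Conv1d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv1d(channels, channels, 3, padding=1)
            self.cond = nn.Linear(emb_dim, channels)

        def forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
            h = torch.relu(self.conv1(x) + self.cond(emb).unsqueeze(-1))
            return x + self.conv2(h)  # residual connection

    class EnhancementNet(nn.Module):
        """One of the two identical networks: maps noisy features to a mask,
        guided by the reference embeddings."""
        def __init__(self, n_mels=64, channels=64, emb_dim=128, n_blocks=4):
            super().__init__()
            self.inp = nn.Conv1d(n_mels, channels, 1)
            self.blocks = nn.ModuleList(
                ConditionedResBlock(channels, emb_dim) for _ in range(n_blocks)
            )
            self.out = nn.Conv1d(channels, n_mels, 1)

        def forward(self, noisy, emb):
            h = self.inp(noisy)
            for block in self.blocks:
                h = block(h, emb)
            return torch.sigmoid(self.out(h))  # mask in [0, 1]

    # Toy forward pass: concatenate the speech (positive) and noise (negative)
    # reference embeddings into a single conditioning vector.
    speech_enc, noise_enc = AuxiliaryEncoder(), AuxiliaryEncoder()
    net = EnhancementNet(emb_dim=128)
    noisy = torch.randn(2, 64, 100)  # (batch, n_mels, frames)
    emb = torch.cat([speech_enc(torch.randn(2, 64, 100)),
                     noise_enc(torch.randn(2, 64, 100))], dim=-1)
    mask = net(noisy, emb)           # same shape as `noisy`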

Highlights

  • Noise pollution has become an inescapable limitation of today’s society

  • The results achieved for the LibriSpeech and AudioSet corpora indicate that the Neuro-Holistic Audio-eNhancement System (N-HANS) produces an audio output of reliable quality, in comparison to other systems [22, 74], in terms of speech distortion, as indicated by the levels of log spectral distortion (LSD), signal-to-distortion ratio (SDR), and Mel cepstral distortion (MCD); a minimal sketch of these metrics follows this list

  • The performance of N-HANS as a speech separation system was compared with the outcomes of two baseline models re-implemented on the VoxCeleb dataset [18]: one based on Deep Clustering (DC) [10, 11]; the other based on Conv-Tasnet [36]
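As a hedged illustration of the metrics named in the highlights, the following Python sketch implements the plain (non-permutation-aware) SDR definition and a standard LSD computation on STFT magnitudes. The function names and parameter choices (16 kHz sampling, 512-sample windows) are assumptions for illustration; the paper's exact configuration may differ, and MCD is omitted since it requires mel-cepstral extraction.

    # Textbook definitions of two of the reported metrics; not the paper's
    # exact evaluation code.
    import numpy as np
    from scipy.signal import stft

    def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
        """Signal-to-distortion ratio in dB: energy of the clean reference
        over the energy of the residual error."""
        noise = reference - estimate
        return 10.0 * np.log10(np.sum(reference**2) / (np.sum(noise**2) + 1e-12))

    def lsd(reference: np.ndarray, estimate: np.ndarray, fs: int = 16000) -> float:
        """Log spectral distortion in dB: RMS log-magnitude difference per
        frame, averaged over frames."""
        _, _, S_ref = stft(reference, fs, nperseg=512)
        _, _, S_est = stft(estimate, fs, nperseg=512)
        log_diff = 20.0 * np.log10((np.abs(S_ref) + 1e-12) / (np.abs(S_est) + 1e-12))
        return float(np.mean(np.sqrt(np.mean(log_diff**2, axis=0))))

    # Example: a lightly perturbed sine should score high SDR and low LSD.
    t = np.linspace(0, 1, 16000, endpoint=False)
    clean = np.sin(2 * np.pi * 440 * t)
    degraded = clean + 0.01 * np.random.randn(t.size)
    print(f"SDR: {sdr(clean, degraded):.1f} dB, LSD: {lsd(clean, degraded):.2f} dB")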



Introduction

Noise pollution has become an inescapable limitation of today’s society. Constantly increasing in magnitude and severity [9], environmental noise impairs human health and well-being more than ever before [15]. Background auditory interferences, such as those produced by transportation noise [38], industrial noise [1], or urban noise [79], impair humans’ cognitive [62, 73] and communicative [44, 45] skills, and limit the performance of general audio- and specific speech-driven applications, such as automatic speech recognition [40], speech emotion recognition [2, 61], and speaker verification [24, 28, 37, 56]. Two of the main procedures for enhancing audio and speech are source separation and denoising: the former aims to extract a target audio signal from a mixture of multiple overlapping signals [66]; the latter attempts to suppress the background noise [49, 78]. With the advance of artificial intelligence, neural-network-based models for audio enhancement have been presented that are efficient in source separation [18, 20, 35, 36, 68] and denoising [5, 27, 32, 34, 47, 60, 71] tasks; indeed, the performance of classic algorithms is often surpassed by artificial neural networks [7].
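As a concrete point of reference for the classic algorithms mentioned above, here is a minimal sketch of spectral subtraction, a textbook denoising method that neural approaches typically outperform. It is not part of N-HANS; the spectral floor, window size, and the assumption that the first frames are noise-only are all illustrative choices.

    # Classic magnitude spectral subtraction, with the noise spectrum
    # estimated from the first few (assumed noise-only) frames.
    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(noisy, fs=16000, nperseg=512, noise_frames=10):
        """Subtract an average noise magnitude spectrum; keep the noisy phase."""
        _, _, Z = stft(noisy, fs, nperseg=nperseg)
        mag, phase = np.abs(Z), np.angle(Z)
        noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)  # spectral floor
        _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=nperseg)
        return enhanced

    # Demo: white noise lead-in followed by a noisy tone.
    fs = 16000
    t = np.linspace(0, 1, fs, endpoint=False)
    noisy = np.concatenate([0.1 * np.random.randn(fs // 4),  # noise-only lead-in
                            np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)])
    enhanced = spectral_subtraction(noisy, fs)

Because the method relies on a stationary noise estimate and a hand-tuned floor, it degrades quickly on the unseen, non-stationary noises that motivate learned approaches such as N-HANS.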
