Whispered Speech Conversion Based on the Inversion of Mel Frequency Cepstral Coefficient Features

Qiang Zhu, Zhong Wang, Jian Zhou, Yunfeng Dou

doi:10.3390/a15020068

Open Access

https://doi.org/10.3390/a15020068

Journal: Algorithms
Publication Date: Feb 20, 2022
Citations: 2
License type: CC BY 4.0

Affiliation: Hefei Normal University, Anhui University

Abstract

A conversion method based on the inversion of Mel frequency cepstral coefficient (MFCC) features was proposed to convert whispered speech into normal speech. First, MFCC features were extracted from whispered and normal speech, and a Gaussian mixture model (GMM) was used to establish the mapping between the MFCC feature parameters of the two. Then, the normal-speech MFCC feature parameters corresponding to the whispered speech were estimated from the GMM. Finally, the whispered speech was converted into normal speech by inverting these MFCC features. Experimental results showed that the cepstral distortion (CD) of speech converted by the proposed method was 21% lower than that of speech converted using linear predictive coefficient (LPC) features, and that the mean opinion score (MOS) was 3.56, indicating satisfactory intelligibility and sound quality.
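The abstract reports a relative reduction in cepstral distortion but does not give the formula used. As a rough illustration only, the sketch below assumes the commonly used per-frame definition CD = (10 / ln 10) * sqrt(2 * sum_i (c_i - c'_i)^2), averaged over time-aligned frames with the 0th (energy) coefficient excluded. The function names `cepstral_distortion_db` and `relative_reduction` are hypothetical and are not taken from the paper.

```python
import math


def cepstral_distortion_db(ref_frames, conv_frames):
    """Mean cepstral distortion in dB between two time-aligned sequences.

    Each argument is a list of frames, and each frame is a list of cepstral
    coefficients. Assumption: the 0th coefficient has already been removed
    and the two sequences have already been aligned frame by frame.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and time-aligned")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(ref_frames)


def relative_reduction(baseline_cd, proposed_cd):
    """Fractional reduction of the proposed CD relative to a baseline CD."""
    return (baseline_cd - proposed_cd) / baseline_cd
```

Under this definition, a reported "21% lower" CD would correspond to `relative_reduction(baseline_cd, proposed_cd)` being about 0.21, where the baseline is the LPC-based conversion. Whether the paper computes CD in exactly this way is not stated in the abstract.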

Highlights

Whispered speech is a method of articulation different from normal speech [1]; it is produced at a low sound level without vibration of the vocal cords, so its voiced sounds have no fundamental frequency and carry 20 dB less energy than those of normal speech [2]
We report a method for converting whispered speech to normal speech based on Mel frequency cepstral coefficient (MFCC) features and a Gaussian mixture model (GMM)
To account for the sparseness of speech, we propose using the L1/2 algorithm to invert the MFCC features, which produces a good listening result
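The highlights mention a GMM-based mapping between whispered and normal MFCC features without giving its form. One common formulation in voice conversion is joint-density GMM regression, where the converted feature is the posterior-weighted conditional mean of the target given the source. The sketch below is a minimal scalar version under that assumption. The function `gmm_convert` and the component fields are hypothetical, and real MFCC vectors would need full mean vectors and covariance matrices in place of the scalars used here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_convert(x, components):
    """Map a source feature x to a target feature via a joint-density GMM.

    Each component is a dict with keys (all hypothetical names):
      weight  - mixture weight
      mean_x  - source (whispered) mean
      mean_y  - target (normal) mean
      var_x   - source variance
      cov_xy  - source/target covariance
    Returns sum_k p(k | x) * (mean_y_k + cov_xy_k / var_x_k * (x - mean_x_k)).
    """
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components
    ]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("x is too far from every component (likelihood underflow)")

    y = 0.0
    for likelihood, c in zip(likelihoods, components):
        posterior = likelihood / total  # p(k | x)
        conditional_mean = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        y += posterior * conditional_mean
    return y
```

In practice the component parameters would be estimated from time-aligned whispered/normal feature pairs, for example with an EM-based GMM fit. That training step is omitted here.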

Summary

Introduction

Whispered speech is a method of articulation different from normal speech [1]; it is produced at a low sound level without vibration of the vocal cords, so its voiced sounds have no fundamental frequency and carry 20 dB less energy than those of normal speech [2]. Because of these characteristics, whispered speech is widely used in places where loud noise is prohibited, such as conference rooms, libraries, and concert halls.

Methods

Results

Conclusion