Abstract

Speech communication and interaction occur in a wide variety of environments. Environmental noise significantly degrades speech intelligibility during both speaking and listening. In the listening stage in particular, listeners still have difficulty obtaining information even when the multimedia terminal outputs clean speech. Intelligibility enhancement (IENH) of speech is a technique for overcoming environmental noise in the listening stage: it perceptually enhances clean (non-noisy) speech. This study focuses on IENH via normal-to-Lombard speech conversion, inspired by a well-known acoustic mechanism, the Lombard effect. Our method combines a long short-term memory (LSTM) network and a Bayesian Gaussian mixture model (BGMM) to build a conversion architecture. Compared with the baselines, it has three main advantages: 1) an LSTM network is used for spectral tilt mapping, taking full account of short-term correlations and exploiting its high-dimensional representation ability; 2) the aperiodicity (AP) is mapped together with the fundamental frequency ($F_0$) by a BGMM, which accounts for the constraints relating the two and for the importance of APs; 3) gender-dependent mapping is used for $F_0$ and APs to account for distribution differences between genders. Experiments indicate that our method achieves better performance in both objective and subjective tests.
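To make the mixture-based mapping idea concrete, the following is a minimal sketch of joint-density conversion with a Gaussian mixture, reduced to one source and one target dimension (for example, log-$F_0$ of normal speech to log-$F_0$ of Lombard speech). It is not the authors' implementation: the function names, the dictionary keys, and all parameter values are hypothetical, the mixture parameters are assumed to be already fitted (the paper's Bayesian fitting and the joint treatment of APs are omitted), and the separate per-gender models only illustrate the gender-dependent idea.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, components):
    """Return E[y | x] under a joint Gaussian mixture over (x, y).

    Each component is a dict with hypothetical keys:
    weight, mean_x, mean_y, var_x, cov_xy.
    """
    # Unnormalized posterior of each component given the source value only.
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components
    ]
    total = sum(likelihoods)
    y = 0.0
    for lik, c in zip(likelihoods, components):
        posterior = lik / total
        # Per-component linear regression from source to target.
        slope = c["cov_xy"] / c["var_x"]
        y += posterior * (c["mean_y"] + slope * (x - c["mean_x"]))
    return y


# Made-up illustrative parameters, one single-component model per gender.
models = {
    "male": [
        {"weight": 1.0, "mean_x": 4.7, "mean_y": 4.9, "var_x": 0.04, "cov_xy": 0.03},
    ],
    "female": [
        {"weight": 1.0, "mean_x": 5.3, "mean_y": 5.45, "var_x": 0.04, "cov_xy": 0.03},
    ],
}


def convert_by_gender(x, gender):
    """Pick the gender-specific mixture before converting."""
    return convert(x, models[gender])
```

With a single component the mapping reduces to a straight line through the component means; with several components, the posterior weights blend the per-component regressions smoothly along the source axis.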
