Abstract
When producing speech in noisy backgrounds, talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous studies of this impact have used very small amounts of data and recognition systems that lack modern adaptation strategies. This paper aims to rectify this by using a new audio-visual Lombard corpus containing speech from 54 different speakers (significantly larger than any previously available) and modern state-of-the-art speech recognition techniques.

The paper is organised as three speech-in-noise recognition studies. The first examines the case in which a system is presented with Lombard speech having been trained exclusively on normal speech. It was found that the Lombard mismatch caused a significant decrease in performance even when the level of the Lombard speech was normalised to match the level of normal speech. However, the size of the mismatch was highly speaker-dependent, which explains the conflicting results reported in previous, smaller studies. The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here the Lombard speech affords a large increase in recognition performance. Part of this is due to the greater energy leading to a reduction in noise masking, but performance improvements persist even after the effect of the signal-to-noise ratio difference is compensated. An analysis across speakers shows that the Lombard speech energy is spectro-temporally distributed in a way that reduces energetic masking, and this reduction in masking is associated with an increase in recognition performance. The final study repeats the first two using a recognition system trained on visual speech. In the visual domain, performance differences are not confounded by differences in noise masking. It was found that in matched conditions Lombard speech supports better recognition performance than normal speech. The benefit was consistently present across all speakers, although to a varying degree. Surprisingly, a small Lombard benefit was observed even when training on mismatched non-Lombard visual speech, i.e., the increased clarity of the Lombard speech outweighed the impact of the mismatch.

The paper presents two generally applicable conclusions: i) systems that are designed to operate in noise will benefit from being trained on well-matched Lombard speech data; ii) the results of speech recognition evaluations that employ artificial speech and noise mixing need to be treated with caution: they are overly optimistic to the extent that they ignore a significant source of mismatch, but at the same time overly pessimistic in that they do not anticipate the potentially increased intelligibility of the Lombard speaking style.
Highlights
Automatic speech recognition is finding widespread application in everyday environments
We summarise the main characteristics of Lombard speech and review the impact of the Lombard effect on speech intelligibility
In order to measure how much of the mismatch effect is due to the level difference alone, a further set of ‘compensated’ Lombard (CL) noisy utterances is generated in which the Lombard utterances are normalised to the same energy as the non-Lombard utterances before adding the noise, i.e., this set of noisy Lombard utterances will be at an SNR that matches the non-Lombard data (a minimal sketch of this procedure is given below)
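The compensation step in the last highlight can be summarised in a few lines of signal-level code. The sketch below is illustrative only and is not taken from the paper; the function names, the use of RMS energy for the normalisation, and the example SNR are all assumptions.

    # Illustrative sketch (assumed names, not from the paper): build the
    # 'compensated Lombard' (CL) condition by rescaling each Lombard
    # utterance to the energy of its plain-speech counterpart before
    # adding noise, so both speaking styles are mixed at the same SNR.
    import numpy as np

    def rms(x):
        # Root-mean-square energy of a 1-D sample array.
        return np.sqrt(np.mean(x ** 2))

    def compensate_level(lombard, plain):
        # Rescale the Lombard utterance to the RMS energy of the plain one.
        return lombard * (rms(plain) / rms(lombard))

    def add_noise_at_snr(speech, noise, snr_db):
        # Mix speech with noise scaled so the mixture has the requested SNR.
        noise = noise[: len(speech)]
        target_noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
        return speech + noise * (target_noise_rms / rms(noise))

    # Usage (placeholder sample arrays `plain`, `lombard`, `babble`):
    # noisy_plain = add_noise_at_snr(plain, babble, snr_db=0)
    # noisy_cl = add_noise_at_snr(compensate_level(lombard, plain), babble, snr_db=0)

Because the Lombard utterance is rescaled before mixing, any remaining recognition difference between the plain and CL conditions can be attributed to spectro-temporal properties of the speech rather than to its overall level.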
Summary
Automatic speech recognition is finding widespread application in everyday environments. Speakers are sensitive to the masking effect of background noise and, in challenging communication settings, they reflexively adapt their speech production in ways that counter the effects of noise masking. This adaptation, which includes an increase in signal energy, a tilt of the speech spectrum and an increase in vowel duration, has become known as the Lombard effect, named after Étienne Lombard who first described it in 1909 (Lombard, 1911; Brumm and Zollinger, 2011). The effect will be present to a greater or lesser extent whether humans are conversing with a human partner or with an automatic recognition system. In this paper we follow the classic study of Junqua (1993), using masking noise to induce the effect while talkers read sentence lists.
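As a rough illustration of how two of the Lombard cues mentioned above (the energy increase and the change in spectral tilt) could be quantified, the sketch below compares a plain and a Lombard recording of the same sentence. It is not taken from the paper; the function names, the analysis band, and the tilt estimate (a straight-line fit to the long-term log power spectrum) are all assumptions.

    # Illustrative sketch (assumed names, not from the paper): quantify the
    # level increase and spectral-tilt change between a plain and a Lombard
    # recording of the same sentence, sampled at rate `sr`.
    import numpy as np

    def level_db(x):
        # Overall RMS level in dB (relative to full scale).
        return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

    def spectral_tilt(x, sr, fmin=100.0, fmax=5000.0):
        # Slope of the long-term log power spectrum in dB per octave;
        # Lombard speech typically shows a shallower (less negative) slope.
        power = np.abs(np.fft.rfft(x)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        band = (freqs >= fmin) & (freqs <= fmax)
        slope, _ = np.polyfit(np.log2(freqs[band]),
                              10.0 * np.log10(power[band] + 1e-12), 1)
        return slope

    # Usage (placeholder sample arrays `plain`, `lombard`):
    # print(level_db(lombard) - level_db(plain))                     # expected > 0 dB
    # print(spectral_tilt(lombard, sr) - spectral_tilt(plain, sr))   # expected > 0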