Abstract

This paper presents a novel noise-robust automatic speech recognition (ASR) system that combines aspects of the noise modeling and source separation approaches to the problem. The combined approach is motivated by the observation that the noise backgrounds encountered in everyday listening situations can be roughly characterized as a slowly varying noise floor in which a mixture of energetic but unpredictable acoustic events is embedded. Our solution combines two complementary techniques. First, an adaptive noise floor model estimates the degree to which high-energy acoustic events are masked by the noise floor; this estimate is represented as a soft missing data mask. Second, a fragment decoding system attempts to interpret the high-energy regions that are not accounted for by the noise floor model. This component uses models of the target speech to decide whether each fragment should be included in the target speech stream. Our experiments on the CHiME corpus task show that the combined approach performs significantly better than systems using either the noise model or the fragment decoding approach alone, and substantially outperforms multicondition training.
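
To make the idea of a soft missing data mask concrete, the following is a minimal sketch and not the system described in the paper. It assumes log-energy features in dB per frequency channel, tracks the noise floor with simple exponential smoothing, and maps the local difference between the observation and the floor through a sigmoid. The function names, the smoothing rate, and the sigmoid slope and offset are all hypothetical choices made for illustration.

```python
import math


def update_noise_floor(floor, frame, rate=0.05):
    """Move each channel of the floor estimate a small step toward the frame.

    Assumption: exponential smoothing stands in for the adaptive noise
    floor model; a small `rate` keeps the estimate slowly varying.
    """
    return [n + rate * (x - n) for n, x in zip(floor, frame)]


def soft_mask(frame, floor, slope=0.5, offset_db=3.0):
    """Return a value in (0, 1) per channel.

    Values near 1 mean the channel stands well above the noise floor
    (unlikely to be masked); values near 0 mean it is probably masked.
    Assumption: a sigmoid of the dB difference, with hypothetical
    `slope` and `offset_db` parameters.
    """
    return [
        1.0 / (1.0 + math.exp(-slope * ((x - n) - offset_db)))
        for x, n in zip(frame, floor)
    ]


if __name__ == "__main__":
    floor = [10.0, 10.0, 10.0]   # current noise floor estimate (dB)
    frame = [40.0, 13.0, 0.0]    # observed log-energies (dB)
    print(soft_mask(frame, floor))
    print(update_noise_floor(floor, frame))
```

In this sketch, channels far above the floor receive mask values close to 1 and would be the high-energy regions left for a fragment decoder to interpret, while channels at or below the floor receive values close to 0.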
