Abstract

Missing-data methods aim to make speech recognition more robust by distinguishing between reliable and unreliable data in the time-frequency domain. Such methods require a binary mask that labels a time-frequency region of a noisy speech signal as reliable if it contains more speech energy than noise energy, and as unreliable otherwise. Current methods for estimating the mask rely mainly on bottom-up speech separation cues such as harmonicity, and they produce labeling errors that degrade recognition performance. We propose a two-stage recognition system that improves mask estimation and yields better recognition results. First, an n-best lattice consistent with the speech separation mask is generated. The lattice is then re-scored by expanding the mask, using a model-based hypothesis test to determine the reliability of individual time-frequency regions. Systematic evaluations show a significant improvement in recognition performance compared with recognition using the speech separation mask.
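The labeling rule for the binary mask can be illustrated with a minimal sketch. It assumes the speech and noise energies of each time-frequency cell are known separately, which holds only for an idealized (oracle) mask; the system described above has to estimate the mask from the noisy signal. The function name `binary_mask` and the toy energy values are hypothetical and not taken from the paper.

```python
def binary_mask(speech_energy, noise_energy):
    """Label each time-frequency cell 1 (reliable) if its speech energy
    exceeds its noise energy, and 0 (unreliable) otherwise."""
    return [
        [1 if s > n else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]


# Toy energies (made-up values): rows are frequency channels,
# columns are time frames.
speech = [[4.0, 0.5, 3.0],
          [0.2, 2.5, 0.1]]
noise = [[1.0, 1.0, 3.5],
         [0.3, 1.0, 0.1]]

mask = binary_mask(speech, noise)
print(mask)  # [[1, 0, 0], [0, 1, 0]]
```

Cells where the two energies are equal are labeled unreliable here, following the "more speech energy than noise energy" wording; a different tie-breaking rule would be an equally valid assumption.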
