Abstract

Monaural (single-channel) recording is sometimes used for telephone conversations in call centers. Generally speaking, the accuracy of automatic speech recognition of a monaural recording is worse than that of the multi-channel recording of the same conversation where each speaker's voice is separately recorded. The major reason is that the recognition system fails not only at the overlapping segments where the voices of the multiple speakers overlap, but also at the neighboring segments surrounding the overlapping segments. In this paper, we tackle this problem by using a combination of garbage modeling and noise-robust monaural acoustic modeling. Our proposed method trains the models by making use of multi-channel recordings and transcripts, which are relatively easy to prepare than monaural recordings and transcripts. We present experimental results where the proposed methods reduced the error rates by approximately 3% relative to the baseline methods for both of GMM-HMM and CNN-HMM cases. Because the proposed method is quite simple, the proposed method is easy to deploy to wide range of ASR systems for monaural speech transcription.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call