Abstract
This paper proposes a method for estimating the correct phoneme sequence of an utterance using a recurrent neural network (RNN)-based framework for spoken term detection (STD). Because good STD results depend on reducing automatic speech recognition (ASR) errors, we use long short-term memory (LSTM), a type of RNN architecture, as an ASR post-processing step that estimates the correct phoneme sequence of an utterance from the phoneme-based transcriptions produced by ASR systems. We prepare two types of LSTM-based phoneme estimators: one trained on the N-best output of a single ASR system, and the other trained on the 1-best outputs of multiple ASR systems. In an experiment on a correct phoneme estimation task, these LSTM-based estimators generated better phoneme-based N-best transcriptions than the best ASR system did. In particular, the estimator trained on multiple ASR systems' outputs worked well on the estimation task. Moreover, the STD system with the LSTM estimator drastically improved STD performance compared with our previously proposed STD system, which uses a conditional random field-based phoneme estimator.