Abstract

In recent years, automatic speech recognition (ASR) systems have often used language models as an adjunct to the ASR model. The density ratio approach (DRA) is one such language model integration method. Japanese has a much larger character inventory than alphabetic languages, and readings vary because of homonyms and characters with multiple pronunciations. It was unclear whether the “implicit language information” of a character-based encoder–decoder ASR model using beam search can be approximated by an external language model. In our experiments, we applied the DRA to a Japanese encoder–decoder ASR model to reduce the character error rate (CER) in cross-domain scenarios. Cross-domain CERs were calculated for the Japanese academic presentation speech (APS) corpus and the Japanese simulated presentation speech (SPS) corpus. The DRA achieved relative error reductions of 11.0% and 22.5% with the RNN and Transformer models, respectively, compared to shallow fusion. To investigate applicability across different speaking styles and domains, we also conducted an experiment replacing the “implicit language information” inside the CSJ ASR model with a Mainichi Shimbun language model. On the JNAS task, the DRA achieved a relative error reduction of 7.3% compared to the shallow fusion method.
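The scoring difference between shallow fusion and the density ratio approach can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight names `lam_ext` and `lam_src` and the toy log-probability values are assumptions. Shallow fusion adds an external LM score to the ASR hypothesis score, while the DRA additionally subtracts a source-domain LM score, so the external (target-domain) LM effectively replaces the ASR model's implicit language information.

```python
def shallow_fusion_score(asr_logp, ext_lm_logp, lam_ext=0.5):
    """Shallow fusion: add the weighted external LM log-probability
    to the ASR model's hypothesis log-probability."""
    return asr_logp + lam_ext * ext_lm_logp


def density_ratio_score(asr_logp, ext_lm_logp, src_lm_logp,
                        lam_ext=0.5, lam_src=0.5):
    """Density ratio approach: also subtract the weighted
    source-domain LM log-probability, cancelling the implicit LM
    learned from the ASR training data."""
    return asr_logp + lam_ext * ext_lm_logp - lam_src * src_lm_logp


# Toy rescoring of one beam-search hypothesis (illustrative values).
asr_logp = -1.0      # ASR decoder score for a candidate character sequence
ext_lm_logp = -2.0   # target-domain LM score
src_lm_logp = -3.0   # source-domain LM score (training-data domain)

sf = shallow_fusion_score(asr_logp, ext_lm_logp)          # -2.0
dra = density_ratio_score(asr_logp, ext_lm_logp, src_lm_logp)  # -0.5
```

In beam search, each partial hypothesis would be rescored this way at every decoding step; hypotheses the source-domain LM favors but the target-domain LM does not are penalized under the DRA.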
