Abstract
The popular n-gram language model (LM) is weak for infrequent words. Conventional approaches such as class-based LMs pre-define sharing structures (e.g., word classes) to address the problem. However, defining such structures requires prior knowledge, and the context sharing based on them is generally coarse and inaccurate. This paper presents a novel similar word model to enhance infrequent words. In principle, we enrich the context of an infrequent word by borrowing context information from some "similar words." Compared to conventional class-based methods, this new approach offers fine-grained context sharing by referring to the words that best match the target word, and it is more flexible as no sharing structures need to be defined by hand. Experiments on a large-scale Chinese speech recognition task demonstrate that the similar word approach significantly improves performance on infrequent words while keeping performance on general tasks almost unchanged.
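The abstract does not give the paper's exact formulation, but the core idea of borrowing context from similar words can be illustrated with a minimal sketch. The function name, the `weight` parameter, and the count-sharing scheme below are illustrative assumptions, not the authors' method: for a rare word, contexts observed with its similar words contribute a discounted count, so the rare word gains probability mass in contexts it never co-occurred with.

```python
from collections import Counter

def enrich_counts(bigram_counts, rare_word, similar_words, weight=0.3):
    """Borrow context counts for a rare word from its similar words.

    bigram_counts: Counter mapping (history, word) -> count
    similar_words: words assumed similar to rare_word (hypothetical input;
                   the paper would derive these from a similarity measure)
    weight: borrowing strength (an assumed, tunable parameter)
    """
    enriched = Counter(bigram_counts)
    for (history, word), count in bigram_counts.items():
        if word in similar_words:
            # Share a discounted fraction of the similar word's context count.
            enriched[(history, rare_word)] += weight * count
    return enriched

# Toy counts: "cat" is frequent, "ocelot" is rare.
counts = Counter({("the", "cat"): 10, ("a", "cat"): 5, ("the", "ocelot"): 1})
enriched = enrich_counts(counts, "ocelot", ["cat"], weight=0.3)
# The context "a" now supports "ocelot" even though the pair was never observed.
```

After enrichment, the counts would feed a standard n-gram estimator; the fine granularity comes from choosing similar words per target word rather than sharing across a whole predefined class.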
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing