Abstract

A language model (LM) is an important component of a speech recognition system. LM performance degrades when the training and test data come from different domains, and language model adaptation compensates for this mismatch. However, there is no public Chinese dataset for evaluating language model adaptation. In this paper, we present CLMAD, a public Chinese dataset for language model adaptation. The dataset consists of four domains: sport, stock, fashion, and finance; we also quantify the differences among these domains. We present baselines for two commonly used adaptation techniques: interpolation for n-gram models, and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, the adapted model improves when the source and target domains are relatively similar, but interpolating LMs from very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or the embedding layer, and the gain from adaptation is significant when the domain difference is large. We also provide speech recognition results on AISHELL-1 with LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/.
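For reference, linear interpolation combines the component models as P(w|h) = λ·P_target(w|h) + (1−λ)·P_source(w|h), with the weight λ typically tuned on target-domain development data. The sketch below illustrates one common way to pick λ by grid search over held-out perplexity; the function names and toy probabilities are illustrative, not taken from the paper.

```python
import math

def interpolated_ppl(probs_src, probs_tgt, lam):
    """Perplexity of the interpolated model P = lam * P_tgt + (1 - lam) * P_src
    on held-out target-domain text. probs_src/probs_tgt are per-token
    probabilities from the two component LMs (assumed smoothed, hence > 0)."""
    log_sum = sum(math.log(lam * pt + (1.0 - lam) * ps)
                  for ps, pt in zip(probs_src, probs_tgt))
    return math.exp(-log_sum / len(probs_src))

def tune_lambda(probs_src, probs_tgt, grid_points=21):
    """Grid-search the interpolation weight on development data."""
    candidates = [i / (grid_points - 1) for i in range(grid_points)]
    return min(candidates,
               key=lambda lam: interpolated_ppl(probs_src, probs_tgt, lam))

# Toy usage: probabilities of four dev-set tokens under each component LM.
p_src = [0.02, 0.10, 0.05, 0.01]
p_tgt = [0.05, 0.08, 0.20, 0.03]
best = tune_lambda(p_src, p_tgt)
print(best, interpolated_ppl(p_src, p_tgt, best))
```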
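The three RNNLM fine-tuning strategies compared here differ only in which parameters remain trainable during adaptation. The following PyTorch sketch illustrates the idea under assumed architecture details (a small LSTM LM with layers named `embedding` and `softmax_layer`); it is not the paper's exact model or training setup.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal LSTM language model (illustrative architecture, not the paper's)."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.softmax_layer = nn.Linear(hidden_dim, vocab_size)  # output projection

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embedding(tokens), state)
        return self.softmax_layer(out), state

def select_trainable(model, strategy):
    """Freeze everything, then unfreeze the part named by `strategy`:
    'all' fine-tunes the whole network, 'softmax' only the output layer,
    'embedding' only the input embeddings."""
    for p in model.parameters():
        p.requires_grad = False
    part = {"all": model, "softmax": model.softmax_layer,
            "embedding": model.embedding}[strategy]
    for p in part.parameters():
        p.requires_grad = True

model = RNNLM(vocab_size=50000)
# model.load_state_dict(torch.load("source_domain.pt"))  # hypothetical checkpoint
select_trainable(model, "all")  # the best-performing strategy per the abstract
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1)
# ...then continue training on target-domain text as usual.
```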
