Abstract

Large vocabulary continuous speech recognition (LVCSR) is in growing demand for transcribing daily conversations, yet developing spoken text data to train an LVCSR system is costly and time-consuming. In this paper, we propose a classification-based method to automatically select social media data for constructing a spoken-style language model for LVCSR. Three classification techniques, SVM, CRF, and LSTM, trained on words and parts of speech, are compared for identifying the degree of spoken style in each social media sentence. Spoken-style utterances are chosen by incremental greedy selection based on the score of the SVM or CRF classifier, or on the output classified as “spoken” by the LSTM classifier. With the proposed method, only 51.8, 91.6, and 79.9% of the utterances in a Twitter text collection are marked as spoken utterances by the SVM, CRF, and LSTM classifiers, respectively. A baseline language model is then improved by interpolating it with one trained on these selected utterances. The proposed model is evaluated on two Thai LVCSR tasks: social media conversations and a speech-to-speech translation application. Experimental results show that all three classification-based data selection methods clearly help reduce the overall spoken test set perplexities. Regarding the LVCSR word error rate (WER), they achieve 3.38, 3.44, and 3.39% WER reduction, respectively, over the baseline language model, and 1.07, 0.23, and 0.38% WER reduction, respectively, over the conventional perplexity-based text selection approach.
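The selection step described above can be sketched as ranking sentences by a spoken-style classifier score and greedily keeping the highest-scoring ones. The sketch below is illustrative only: `score_fn` is a hypothetical stand-in for the paper's SVM/CRF scorer, and the keyword-based toy scorer is not part of the original work.

```python
# Minimal sketch, assuming a hypothetical scorer `score_fn` that returns a
# larger value for more spoken-style sentences (as the SVM/CRF do in the paper).

def select_spoken(sentences, score_fn, threshold=0.0):
    """Keep sentences whose spoken-style score exceeds `threshold`,
    ordered from most to least spoken-like (incremental greedy selection)."""
    scored = sorted(((score_fn(s), s) for s in sentences), reverse=True)
    return [s for score, s in scored if score > threshold]

# Toy usage with an illustrative keyword-based scorer (an assumption,
# not the paper's classifier).
SPOKEN_CUES = {"lol", "gonna", "yeah"}
toy_score = lambda s: sum(w in SPOKEN_CUES for w in s.split()) - 0.5

tweets = ["yeah gonna watch it lol", "the committee approved the budget"]
print(select_spoken(tweets, toy_score))  # keeps only the conversational tweet
```

An LSTM classifier with a hard "spoken"/"written" label fits the same interface by returning, say, 1 or 0 instead of a graded score.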

Highlights

  • Large vocabulary continuous speech recognition (LVCSR) systems play an increasingly significant role in daily life

  • The support vector machine (SVM) or conditional random field (CRF) classifier gives each sentence an output score indicating its degree of spoken style, i.e., a large score for “spoken” and a small score for “written.” In the long short-term memory neural network (LSTM) case, each sentence is directly classified as “spoken” or “written” with no score

  • In this paper, we explored the possibility of using data from social media such as Twitter to compensate for the lack of large text corpora for LVCSR language modeling


Summary

Introduction

Large vocabulary continuous speech recognition (LVCSR) systems play an increasingly significant role in daily life. Many commercial applications of LVCSR are widely employed, e.g., medical dictation, weather information retrieval, data entry, speech transcription, speech-to-speech translation, and railway reservation. In some systems, e.g., speech-to-speech translation and interactive voice response (IVR) for customer service, speech input is highly conversational, while in medical dictation it is closer to a written style. A spoken language and a written language differ in several aspects, including word choice and sentence structure. It is therefore important to consider the language style when creating an efficient language model (LM) for an LVCSR system. Prior studies on data selection showed that such techniques could produce a better domain-specific LM than random data selection
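The abstract mentions improving a baseline LM by interpolating it with one trained on the selected utterances. A common way to do this is linear interpolation of the two models' conditional probabilities; the sketch below assumes this standard formulation, and the weight value shown is arbitrary (in practice it would be tuned on held-out data).

```python
# Minimal sketch of linear LM interpolation, assuming two pre-trained models
# that each give P(w | h) for a word w and history h.

def interpolate(p_baseline, p_spoken, lambda_=0.5):
    """P(w|h) = lambda * P_spoken(w|h) + (1 - lambda) * P_baseline(w|h).

    lambda_ is the interpolation weight of the spoken-style model;
    0.5 is an arbitrary placeholder, not a value from the paper.
    """
    return lambda_ * p_spoken + (1 - lambda_) * p_baseline

# Toy usage: a word that is rare in written text but common in spoken text.
print(interpolate(0.02, 0.10, lambda_=0.3))  # approximately 0.044
```

Because the weights sum to one, the interpolated values remain a valid probability distribution as long as each component model is one.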


