Abstract

Large vocabulary continuous speech recognition (LVCSR) is in growing demand for transcribing daily conversations, yet developing spoken text data to train an LVCSR system is costly and time-consuming. In this paper, we propose a classification-based method to automatically select social media data for constructing a spoken-style language model for LVCSR. Three classification techniques, SVM, CRF, and LSTM, trained on words and parts of speech, are compared for identifying the degree of spoken style in each social media sentence. Spoken-style utterances are chosen by incremental greedy selection based on the score of the SVM or CRF classifier, or on the output classified as “spoken” by the LSTM classifier. With the proposed method, only 51.8, 91.6, and 79.9% of the utterances in a Twitter text collection are marked as spoken utterances by the SVM, CRF, and LSTM classifiers, respectively. A baseline language model is then improved by interpolating it with one trained on these selected utterances. The proposed model is evaluated on two Thai LVCSR tasks: social media conversations and a speech-to-speech translation application. Experimental results show that all three classification-based data selection methods clearly help reduce the overall spoken test set perplexities. Regarding the LVCSR word error rate (WER), they achieve 3.38, 3.44, and 3.39% WER reduction, respectively, over the baseline language model, and 1.07, 0.23, and 0.38% WER reduction, respectively, over the conventional perplexity-based text selection approach.
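The selection step described above can be sketched as ranking sentences by a spoken-style classifier score and greedily keeping the highest-scoring ones. The sketch below is illustrative only: `score_fn` is a hypothetical stand-in for the paper's SVM/CRF scorer, and the keyword-based toy scorer is not part of the original work.

```python
# Minimal sketch, assuming a hypothetical scorer `score_fn` that returns a
# larger value for more spoken-style sentences (as the SVM/CRF do in the paper).

def select_spoken(sentences, score_fn, threshold=0.0):
    """Keep sentences whose spoken-style score exceeds `threshold`,
    ordered from most to least spoken-like (incremental greedy selection)."""
    scored = sorted(((score_fn(s), s) for s in sentences), reverse=True)
    return [s for score, s in scored if score > threshold]

# Toy usage with an illustrative keyword-based scorer (an assumption,
# not the paper's classifier).
SPOKEN_CUES = {"lol", "gonna", "yeah"}
toy_score = lambda s: sum(w in SPOKEN_CUES for w in s.split()) - 0.5

tweets = ["yeah gonna watch it lol", "the committee approved the budget"]
print(select_spoken(tweets, toy_score))  # keeps only the conversational tweet
```

An LSTM classifier with a hard "spoken"/"written" label fits the same interface by returning, say, 1 or 0 instead of a graded score.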

Highlights

  • Large vocabulary continuous speech recognition (LVCSR) systems play an increasingly significant role in daily life

  • The support vector machine (SVM) or conditional random field (CRF) classifier gives each sentence an output score indicating its degree of spoken style, i.e., a large score for “spoken” and a small score for “written.” In the long short-term memory neural network (LSTM) case, each sentence is directly classified as “spoken” or “written” with no score

  • In this paper, we explored the possibility of using data from social media such as Twitter to compensate for the lack of large text corpora for LVCSR language modeling


Summary

Introduction

Large vocabulary continuous speech recognition (LVCSR) systems play an increasingly significant role in daily life. Many commercial applications of LVCSR are widely employed, e.g., medical dictation, weather information retrieval, data entry, speech transcription, speech-to-speech translation, and railway reservation. In some systems, e.g., speech-to-speech translation and interactive voice response (IVR) for customer service, speech input is highly conversational, while in medical dictation it is closer to a written style. A spoken language and a written language differ in several aspects, including word choice and sentence structure. It is therefore important to consider the language style when creating an efficient language model (LM) for an LVCSR system. Prior studies on data selection showed that such techniques could produce a better domain-specific LM than random data selection
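The abstract mentions improving a baseline LM by interpolating it with one trained on the selected utterances. A common way to do this is linear interpolation of the two models' conditional probabilities; the sketch below assumes this standard formulation, and the weight value shown is arbitrary (in practice it would be tuned on held-out data).

```python
# Minimal sketch of linear LM interpolation, assuming two pre-trained models
# that each give P(w | h) for a word w and history h.

def interpolate(p_baseline, p_spoken, lambda_=0.5):
    """P(w|h) = lambda * P_spoken(w|h) + (1 - lambda) * P_baseline(w|h).

    lambda_ is the interpolation weight of the spoken-style model;
    0.5 is an arbitrary placeholder, not a value from the paper.
    """
    return lambda_ * p_spoken + (1 - lambda_) * p_baseline

# Toy usage: a word that is rare in written text but common in spoken text.
print(interpolate(0.02, 0.10, lambda_=0.3))  # approximately 0.044
```

Because the weights sum to one, the interpolated values remain a valid probability distribution as long as each component model is one.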


