Abstract

The incorporation of grammatical information into speech recognition systems is often used to increase performance in morphologically rich languages. However, this introduces demands for sufficiently large training corpora and proper methods of using the additional information. In this paper, we present a method for building factored language models that use data obtained by morphosyntactic tagging. The models use only the relevant factors that help to increase performance and ignore data from other factors, thus also reducing the need for large morphosyntactically tagged training corpora. Which factors are relevant is determined at run-time, based on the current text segment being estimated, i.e., the context. We show that using a context-dependent model in a two-pass recognition algorithm, the overall speech recognition accuracy in a Broadcast News application improved by 1.73% relative, while simpler models using the same data achieved only a 0.07% improvement. We also present a more detailed error analysis based on lexical features, comparing first-pass and second-pass results.
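
For background, factored language models in the usual formulation (Bilmes and Kirchhoff) treat each word as a bundle of K factors, such as the surface form, its lemma, and its morphosyntactic tag, and condition the probability of the next word on the factors of the preceding words. A sketch of this general form, not the exact parameterization used in the paper:

    p(w_t \mid w_{t-1}, w_{t-2}) \approx p\big(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}\big)

When a factor combination is unseen in training, the model backs off by dropping factors; which factor to drop first is the choice the context-dependent method described above makes at run-time.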

Highlights

  • Speech recognition still performs poorly in inflectional languages compared to mainstream languages like English

  • We evaluated the performance of the proposed language models in a large vocabulary continuous speech recognition (LVCSR) application for an inflective language, namely Slovene Broadcast News transcription

  • Results show that the real-time factor (RTF) increases by a factor of 2 when the vocabulary size is increased from 60K to 300K, and by a factor of 3 when trigram models are used instead of bigram models

Summary

Introduction

Speech recognition still performs poorly in inflectional languages compared to mainstream languages like English. We can build FLMs with a limited number of factors in each probability estimation, which can improve recognition performance while avoiding new data sparsity problems. This method makes use of grammatical properties of the target language, as the process of determining the backoff path searches for specific correlations in a given sentence structure. Sak [11] showed for Turkish that the improvements achieved by using FLMs rather than traditional n-gram models are greater when only limited-size corpora are available for training; the improvements decreased as the corpus size increased. These results are relevant to speech recognition in specific domains with limited training data, where data sparsity becomes a problem for word-based models.
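
To make the backoff-path idea concrete, below is a minimal, illustrative Python sketch of a bigram-style factored model with a fixed backoff path over assumed factor names ("word", "lemma", "msd"). It is not the authors' implementation, which selects the path per context at run-time and uses proper smoothing; it only shows how dropping factors trades specificity for coverage when counts are sparse:

    # Minimal sketch of factored-LM estimation with a backoff path over
    # factors. The factor names ("word", "lemma", "msd") and the count
    # tables are illustrative assumptions, not the paper's actual system.
    from collections import defaultdict

    class FactoredLM:
        def __init__(self):
            # counts[conditioning_factors][next_word] -> count
            self.counts = defaultdict(lambda: defaultdict(int))
            self.totals = defaultdict(int)

        def train(self, sentences):
            # Each token is a dict of factors, e.g.
            # {"word": "hiša", "lemma": "hiša", "msd": "Ncfsn"}.
            for sent in sentences:
                for prev, cur in zip(sent, sent[1:]):
                    for factors in self._backoff_path(prev):
                        key = tuple(sorted(factors.items()))
                        self.counts[key][cur["word"]] += 1
                        self.totals[key] += 1

        def _backoff_path(self, token):
            # Fixed backoff path: full factor bundle, then lemma + tag,
            # then tag only. A context-dependent model would reorder this
            # per sentence context instead of hard-coding it.
            yield dict(token)
            yield {k: v for k, v in token.items() if k != "word"}
            yield {k: v for k, v in token.items() if k == "msd"}

        def prob(self, word, prev):
            # Use the most specific factor combination actually observed;
            # back off to coarser ones when counts are missing.
            for factors in self._backoff_path(prev):
                key = tuple(sorted(factors.items()))
                if self.counts[key][word] > 0:
                    return self.counts[key][word] / self.totals[key]
            return 1e-7  # floor for unseen events (no smoothing here)

Usage with a two-token Slovene example (the words and MULTEXT-East-style tags are illustrative):

    sent = [{"word": "velika", "lemma": "velik", "msd": "Agpfsn"},
            {"word": "hiša", "lemma": "hiša", "msd": "Ncfsn"}]
    lm = FactoredLM()
    lm.train([sent])
    print(lm.prob("hiša", sent[0]))  # 1.0: seen with the full factor set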

Factored language models
Initial algorithm
Results
Conclusions
