Abstract
This paper presents the results of a study on modeling the highly inflected Slovenian language. We focus on creating a language model for a large-vocabulary speech recognition system. A new data-driven method is proposed for introducing inflectional morphology into language modeling. The research addresses data sparsity, which results from the complex morphology of the language. We examine the idea of using subword units and attempt to segment words into two such units: stems and endings. No prior knowledge of the language is used. The subword units should fit into the framework of probabilistic language models. We do not seek a morphologically correct decomposition of words, but rather a decomposition that yields the minimum entropy of the training corpus. This entropy is approximated using N-gram models. Despite some seemingly oversimplified assumptions, the subword models improve the applicability of language models trained on a sparse corpus. The experiments used the VEČER newswire text corpus as the training corpus. The test set was taken from the SNABI speech database, because the final models were evaluated in speech recognition experiments on that database. Two different subword-based models are proposed and examined experimentally. The experiments demonstrate that subword-based models which considerably reduce the out-of-vocabulary (OOV) rate improve the speech recognition word error rate (WER) compared with standard word-based models, even though they increase test set perplexity. Subword-based models that improve perplexity, but reduce the OOV rate much less than the former, do not improve speech recognition results.
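To make the idea of entropy-driven stem-ending decomposition concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes a unigram model (N = 1) as the entropy approximation, a greedy per-word search over ending lengths, and a "+" prefix to mark endings. All function names (`split`, `corpus_bits`, `to_units`, `segment`, `oov_rate`) and parameters (`max_ending`, `sweeps`) are hypothetical.

```python
import math
from collections import Counter


def split(word, k):
    """Split a word into a stem and an ending of length k; k == 0 keeps it whole."""
    if k <= 0 or k >= len(word):
        return [word]
    return [word[:-k], "+" + word[-k:]]  # "+" marks an ending (assumed convention)


def corpus_bits(units):
    """Total code length in bits of a unit sequence under a maximum-likelihood
    unigram model. Assumption: N = 1 stands in for the paper's N-gram models."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def to_units(words, cut):
    """Rewrite a word sequence as stem/ending units, given an ending length per word."""
    return [u for w in words for u in split(w, cut.get(w, 0))]


def segment(words, max_ending=3, sweeps=2):
    """Greedy search (an assumption, not the published procedure): for each word
    type, keep the ending length that gives the lowest corpus code length."""
    cut = {w: 0 for w in set(words)}
    for _ in range(sweeps):
        for w in sorted(cut):
            best_k, best_bits = cut[w], corpus_bits(to_units(words, cut))
            for k in range(min(max_ending, len(w) - 1) + 1):
                cut[w] = k
                bits = corpus_bits(to_units(words, cut))
                if bits < best_bits:
                    best_k, best_bits = k, bits
            cut[w] = best_k
    return cut


def oov_rate(train_units, test_units):
    """Fraction of test tokens that never occur in the training data."""
    vocab = set(train_units)
    return sum(u not in vocab for u in test_units) / len(test_units)
```

As a toy illustration of the OOV effect, take the training words hiša, hiše, mesto and the test words hišo, mesta. Every test word is unseen at the word level, yet when each word is split before its final letter, all test stems and endings already occur in the training units. Because the search only accepts a split that lowers the code length, the segmented corpus never costs more bits than the unsegmented one under this simplified model.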
More From: International Journal of Pattern Recognition and Artificial Intelligence