Abstract

This paper concerns the development of an automatic speech recognition system for the Slovenian language. The large number of unique words in inflected languages is identified as the primary cause of performance degradation. The paper focuses on statistical language models and examines a novel variation on n-gram modelling in which the modelling units are stems and endings rather than whole words. Words are decomposed into stems and endings automatically, using only data-driven algorithms. Modelling Slovenian with stems and endings yields a significant reduction in the out-of-vocabulary (OOV) rate. We also discuss corpus-based topic-adapted language models. Language models are most often used in a homogeneous topic environment. Topic detection in a highly inflected language is complicated by the appearance of several word forms derived from the same lemma. This problem is solved by using data-driven algorithms to group words of the same lemma into classes.
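To illustrate why stem-and-ending units can lower the OOV rate, the following is a minimal sketch on a toy vocabulary. It is not the decomposition algorithm used in the paper: the suffix-counting heuristic, the function names (`learn_endings`, `split_word`, `oov_rate`), the parameters `max_len` and `min_stem`, and the toy word lists are all assumptions made for illustration only.

```python
from collections import defaultdict


def learn_endings(vocab, max_len=3, min_stem=3):
    """Hypothetical heuristic: keep suffixes whose stem occurs with
    at least two distinct suffixes in the training vocabulary."""
    stem_to_suffixes = defaultdict(set)
    for word in vocab:
        for k in range(0, max_len + 1):
            cut = len(word) - k
            if cut >= min_stem:
                stem_to_suffixes[word[:cut]].add(word[cut:])
    endings = set()
    for suffixes in stem_to_suffixes.values():
        if len(suffixes) >= 2:
            endings |= suffixes
    endings.discard("")  # the empty suffix is not a modelling unit
    return endings


def split_word(word, endings, min_stem=3):
    """Split at the longest known ending; the ending may be empty."""
    for k in range(len(word) - min_stem, 0, -1):
        if word[-k:] in endings:
            return word[:-k], word[-k:]
    return word, ""


def to_units(words, endings):
    """Flatten words into a sequence of stem and ending units."""
    units = []
    for word in words:
        stem, ending = split_word(word, endings)
        units.append(stem)
        if ending:
            units.append(ending)
    return units


def oov_rate(test_units, known_units):
    """Fraction of test units that were never seen in training."""
    return sum(u not in known_units for u in test_units) / len(test_units)


# Toy data (assumed): inflected forms of Slovenian "lipa" and "miza".
train = ["lipa", "lipe", "lipi", "miza", "mize"]
test = ["mizi", "lipi", "miza"]

endings = learn_endings(train)
word_oov = oov_rate(test, set(train))
unit_oov = oov_rate(to_units(test, endings), set(to_units(train, endings)))
```

In this toy setting the unseen form "mizi" is out of vocabulary at the word level, yet both of its units ("miz" and "i") were observed in training, so it is covered at the stem-and-ending level. A real system would need a more robust, corpus-scale decomposition than this heuristic.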
