Abstract

Serbian belongs to a group of highly inflective, morphologically rich languages that use many different word suffixes to express grammatical, syntactic, or semantic features. This behaviour typically produces many recognition errors, especially in large vocabulary systems: even when good acoustic matching leads the automatic speech recognition system to predict the correct lemma, the word ending is often wrong, and the word is nevertheless counted as an error. The effect is larger for contexts not present in the language model training corpus. This manuscript examines an approach to language modeling that takes into account different morphological categories of words, and presents the benefits in terms of word error rate and perplexity. The categories are word type, case, grammatical number, and gender, and all were assigned to words in the system vocabulary where applicable. These additional word features produced significant improvements over the baseline system, for both n-gram-based and neural network-based language models. The proposed system can help eliminate many tedious errors in large vocabulary applications such as dictation, both for Serbian and for other languages with similar characteristics.
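To make the point about word endings concrete, the following minimal sketch computes word error rate with a standard word-level edit distance. The function name and the example sentences are illustrative assumptions, not taken from the paper; the example only shows that a hypothesis with the right lemma but the wrong case ending is scored as a full substitution.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: "školu" (accusative) is recognised as "škola"
# (nominative). The lemma is right, but the word still counts as an error.
wer = word_error_rate("idem u školu", "idem u škola")
print(f"WER = {wer:.3f}")
```

Because the metric compares whole word forms, it gives no partial credit for a correct lemma, which is why inflection errors weigh so heavily in morphologically rich languages.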

Highlights

  • The best language models in existence were statistical models based on n-grams, that is, frequencies or probabilities of individual word sequences up to and including length n [2]. These LMs proved highly effective for an array of applications, even though they had several known problems, e.g., data sparsity and difficulty modeling longer contexts.
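As a rough illustration of the n-gram idea and of data sparsity, the sketch below estimates unsmoothed bigram probabilities from counts. The two-sentence corpus and the function names are assumptions made for this example only; a real system would use a much larger corpus and a smoothing or back-off scheme.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenised sentences."""
    histories = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs
    return histories, bigrams


def bigram_prob(histories, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev), with no smoothing."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]


# Toy corpus (illustrative only).
corpus = ["idem u školu", "idem u grad"]
histories, bigrams = train_bigram(corpus)

print(bigram_prob(histories, bigrams, "u", "školu"))  # seen word form
print(bigram_prob(histories, bigrams, "u", "školi"))  # unseen inflected form
```

The unseen form "školi" receives zero probability even though another form of the same lemma appears in the corpus. This is the sparsity problem, and it is aggravated when every lemma has many inflected forms.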

Summary

Research Article

Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition

Recognition errors caused by wrong word endings are more frequent for contexts not present in the language model training corpus. This manuscript examines an approach to language modeling that takes into account different morphological categories of words, and presents the benefits in terms of word error rate and perplexity.

Case, number, and gender do not apply to invariable words (prepositions, adverbs, conjunctions, particles, and exclamations), even though certain prepositions are always followed by certain cases. The manuscript examines the incorporation of these morphological features into both n-gram-based and RNN-based language models for Serbian, and presents results obtained with the largest Serbian audio database for acoustic modeling and all currently available Serbian textual materials for language model training.

Another approach is to use factored language models (FLMs) [11], which explicitly model relationships between morphological and lexical items in a single language model; a generalised back-off procedure is used during training to improve the robustness of the resulting FLM.
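One possible way to represent the morphological categories attached to vocabulary words is sketched below. The class name, the tag format, and the two lexicon entries are hypothetical and are not the authors' tag set or implementation; the sketch only encodes the rule that case, number, and gender are left empty for invariable words.

```python
from dataclasses import dataclass
from typing import Optional

# Word types listed in the text as invariable.
INVARIABLE_TYPES = {"preposition", "adverb", "conjunction", "particle", "exclamation"}


@dataclass(frozen=True)
class MorphFeatures:
    word_type: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None

    def __post_init__(self):
        # Invariable words must not carry case, number, or gender.
        if self.word_type in INVARIABLE_TYPES and any((self.case, self.number, self.gender)):
            raise ValueError(f"{self.word_type} cannot have case, number, or gender")

    def tag(self) -> str:
        """Join the applicable features into one tag string (assumed format)."""
        parts = (self.word_type, self.case, self.number, self.gender)
        return "+".join(p for p in parts if p is not None)


# Hypothetical lexicon with one reading per word form.
lexicon = {
    "školu": MorphFeatures("noun", "accusative", "singular", "feminine"),
    "u": MorphFeatures("preposition"),
}


def tag_sequence(sentence: str) -> list:
    """Map each word to its tag, or to 'unk' if it is not in the lexicon."""
    return [lexicon[w].tag() if w in lexicon else "unk" for w in sentence.split()]


print(tag_sequence("idem u školu"))
```

A tag sequence of this kind could then feed a language model alongside the words themselves. Many Serbian word forms are ambiguous between several readings, so a real lexicon would need more than one entry per form or a disambiguation step.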
