Statistical analysis of Polish language corpus for speech recognition application

Piotr Klosowski

doi:10.1109/spa.2016.7763632

Abstract

This article presents original results of a statistical analysis of the Polish language, performed by the author on orthographic and phonemic language corpora. The phonemic language corpus for Polish was developed by automatic grapheme-to-phoneme conversion of the source orthographic language corpus. It contains the most frequently used Polish words written in phonemic notation. The statistical analysis, based on the phonemic language corpus, includes calculating the frequency of occurrence of orthographic and phonemic language components and of their sequences. The resulting statistical language data make it possible to develop statistical word-based and subword-based language models for Polish. The development of statistical methods of speech recognition may be very useful in research on improving automatic speech recognition in Polish based on statistical language models.
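
As an illustration of the kind of frequency-of-occurrence calculation described above, the following is a minimal sketch, not the author's implementation. It counts single components and sequences of components (n-grams) in a token list. The helper names `ngram_counts` and `relative_frequencies` and the toy word list are hypothetical and are not taken from the corpus used in the article.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by tuple."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequencies(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Toy word sequence, assumed for illustration only (not corpus data).
words = "ala ma kota a kot ma ale".split()

word_unigrams = ngram_counts(words, 1)   # single-word frequencies
word_bigrams = ngram_counts(words, 2)    # two-word sequence frequencies

# The same function works at the subword level, here on letters.
# A phonemic analysis would pass a list of phoneme symbols instead.
letter_bigrams = ngram_counts(list("kota"), 2)

word_unigram_freq = relative_frequencies(word_unigrams)
```

Counts of this kind are the usual raw material for word-based and subword-based n-gram language models, where the conditional probability of a component is estimated from the ratio of a sequence count to the count of its prefix.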

Full Text