Abstract

This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. The fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her monographs on the automatic grapheme-to-phoneme conversion of Polish texts. The author used these previously developed rules and independently developed the conversion algorithm. The algorithm has been implemented as a software application called TransFon, which converts any text in Polish orthography into the corresponding strings of phonemes in phonemic transcription. Using TransFon, a phonemic corpus of Polish was created from an orthographic corpus. The phonemic corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. It also opens up further opportunities for research on improving automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora, and the method presented here enables the creation of such corpora.

Highlights

  • Natural language processing often requires grapheme-to-phoneme (G2P) conversion of an orthographic text [1]

  • This grapheme-to-phoneme conversion algorithm for Polish was implemented in the Python programming language as an independent application called TransFon [48]

  • The results of the grapheme-to-phoneme conversion research presented in this paper were compared to other results published in the literature [3,27,39,41–43,45,51,53,55,62–73]

Summary

Introduction

Natural language processing often requires grapheme-to-phoneme (G2P) conversion of an orthographic text [1]. G2P converts strings of graphemes into the corresponding sequences of phonetic transcription characters directly from orthographic representations, and it is crucial for many applications in speech and language processing [2]. Tools for converting graphemes to phonemes are used in theoretical and applied linguistics. Such tools are useful in many areas of linguistic research (e.g., phonetics, phonology, dialectology, and language acquisition) for obtaining preliminary phonetic transcriptions of large language corpora [4]. The main goal of this research on grapheme-to-phoneme conversion is to improve speech recognition for Polish [5,6]. The development of statistical and deep learning methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora [18–25]. The main motivation for undertaking this research on automatic grapheme-to-phoneme conversion and its application was the development of effective methods for creating a phonemic language corpus for Polish, consisting of phonemic transcriptions derived from an orthographic language corpus through grapheme-to-phoneme conversion.
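To make the idea of rule-based G2P concrete, the following minimal Python sketch maps a few Polish graphemes and digraphs to phoneme symbols by greedy longest-match lookup. It is an illustration only: the rule table `TOY_RULES`, the function `toy_g2p`, and the IPA-like symbols are assumptions made for this example, not the Steffen-Batóg rules and not the TransFon implementation. It also ignores context, whereas real Polish rules depend on neighbouring sounds (for example, voicing assimilation and word-final devoicing).

```python
# Hypothetical toy rule table for illustration only.
# It is NOT the Steffen-Batóg rule set and NOT TransFon's code.
# Each entry maps a grapheme or digraph to an IPA-like phoneme symbol.
TOY_RULES = {
    "sz": "ʂ",
    "cz": "tʂ",
    "rz": "ʐ",
    "ch": "x",
    "w": "v",
    "ł": "w",
    "ó": "u",
    "y": "ɨ",
    "c": "ts",
}


def toy_g2p(word):
    """Convert a word to a list of phoneme symbols, trying longer graphemes first."""
    word = word.lower()
    max_len = max(len(grapheme) for grapheme in TOY_RULES)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so that "sz" wins over "s" + "z".
        for size in range(max_len, 0, -1):
            chunk = word[i:i + size]
            if chunk in TOY_RULES:
                phonemes.append(TOY_RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: pass the letter through unchanged.
            phonemes.append(word[i])
            i += 1
    return phonemes


print(toy_g2p("szum"))  # ['ʂ', 'u', 'm']
print(toy_g2p("góra"))  # ['g', 'u', 'r', 'a']
```

Letters without an entry in the table are passed through as they are, so the output mixes phoneme symbols with plain letters; a full rule set would cover every grapheme and its contexts.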

Problem Formulation
Methodology
Conversion Rules
Conversion Algorithm
Results
The PER value for unique words
The PER value for the corpus
Statistical Approach
Conclusions