Heterophonic speech recognition using composite phones.

Ashraf Alkhairy, Afshan Jafri

doi:10.1186/s40064-016-3332-9

Open Access

Abstract

Heterophones, words that have the same spelling but different pronunciations, pose challenges during the training of automatic speech recognition (ASR) systems because the pronunciation of a word's orthographic representation is ambiguous. This paper addresses the problem of heterophonic languages by developing the concept of a Composite Phoneme (CP) as a basic pronunciation unit for speech recognition. A CP is a set of alternative sequences of phonemes. CPs are developed specifically in the context of Arabic by defining phonetic units that are consonant-centric and absorb the phonemically contrastive short vowels and gemination that are not represented in the Arabic Modern Orthography (MO). CPs alleviate the need to diacritize MO into Classical Orthography (CO), to represent short vowels and stress, before generating pronunciations in terms of Simple Phonemes (SP). We develop algorithms that generate CP pronunciations from MO and SP pronunciations from CO, so that each word maps to a single pronunciation. We investigate the performance of CP, SP, UG (Undiacritized Grapheme), and DG (Diacritized Grapheme) ASR systems. The experimental results suggest that UG and DG are inferior to SP and CP. For the A-SpeechDB corpus with an MO vocabulary of 8000 words, the word error rates (WERs) with a bigram model and context-dependent phones are 11.78, 12.64, and 13.59 % for CP, SP_M (SP from manually diacritized CO), and SP_A (SP from automatically diacritized MO), respectively. For a vocabulary of 24,000 MO words, the corresponding WERs are 13.69, 15.08, and 16.86 %. With a uniform statistical model, SP has a lower WER than CP. With context-independent (CI) phones, CP has a lower WER than SP.
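The results above are reported as word error rates. As a reminder of what that metric measures, here is a minimal sketch of the standard WER computation (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Tokenization by whitespace is an assumption made for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, a four-word reference recognized with one substituted word and one dropped word has two errors, so the sketch returns a WER of 0.5 (50 %).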

Highlights

A standard automatic speech recognition (ASR) system consists of a language model (LM) that governs the sequence of words in an utterance, a dictionary that maps words into sequences of pronunciation units, and Hidden Markov Models (HMMs) corresponding to pronunciation units that stochastically model acoustic events (Huang and Acero 2001).
For languages with a deep orthography, the Grapheme approach can be inferior to the Simple Phoneme (SP) approach by up to 10 % word error rate (WER), depending on the task and the complexity of the mapping from orthography to pronunciation (Kanthak and Ney 2002; Magimai-Doss et al. 2003a, b).
To address the problem of HMM training for heterophonic languages, this paper develops the concept of a Composite Phoneme (CP): a set of alternative sequences of phonemes, such as a syllable with multiple vowel choices.
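To make the idea of "a set of alternative sequences of phonemes" concrete, the following toy sketch models a consonant-centric unit that absorbs optional gemination and an optional short vowel. Everything in it is an assumption for illustration: the Latin transliteration symbols, the three-vowel inventory, and the helper names `composite_phoneme`, `cp_pronunciation`, and `covers` are hypothetical and are not the paper's actual CP inventory or generation algorithm.

```python
from itertools import product

# Toy short-vowel inventory; "" means no short vowel follows the consonant.
# These transliteration symbols are an assumption for this sketch.
SHORT_VOWELS = ("", "a", "i", "u")


def composite_phoneme(consonant):
    """Return a toy CP: all alternative phoneme sequences for one consonant.

    Each alternative is the consonant, optionally geminated (doubled),
    optionally followed by one short vowel.
    """
    alternatives = set()
    for geminated, vowel in product((False, True), SHORT_VOWELS):
        sequence = (consonant, consonant) if geminated else (consonant,)
        if vowel:
            sequence += (vowel,)
        alternatives.add(sequence)
    return frozenset(alternatives)


def cp_pronunciation(skeleton):
    """Map an undiacritized consonant skeleton to one CP per consonant."""
    return [composite_phoneme(consonant) for consonant in skeleton]


def covers(cp_sequence, segments):
    """Check whether a fully vowelled pronunciation fits the CP sequence."""
    return len(cp_sequence) == len(segments) and all(
        segment in cp for cp, segment in zip(cp_sequence, segments)
    )


# One CP pronunciation for the skeleton "ktb" covers several vowellings.
ktb = cp_pronunciation("ktb")
kataba = [("k", "a"), ("t", "a"), ("b", "a")]
kutub = [("k", "u"), ("t", "u"), ("b",)]
print(covers(ktb, kataba), covers(ktb, kutub))
```

In this sketch a single dictionary entry in CP units stands in for every vowelling of the skeleton, which is the sense in which a CP lets a word map to one pronunciation without diacritization.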

Summary

Introduction

A standard automatic speech recognition (ASR) system consists of a language model (LM) that governs the sequence of words in an utterance, a dictionary that maps words into sequences of pronunciation units, and Hidden Markov Models (HMMs) corresponding to pronunciation units that stochastically model acoustic events (Huang and Acero 2001). The pronunciation units could correspond either to a phoneme (an individual phonetic segment) or to a syllable (a structured sequence of phonemes), with the choice depending on the characteristics of a given language's phonological system. In an ASR system that uses the phoneme, phonetic segments could correspond to O(N) context-independent Monophones for a language with N phonemes. HMMs are trained on phonetic transcriptions of speech utterances, which are derived from orthographic transcriptions using an ortho-phonetic mapping. One important problem that arises during the training phase is the ambiguity posed by heterophones: words that have the same orthographic representation but different pronunciations (e.g., in English, the noun "bow", referring to a weapon, and the verb "bow", referring to a gesture of respect) (Wikipedia 2016).
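The training-time ambiguity described above can be illustrated with a small sketch: when a dictionary lists more than one pronunciation for a word, every orthographic transcript containing that word expands into several candidate phonetic transcriptions. The two-word lexicon, its ARPAbet-style symbols, and the helper names `heterophones` and `candidate_transcriptions` are assumptions chosen for illustration, not data or code from the paper.

```python
from itertools import product

# Toy pronunciation dictionary (an assumption for this sketch).
# "bow" has two pronunciations: the weapon and the gesture of respect.
LEXICON = {
    "bow": [("B", "OW"), ("B", "AW")],
    "down": [("D", "AW", "N")],
}


def heterophones(lexicon):
    """Return the words that have more than one listed pronunciation."""
    return sorted(word for word, prons in lexicon.items() if len(prons) > 1)


def candidate_transcriptions(transcript, lexicon):
    """Expand an orthographic transcript into all phonetic transcriptions.

    Each candidate is a flat list of phone symbols; the number of
    candidates is the product of the per-word pronunciation counts.
    """
    per_word = [lexicon[word] for word in transcript.split()]
    return [
        [phone for pron in choice for phone in pron]
        for choice in product(*per_word)
    ]


print(heterophones(LEXICON))
for phones in candidate_transcriptions("bow down", LEXICON):
    print(" ".join(phones))
```

Without further information, a trainer cannot tell which of the candidates matches the audio, which is the problem that a single-pronunciation unit is meant to sidestep.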

Journal: SpringerPlus
Publication Date: Nov 24, 2016
License type: CC BY 4.0

