Automatic long audio alignment for conversational Arabic speech

Mohamed Elmahdy

doi:10.5339/qfarf.2013.ictp-03

Abstract

Long audio alignment is a known problem in speech processing in which the goal is to align a long audio input with its corresponding text. Accurate alignments help in many speech processing tasks, such as audio indexing, acoustic model training for speech recognizers, and audio summarization and retrieval. In this work, we collected more than 1400 hours of conversational Arabic speech extracted from Al-Jazeerah podcasts, along with the corresponding non-aligned text transcriptions. Podcast length varies from 20 to 50 minutes. Five episodes were manually aligned for use in evaluating alignment accuracy. For each episode, a split-and-merge segmentation approach is applied to divide the audio file into small segments with an average length of 5 seconds, with filled pauses on the boundary of each segment. A pre-processing stage is applied to the corresponding raw transcriptions to remove titles, headings, images, speakers' names, etc. A biased language model (LM) is trained on the fly using the processed text. Conversational Arabic speech is mostly spontaneous and influenced by dialectal Arabic. Since phonemic pronunciation modeling is not always possible for non-standard Arabic words, a graphemic pronunciation model (PM) is used to generate one pronunciation variant for each word. Unsupervised acoustic model (AM) adaptation is applied to a pre-trained Arabic acoustic model using the current podcast audio. The adapted AM, along with the biased LM and the graphemic PM, is used in a fast speech recognition pass over the current podcast's segments. The recognizer's output is aligned with the processed transcriptions using the Levenshtein distance algorithm. This ensures error recovery: misalignment of a given segment does not affect the alignment of later segments. The proposed approach resulted in an alignment accuracy of 97% on the evaluation set.
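The alignment step can be illustrated with a minimal sketch, assuming a word-level Levenshtein alignment between the transcript and the concatenated recognizer output. The function names `align_words` and `segment_spans` are hypothetical, and the abstract does not specify the alignment unit or how ties are broken, so this is an illustration rather than the author's implementation.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def align_words(ref: List[str], hyp: List[str]) -> List[Pair]:
    """Levenshtein-align two word lists.

    Returns (ref_idx, hyp_idx) pairs in order; None marks an
    insertion or deletion on that side.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover one optimal path.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((i - 1, j - 1))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None))   # transcript word with no hypothesis word
            i -= 1
        else:
            pairs.append((None, j - 1))   # hypothesis word with no transcript word
            j -= 1
    pairs.reverse()
    return pairs


def segment_spans(ref: List[str],
                  segments: List[List[str]]) -> List[Optional[Tuple[int, int]]]:
    """Map each recognized segment to an inclusive span of transcript word indices."""
    hyp = [w for seg in segments for w in seg]
    # seg_of[j] = index of the segment that hypothesis word j came from
    seg_of = [k for k, seg in enumerate(segments) for _ in seg]
    spans: List[Optional[Tuple[int, int]]] = [None] * len(segments)
    for r, h in align_words(ref, hyp):
        if r is None or h is None:
            continue
        k = seg_of[h]
        current = spans[k]
        spans[k] = (r, r) if current is None else (min(current[0], r), max(current[1], r))
    return spans
```

Because the alignment is computed over the whole episode at once, a segment that is recognized poorly only disturbs its own neighbourhood: later segments still anchor on their own correctly recognized words, which is the error-recovery property the abstract describes.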
Most misalignment errors were found in segments with significant background noise (music, channel noise, cross-talk, etc.) or significant speech disfluencies (truncated words, repeated words, hesitations, etc.). For some speech processing tasks, such as acoustic model training, misaligned segments must be eliminated from the training data. For this reason, a confidence scoring metric is proposed to accept or reject the aligner output. The score is provided for each segment and is essentially the minimum edit distance between the recognizer's output and the aligned text. Using confidence scores, it was possible to reject the majority of misaligned segments, resulting in 99% alignment accuracy. This work was funded by a grant from the Qatar National Research Fund under its National Priorities Research Program (NPRP), award number NPRP 09-410-1-069. The reported experimental work was performed at Qatar University in collaboration with the University of Illinois.
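The per-segment confidence score described in the abstract can be sketched as follows. The abstract states only that the score is essentially the minimum edit distance between the recognizer's output and the aligned text; computing it over words, normalizing by the length of the aligned text, and the threshold `max_rate=0.3` are all assumptions made for illustration, and the names `edit_distance` and `accept_segment` are hypothetical.

```python
from typing import List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Minimum number of word insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (x != y),  # match or substitution
                            prev[j] + 1,             # deletion from a
                            curr[j - 1] + 1))        # insertion into a
        prev = curr
    return prev[-1]


def accept_segment(hyp_words: List[str],
                   aligned_words: List[str],
                   max_rate: float = 0.3) -> bool:
    """Accept a segment when its normalized edit distance is small enough.

    max_rate is an assumed, illustrative threshold; it is not taken from the source.
    """
    dist = edit_distance(hyp_words, aligned_words)
    rate = dist / max(len(aligned_words), 1)
    return rate <= max_rate
```

A lower threshold rejects more segments and trades coverage for cleaner training data; a suitable value would need to be tuned on manually aligned episodes such as the five-episode evaluation set.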
