Prosody prediction for Arabic via the open-source Boundary-Annotated Qur’an Corpus

M S Sawalha, E Atwell, C Brierley

doi:10.20396/joss.v2i2.15038

Open Access

https://doi.org/10.20396/joss.v2i2.15038

Journal: Journal of Speech Sciences
Publication Date: Feb 4, 2021
Citations: 6
License type: CC BY 4.0

Affiliation: University of Leeds

Abstract

A phrase break classifier is needed to predict natural chunks in text to be read out loud by humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur’an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive and signifies a widely used recitation style, one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur’an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in the recognition of minority class instances with both taggers, as measured by the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
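To illustrate why plain accuracy can mislead on a task dominated by non-breaks, the sketch below compares accuracy with a Balanced Classification Rate on a toy label sequence. It assumes the common definition of BCR as the mean of sensitivity and specificity; the abstract does not spell out the formula, and the function names and toy labels here are illustrative only, not taken from the paper.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of words whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def balanced_classification_rate(
    gold: Sequence[str], pred: Sequence[str], positive: str = "break"
) -> float:
    """Mean of sensitivity and specificity (assumed definition of BCR)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Toy data (hypothetical): six non-breaks followed by one break.
gold = ["non-break"] * 6 + ["break"]

# A majority-class baseline labels every word as a non-break.
majority = ["non-break"] * len(gold)

baseline_accuracy = accuracy(gold, majority)
baseline_bcr = balanced_classification_rate(gold, majority)
```

On this toy sequence the majority-class baseline scores high on accuracy while finding no breaks at all, which BCR penalises. This is the kind of gap a class-balanced metric is meant to expose.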

Highlights

An accepted Universal of language is that people process speech in chunks (1), which in turn can be interpreted syntactically as function word groups (2) and prosodically as tone units (3, 4).
Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert English linguist.
The immediate research question pertaining to this study is: Can we successfully recapture prosodic boundaries authenticated by Tajwid recitation markup using probabilistic taggers trained and tested on our Boundary-Annotated Qur’an Corpus?

Summary

Introduction

An accepted Universal of language is that people process speech (and text) in chunks (1), which in turn can be interpreted syntactically as function word groups (2) and prosodically as tone units (3, 4). A phrase break classifier is needed to predict natural chunks in text to be read out loud by humans or machines. Phrase break prediction is a classification task within the Text-to-Speech synthesis pipeline that attempts to simulate human chunking strategies by assigning prosodic-syntactic boundaries to input text. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. For Modern Arabic, there are no existing suitable resources with prosodic phrase boundaries annotated by Arabic linguistics experts. The Qur’an can be used as a reputable “gold standard” for phrasing in Arabic, because traditional editions include boundary mark-up to aid correct recitation, based on long-established traditions of Qur’anic Arabic linguistics developed to help believers read and understand the Qur’an. We can therefore harness the recitation mark-up in traditional Qur’an editions as phrase-break mark-up in a Boundary-Annotated Qur’an Corpus.
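The idea of reading recitation mark-up as phrase-break labels can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `|` boundary symbol, the placeholder tokens, and the function names are all hypothetical, and the real corpus uses Tajwid marks with its own tag set.

```python
from typing import List, Tuple

BOUNDARY_MARK = "|"  # hypothetical stand-in for a recitation stop mark


def label_words(tokens: List[str]) -> List[Tuple[str, str]]:
    """Turn a token stream with boundary marks into (word, label) pairs.

    A word directly followed by a boundary mark is labelled "break";
    every other word is labelled "non-break".
    """
    labelled: List[Tuple[str, str]] = []
    for tok in tokens:
        if tok == BOUNDARY_MARK:
            if labelled:  # ignore a mark that has no preceding word
                word, _ = labelled[-1]
                labelled[-1] = (word, "break")
        else:
            labelled.append((tok, "non-break"))
    return labelled


def strip_labels(labelled: List[Tuple[str, str]]) -> List[str]:
    """Remove the labels to obtain unannotated test input for a tagger."""
    return [word for word, _ in labelled]


# Placeholder tokens (hypothetical), with a mark after w2 and after w5.
tokens = ["w1", "w2", "|", "w3", "w4", "w5", "|"]
labelled = label_words(tokens)
test_input = strip_labels(labelled)
```

The labelled pairs would serve as training data for a tagger, while the stripped word list mirrors the unseen test input from which boundary annotations have been removed.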

Objectives

Methods

Conclusion