Query language for access to speech corpora

Andreas Mengel, Ulrich Heid

doi:10.1121/1.425122

Abstract

Typically, speech corpora are designed for the study of speech, its patterns, and their influences. Corpora for synthesis support the extraction and reuse of units in concatenative speech synthesis systems. Recent approaches use speech segments of variable length to enhance naturalness. The development of spoken language dialog systems (SLDSs) and similar applications requires dialog corpora annotated with several kinds of information, such as prosodic labeling, grammatical annotations, and dialog acts. For constructing such multilevel corpora and, particularly, for retrieving useful information from them, three aspects are crucial: a theory for the description of the corpora, an effective markup, and methods for specifying and efficiently retrieving relevant portions of the data. MATE (EU Telematics Project LE4-8370) proposes standards for an integrated and consistent multilevel annotation based on the existing TEI (Text Encoding Initiative) standard. For access to massively marked-up speech data, a query language and a query engine have been developed. Together they allow the retrieval of any combination of data with linguistic annotations from different levels. The presentation will describe and demonstrate the query language.
