Abstract

This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, such as lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis and for training and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions across several domains and languages. The processing chain is composed of two main stages: the first integrates the relevant manual annotations into the speech recognition data, and the second further enriches the previous output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially used for automatic detection of punctuation marks and for capitalization recovery from speech data, it has more recently been used for characterizing disfluencies in speech. It has already been applied to Portuguese corpora from several domains, as well as to English and Spanish Broadcast News corpora.
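
As a rough sketch of how such a two-stage chain could be organized, assuming token-level alignment, the example below transfers reference annotations onto ASR tokens, attaches prosodic values, and stores everything in a single XML document; all names and fields are hypothetical, not the framework's actual API or schema.

    # Hypothetical sketch of the two-stage enrichment chain; Token fields,
    # function names, and the XML layout are illustrative only.
    import xml.etree.ElementTree as ET
    from dataclasses import dataclass

    @dataclass
    class Token:
        word: str
        start: float            # seconds, from the ASR time alignment
        end: float
        punct: str = ""         # punctuation transferred from the manual transcript
        cap: str = ""           # capitalized form, if different from `word`
        pitch: float = 0.0      # mean f0 over the token span
        energy: float = 0.0     # mean energy over the token span

    def integrate_reference(tokens, reference):
        """Stage 1: transfer punctuation and capitalization from the aligned
        manual transcript onto the ASR tokens (the alignment itself is omitted)."""
        for tok, ref in zip(tokens, reference):
            tok.punct = ref.get("punct", "")
            tok.cap = ref.get("cap", "")

    def add_prosody(tokens, f0, rms):
        """Stage 2: attach pitch and energy values computed from the speech signal."""
        for tok, p, e in zip(tokens, f0, rms):
            tok.pitch, tok.energy = p, e

    def to_xml(tokens):
        """Store all layers together in one self-contained XML document."""
        root = ET.Element("transcript")
        for t in tokens:
            ET.SubElement(root, "token", {k: str(v) for k, v in vars(t).items()})
        return ET.tostring(root, encoding="unicode")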

Highlights

  • Automatic speech recognition (ASR) systems are being applied to a vast number of speech sources, such as radio or TV broadcasts, interviews, and e-learning classes

  • Manual transcripts may include a wide range of additional information for a given speech region, such as speaker id, speaker gender, focus conditions, sections to be excluded from evaluation, segmentation information, punctuation marks, capitalization, metadata indicating the presence of foreign languages, and other phenomena such as disfluency marking (an illustrative record combining these layers is sketched after this list)

  • The proposed framework aims at producing self-contained datasets that provide the information given by the ASR system, all the required reference data, and other relevant information that can be computed from the speech signal
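
As a rough illustration of the annotation layers listed above, the following record shows one possible region-level representation; field names and values are hypothetical, not the actual transcript format.

    # Hypothetical region-level record combining the annotation layers listed
    # above; field names and values are illustrative, not the actual format.
    region = {
        "speaker_id": "spk_0042",
        "gender": "female",
        "focus_condition": "F0",                     # e.g. planned speech, clean background
        "exclude_from_eval": False,
        "segment": {"start": 12.34, "end": 17.89},   # seconds
        "words": [
            {"word": "uh", "disfluency": "filled_pause"},
            {"word": "lisbon", "cap": "Lisbon", "punct": ","},
            {"word": "portugal", "cap": "Portugal", "punct": ".", "foreign": False},
        ],
    }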

Summary

Introduction

Automatic speech recognition (ASR) systems are being applied to a vast number of speech sources, such as radio or TV broadcasts, interviews, and e-learning classes. Their transcripts can be further enriched with structural metadata by tasks such as speaker diarization, which consists of assigning the different parts of the speech to the corresponding speakers; sentence segmentation or sentence boundary detection; punctuation recovery; capitalization recovery; and disfluency detection and filtering. Such metadata extraction/annotation technologies have recently been receiving increasing attention [1,2,3] and demand multi-layered linguistic information. A Maximum Entropy (ME) based method is described in [8] for inserting punctuation marks into spontaneous conversational speech, where punctuation is treated as a tagging task and words are tagged with the appropriate punctuation mark. It covers three punctuation marks: comma, full stop, and question mark, and the best results on the ASR output are achieved by combining lexical and prosodic features.
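
As a rough illustration of that tagging formulation, the sketch below trains a multinomial logistic regression (an ME-style classifier) over simple lexical and prosodic features; the feature set, the pause-based prosodic cue, and the use of scikit-learn are assumptions for illustration, not the method of [8].

    # Hypothetical sketch: tag each word with the punctuation mark that follows it
    # (none, comma, full stop, question mark) using lexical and prosodic features.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(words, pauses, i):
        """Lexical context plus a simple prosodic cue (pause after word i)."""
        return {
            "w0": words[i],
            "w-1": words[i - 1] if i > 0 else "<s>",
            "w+1": words[i + 1] if i + 1 < len(words) else "</s>",
            "pause_after": pauses[i],     # seconds of silence following word i
        }

    # Toy training data: words, pause durations, and the punctuation tag per word.
    words  = ["hello", "world", "how", "are", "you"]
    pauses = [0.05, 0.60, 0.02, 0.03, 0.80]
    tags   = ["none", "full_stop", "none", "none", "question_mark"]

    X = [features(words, pauses, i) for i in range(len(words))]
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X), tags)

    # Tagging new ASR output reuses the same feature extraction.
    print(clf.predict(vec.transform([features(words, pauses, 1)])))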

Current scope and data
Integrating reference data in the ASR output
Capitalization alignment
Punctuation Alignment
Disfluencies and other events
Adding Phone Information
Marking the syllable boundaries and stress
Extracting Pitch and Energy
Producing the final XML file
Findings
Conclusion