Abstract
This paper describes a framework that extends automatic speech transcripts to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources such as lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis and for training and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages: the first integrates the relevant manual annotations into the speech recognition data, and the second further enriches the previous output to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially put to use for the automatic detection of punctuation marks and for capitalization recovery from speech data, it has more recently been used to study and characterize disfluencies in speech. It has already been applied to Portuguese corpora from several domains, as well as to English and Spanish Broadcast News corpora.
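The two-stage processing chain can be pictured with a minimal sketch. All names below (Token, stage1_merge_reference, stage2_add_prosody) are hypothetical illustrations; the paper describes the stages, not an API:

```python
# Minimal sketch of the two-stage enrichment chain: stage 1 merges manual
# annotations into the ASR output; stage 2 adds prosodic information
# computed from the speech signal. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Token:
    word: str                 # ASR word hypothesis
    start: float              # start time (seconds)
    end: float                # end time (seconds)
    reference: dict = field(default_factory=dict)  # stage 1 output
    prosody: dict = field(default_factory=dict)    # stage 2 output

def stage1_merge_reference(tokens, annotations):
    """Attach manual annotations (punctuation, capitalization, speaker id,
    ...) to time-aligned ASR tokens; keyed by time span for brevity."""
    for tok in tokens:
        tok.reference = annotations.get((tok.start, tok.end), {})
    return tokens

def stage2_add_prosody(tokens, pitch_mean, energy_mean):
    """Enrich tokens with prosodic features; pitch_mean and energy_mean
    stand in for real signal-processing code over a (start, end) span."""
    for tok in tokens:
        tok.prosody = {"pitch": pitch_mean(tok.start, tok.end),
                       "energy": energy_mean(tok.start, tok.end),
                       "duration": tok.end - tok.start}
    return tokens

# Toy usage with constant placeholder signal features:
toks = [Token("hello", 0.0, 0.4), Token("world", 0.4, 0.9)]
toks = stage1_merge_reference(toks, {(0.0, 0.4): {"cap": "Hello"}})
toks = stage2_add_prosody(toks, lambda s, e: 180.0, lambda s, e: 0.6)
```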
Highlights
Automatic speech recognition (ASR) systems are being applied to a vast number of speech sources, such as radio or TV broadcasts, interviews, and e-learning classes
Manual transcripts may include a wide range of additional information for a given speech region, such as speaker id, speaker gender, focus conditions, sections to be excluded from evaluation, segmentation information, punctuation marks, capitalization, metadata indicating the presence of foreign languages, and markers for other phenomena such as disfluencies
The proposed framework aims at producing self-contained datasets that combine the information given by the ASR system with all the required reference data and other relevant information that can be computed from the speech signal (a hypothetical record layout is sketched below)
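As a concrete illustration, a single self-contained entry might bundle the ASR hypothesis with the reference and signal-derived layers listed above. The schema and field names below are hypothetical; the paper does not prescribe a particular format:

```python
# Hypothetical self-contained record for one speech segment; the schema is
# illustrative only, combining the annotation layers named in the highlights.
segment = {
    "speaker_id": "spk-01",
    "speaker_gender": "female",
    "focus_condition": "F0",          # e.g., a Broadcast News focus label
    "exclude_from_eval": False,
    "asr": {"words": ["good", "morning"],
            "times": [(0.00, 0.31), (0.31, 0.74)]},
    "reference": {"text": "Good morning.",
                  "punctuation": ["", "."],
                  "capitalization": ["Good", "morning"]},
    "prosody": {"pitch_mean": 182.4, "energy_mean": 0.62},
    "disfluencies": [],
}
```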
Summary
Automatic speech recognition (ASR) systems are being applied to a vast number of speech sources, such as radio or TV broadcasts, interviews, and e-learning classes. The raw ASR output can be enriched with structural metadata by a number of tasks, such as: speaker diarization, which consists of assigning the different parts of the speech to the corresponding speakers; sentence segmentation or sentence boundary detection; punctuation recovery; capitalization recovery; and disfluency detection and filtering. Such metadata extraction/annotation technologies have recently been receiving increasing attention [1,2,3], and demand multi-layered linguistic information. A Maximum Entropy (ME) based method is described in [8] for inserting punctuation marks into spontaneous conversational speech, where punctuation is treated as a tagging task and words are tagged with the appropriate punctuation mark. It covers three punctuation marks: comma, full stop, and question mark; the best results on the ASR output are achieved by combining lexical and prosodic features.
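The tagging formulation of punctuation recovery can be sketched as follows. This is a minimal illustration, not the method of [8]: the feature set, the toy data, and the use of scikit-learn's LogisticRegression (a log-linear model of the same family as Maximum Entropy classifiers) are all assumptions.

```python
# Minimal sketch of punctuation recovery as a tagging task: each word is
# tagged with the punctuation that follows it (none, comma, full stop, or
# question mark), using lexical context plus a simple prosodic cue.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression  # ME-style log-linear model
from sklearn.pipeline import make_pipeline

def features(words, pauses, i):
    """Lexical context combined with pause duration after the word."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<s>",
        "next_word": words[i + 1] if i + 1 < len(words) else "</s>",
        "long_pause": pauses[i] > 0.3,   # pause after the word, in seconds
    }

# Toy training data: one transcribed utterance with pause durations.
words  = ["hello", "how", "are", "you", "fine", "thanks"]
pauses = [0.40, 0.05, 0.04, 0.55, 0.10, 0.80]
tags   = [",", "", "", "?", "", "."]     # punctuation following each word

X = [features(words, pauses, i) for i in range(len(words))]
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, tags)
print(model.predict([features(words, pauses, 3)]))  # predicted tag for "you"
```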