Abstract

Today, a large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, news bulletins, etc. In addition, we can effortlessly record and store audio data such as read, lecture, or impromptu speech on handheld devices. These data are rich in prosody and provide a plethora of voices to choose from, and their availability can significantly reduce the overhead of data preparation and enable rapid building of synthetic voices. However, a few problems are associated with readily using such data: (1) the audio files are generally long, and audio-transcription alignment is memory intensive; (2) precise corresponding transcriptions are unavailable; (3) often, no transcriptions are available at all; (4) the audio may contain disfluencies and non-speech noises, since it was not specifically recorded for building synthetic voices; and (5) automatic transcripts, when obtained, are not error free. Earlier works on long audio alignment addressing the first and second issues generally relied on reasonably accurate transcripts and mainly focused on (1) reducing manual intervention, (2) mispronunciation detection, and (3) segmentation error recovery. In this work, we use a large-vocabulary public-domain automatic speech recognition (ASR) system to obtain transcripts, followed by confidence measure-based data pruning; together, these address the five issues with found data while also ensuring the above three points. As a proof of concept, we build English voices using an audiobook (read speech) in a female voice from LibriVox and a lecture (spontaneous speech) in a male voice from Coursera, using both reference and hypothesis transcriptions, and evaluate them in terms of intelligibility and naturalness through a perceptual listening test on the Blizzard 2013 corpus.
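The abstract names confidence measure-based pruning of ASR output as the key step, and the section outline below points to word posterior probability and unit durational z-score as the confidence features. The exact pruning rule is not spelled out here, so the following is only a minimal sketch under assumed inputs: the Utterance container, the prune_utterances helper, and the thresholds are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical container for one ASR-decoded utterance: word-level posterior
# probabilities plus per-unit (e.g., phone) durations in seconds.
@dataclass
class Utterance:
    utt_id: str
    word_posteriors: List[float]             # confidence of each hypothesised word
    unit_durations: List[Tuple[str, float]]  # (unit label, duration)

def duration_zscores(utt: Utterance,
                     stats: Dict[str, Tuple[float, float]]) -> List[float]:
    """z-score of each unit's duration against corpus mean/std for that unit label."""
    zs = []
    for label, dur in utt.unit_durations:
        mu, sigma = stats.get(label, (dur, 1.0))
        zs.append((dur - mu) / sigma if sigma > 0 else 0.0)
    return zs

def prune_utterances(utts: List[Utterance],
                     stats: Dict[str, Tuple[float, float]],
                     min_posterior: float = 0.8,
                     max_abs_z: float = 3.0) -> List[Utterance]:
    """Keep only utterances with high ASR confidence and plausible unit
    durations; the thresholds are illustrative, not values from the paper."""
    kept = []
    for utt in utts:
        if not utt.word_posteriors:
            continue
        if mean(utt.word_posteriors) < min_posterior:
            continue  # likely transcription errors or noisy audio
        if any(abs(z) > max_abs_z for z in duration_zscores(utt, stats)):
            continue  # likely misalignment or disfluency
        kept.append(utt)
    return kept
```

In practice, the per-unit duration statistics would be estimated from forced alignments over the whole corpus, and the thresholds tuned on held-out data.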

Highlights

  • Unit selection speech synthesis is one of the techniques for synthesizing speech, where appropriate units from a database of natural speech are selected and concatenated [1,2,3]

  • Unlike corpora recorded specifically for synthesis, such as CMU ARCTIC, these data are rich in prosody and provide a plethora of voices to choose from, and their use can significantly ease the overhead of data preparation, allowing rapid building of general-purpose, natural-sounding synthetic voices

  • In [13], the authors quantified the number of insertions, substitutions, and deletions made by the volunteer who read the book “A Tramp Abroad” by Mark Twain and proposed a lightly supervised approach that accounts for these differences between the audio and text


Summary

Introduction

1.1 Motivation

Unit selection speech synthesis is one of the techniques for synthesizing speech, where appropriate units from a database of natural speech are selected and concatenated [1,2,3]. Unit selection synthesis can produce natural-sounding and expressive speech output given a large amount of data containing varied prosodic and spectral characteristics. As a result, it is used in several commercial text-to-speech (TTS) applications today.
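The standard formulation behind this description, as in the unit selection literature cited above, picks the candidate unit sequence that minimises a combination of target costs and join (concatenation) costs, typically via a Viterbi-style search. The sketch below illustrates that search only; the function names and cost callables are placeholders, not the specific system described in this paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A candidate unit is identified here simply by an index into the speech
# database; target_cost and join_cost are assumed to be supplied by the
# front end (spectral/prosodic distances). All names are illustrative.
def select_units(targets: Sequence[str],
                 candidates: Dict[str, List[int]],
                 target_cost: Callable[[str, int], float],
                 join_cost: Callable[[int, int], float]) -> List[int]:
    """Dynamic-programming search over candidate units: minimise the sum of
    target costs (how well a unit matches its specification) and join costs
    (how smoothly consecutive units concatenate)."""
    # Initialise with the candidates for the first target specification.
    prev: List[Tuple[float, List[int]]] = [
        (target_cost(targets[0], u), [u]) for u in candidates[targets[0]]
    ]
    for spec in targets[1:]:
        curr: List[Tuple[float, List[int]]] = []
        for u in candidates[spec]:
            # Best predecessor path once the join cost to unit u is added.
            best_cost, best_path = min(
                (c + join_cost(path[-1], u), path) for c, path in prev
            )
            curr.append((best_cost + target_cost(spec, u), best_path + [u]))
        prev = curr
    # Return the cheapest complete unit sequence.
    return min(prev)[1]
```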

Overhead of data preparation for building general-purpose synthetic voices
Problems with using found data for building synthetic voices
Acoustic model-to-audio alignment
Text-to-text alignment
Data preparation
ASR and TTS systems
Feature extraction
Acoustic feature extraction
Pre-clustering units
Prediction of phrase-break locations
Determination of units
Predicting the durations of units
Selection of units
Relevance of posterior probability and unit duration as confidence features
Computation of posterior probability and unit durational z-score
Experiments and evaluation
Conclusions