Abstract

The stylization of pitch contours is a primary task in speech prosody for developing linguistic models. Stylization is performed by either statistical learning or statistical analysis. Recent statistical learning models require large amounts of training data and rely on complex machine learning algorithms, whereas statistical analysis methods stylize based on the shape of the contour and require further processing to capture the speaker’s voice intonation. The objective of this paper is to devise a low-complexity transcription algorithm that stylizes the pitch contour based on a speaker’s voice intonation. To this end, we propose using pitch marks as the subset of points for stylizing the pitch contour. Pitch marks are the instants of glottal closure in a speech waveform and capture characteristics of the speech uttered by a speaker. The selected subset can interpolate the shape of the pitch contour and acts as a template that captures the intonation of a speaker’s voice, which can be used in designing speech synthesis and speech morphing applications. The algorithm balances the quality of the stylized curve against its cost in terms of the number of data points used. We evaluate the proposed algorithm using the mean square error and the number of lines used to fit the pitch contour. Furthermore, we compare it with existing stylization algorithms on the LibriSpeech ASR corpus.
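As a rough illustration of the two evaluation quantities named above (mean square error and number of line segments), the following sketch fits a piecewise-linear curve through a chosen subset of contour samples. It is not the paper's algorithm: the function name `stylize`, the synthetic sinusoidal contour, and the evenly spaced anchor indices are all assumptions made for illustration, with the anchors standing in for positions that would come from real pitch marks.

```python
import math


def stylize(times, f0, anchor_idx):
    """Piecewise-linear stylization through a subset of contour samples.

    Hypothetical helper: `anchor_idx` stands in for indices derived from
    pitch marks. `times` must be strictly increasing.
    Returns (stylized contour, mean square error, number of line segments).
    """
    # Always keep the endpoints so the whole contour is covered.
    idx = sorted(set(anchor_idx) | {0, len(times) - 1})
    stylized = list(f0)
    # Linearly interpolate between each pair of consecutive anchors.
    for a, b in zip(idx, idx[1:]):
        for i in range(a, b + 1):
            w = (times[i] - times[a]) / (times[b] - times[a])
            stylized[i] = f0[a] + w * (f0[b] - f0[a])
    mse = sum((y - s) ** 2 for y, s in zip(f0, stylized)) / len(f0)
    return stylized, mse, len(idx) - 1


# Synthetic contour (assumption): 1 s of F0 oscillating around 120 Hz.
times = [i / 100 for i in range(101)]
f0 = [120.0 + 30.0 * math.sin(2 * math.pi * t) for t in times]

# Evenly spaced anchors (assumption), one every 10 samples.
anchors = list(range(0, 101, 10))
stylized, mse, n_lines = stylize(times, f0, anchors)
print(f"segments={n_lines}, mse={mse:.3f}")
```

Using fewer anchors lowers the segment count but raises the error, which is the quality-versus-cost trade-off the abstract describes.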

Introduction

Speech prosody represents the pitch contour of a voice signal and can be used to construct linguistic models and to study their interaction with other linguistic domains, such as morphing and speech transformation [1]. Pitch contours are used for learning generative models for text-to-speech synthesis applications [2], language identification [3], emotion prediction, and forensics research [4]. To remove the variability in the pitch contour, stylization is used to encode the contour into meaningful labels [6] or templates [7] for speech synthesis applications. Stylization of the pitch contour uses either statistical learning or statistical analysis models. The pitch contour is decomposed into a set of previously defined functions such as polynomial [9],
