HMM-based System Research Articles

While global key and chord estimation for both popular and classical music recordings have received a lot of attention, little research has been devoted to estimating the local key for classical music. Partly, this may be due to its inherent ambiguity and subjectivity, which makes annotating local keys a challenging task. In this article, we approach local key estimation with a cross-version dataset comprising nine performances (versions) of Schubert's song cycle Winterreise annotated by three different music theory experts. We consider two baseline methods that are representative for common types of signal processing algorithms: an HMM-based system and a CNN-based approach. For both models, we employ a similar training procedure including the optimization of hyperparameters on a validation split. We systematically evaluate the model predictions and provide musical explanations for key confusions. As a central contribution, we explore how different training-test splits affect the models' efficacy. Splitting along the song axis, we find that both methods perform similarly well. Splitting along the version axis, we obtain substantially higher accuracies, especially for the CNN, which seems to effectively learn the harmonic progressions of the songs (“cover song effect”) and successfully generalizes to unseen versions. We further discuss the results for several songs in detail and assess our results from the perspective of multiple annotators. This cross-annotator study reveals that a substantial part of the systems' errors coincides with annotator disagreement and an even larger part can be traced back to musically explainable relationships among different keys.

Read full abstract

High quality expressive speech synthesis has been a long-standing goal towards natural human-computer interaction. Generating a talking head which is both realistic and expressive appears to be a considerable challenge, due to both the high complexity in the acoustic and visual streams and the large non-discrete number of emotional states we would like the talking head to be able to express. In order to cover all the desired emotions, a significant amount of data is required, which poses an additional time-consuming data collection challenge. In this paper we attempt to address the aforementioned problems in an audio-visual context. Towards this goal, we propose two deep neural network (DNN) architectures for Video-realistic Expressive Audio-Visual Text-To-Speech synthesis (EAVTTS) and evaluate them by comparing them directly both to traditional hidden Markov model (HMM) based EAVTTS, as well as a concatenative unit selection EAVTTS approach, both on the realism and the expressiveness of the generated talking head. Next, we investigate adaptation and interpolation techniques to address the problem of covering the large emotional space. We use HMM interpolation in order to generate different levels of intensity for an emotion, as well as investigate whether it is possible to generate speech with intermediate speaking styles between two emotions. In addition, we employ HMM adaptation to adapt an HMM-based system to another emotion using only a limited amount of adaptation data from the target emotion. We performed an extensive experimental evaluation on a medium sized audio-visual corpus covering three emotions, namely anger, sadness and happiness, as well as neutral reading style. Our results show that DNN-based models outperform HMMs and unit selection on both the realism and expressiveness of the generated talking heads, while in terms of adaptation we can successfully adapt an audio-visual HMM set trained on a neutral speaking style database to a target emotion. Finally, we show that HMM interpolation can indeed generate different levels of intensity for EAVTTS by interpolating an emotion with the neutral reading style, as well as in some cases, generate audio-visual speech with intermediate expressions between two emotions.

Read full abstract

HMM-based System Research Articles

Related Topics

Articles published on HMM-based System

Voice Assignment in Vocal Quartets Using Deep Learning Models Based on Pitch Salience

Machine Transliteration Using SVM and HMM

Local Key Estimation in Music Recordings: A Case Study Across Songs, Versions, and Annotators

Facial expression recognition using best tree RD-LGP encoded features and HMM

Hindi named entity recognition using system combination

A segmental HMM based trajectory classification using genetic algorithm

LSTM Deep Neural Networks Postfiltering for Enhancing Synthetic Voices

Discriminative keyword spotting using triphones information and N-best search

Video-realistic expressive audio-visual speech synthesis for the Greek language

Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016

Helpful Statistics in Recognizing Basic Arabic Phonemes

Global Variance in Speech Synthesis With Linear Dynamical Models

Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings

An efficient hardware architecture for HMM-based TTS system

Acoustic Vowel Analysis in a Mexican Spanish HMM-based Speech Synthesis

A Text Categorization Model Based on Hidden Markov Models

Classification of Multiple Seizure-Like States in Three Different Rodent Models of Epileptogenesis

Two-pass search strategy using accumulated band energy histogram for HMM-based identification of perceptually identical music

A HMM-Based System To Diacritize Arabic Text

Unsupervised Intralingual and Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis Using Two-Pass Decision Tree Construction

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

HMM-based System Research Articles

Related Topics

Articles published on HMM-based System

Voice Assignment in Vocal Quartets Using Deep Learning Models Based on Pitch Salience

Machine Transliteration Using SVM and HMM

Local Key Estimation in Music Recordings: A Case Study Across Songs, Versions, and Annotators

Facial expression recognition using best tree RD-LGP encoded features and HMM

Hindi named entity recognition using system combination

A segmental HMM based trajectory classification using genetic algorithm

LSTM Deep Neural Networks Postfiltering for Enhancing Synthetic Voices

Discriminative keyword spotting using triphones information and N-best search

Video-realistic expressive audio-visual speech synthesis for the Greek language

Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016

Helpful Statistics in Recognizing Basic Arabic Phonemes

Global Variance in Speech Synthesis With Linear Dynamical Models

Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings

An efficient hardware architecture for HMM-based TTS system

Acoustic Vowel Analysis in a Mexican Spanish HMM-based Speech Synthesis

A Text Categorization Model Based on Hidden Markov Models

Classification of Multiple Seizure-Like States in Three Different Rodent Models of Epileptogenesis

Two-pass search strategy using accumulated band energy histogram for HMM-based identification of perceptually identical music

A HMM-Based System To Diacritize Arabic Text

Unsupervised Intralingual and Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis Using Two-Pass Decision Tree Construction