Short Utterances Research Articles

Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. Systems based on i-vectors and probabilistic linear discriminant analysis (PLDA) have become the standard in speaker verification applications, but they are less effective with short utterances. In this paper, we first compare two state-of-the-art universal background model (UBM) training methods for i-vector modeling using full-length and short-utterance evaluation tasks. The two methods are Gaussian mixture model (GMM) based (denoted I-vector_GMM) and deep neural network (DNN) based (denoted I-vector_DNN). The results indicate that the I-vector_DNN system outperforms the I-vector_GMM system across durations (from full length to 5 s). However, the performance of both systems degrades significantly as the utterance duration decreases. To address this issue, we propose two novel nonlinear mapping methods which train DNN models to map the i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vectors can restore missing information and reduce the variance of the original short-utterance i-vectors. Both proposed methods model the joint representation of short- and long-utterance i-vectors: the first method first trains an autoencoder on concatenated short- and long-utterance i-vectors and then uses the pre-trained weights to initialize a supervised regression model from the short to the long version; the second method jointly trains the supervised regression model with an autoencoder that reconstructs the short-utterance i-vector itself. Experimental results on the NIST SRE 2010 dataset show that both methods provide significant improvement, yielding a 24.51% relative improvement in equal error rate (EER) over a baseline system. To learn a better joint representation, we further investigate the effect of a deep encoder with residual blocks, and the results indicate that the residual network can further improve the EER of a baseline system by up to 26.47%. Moreover, to improve the mapping of a short-utterance i-vector to its long version, an additional vector representing the average of the phoneme posteriors across frames is added to the input, which results in a 28.43% improvement. When the best-validated SRE10 models are further tested on the Speakers in the Wild (SITW) dataset, the methods yield a 23.12% improvement on arbitrary-duration (1–5 s) short-utterance conditions.
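The second mapping method in the abstract above (a regression from short to long i-vectors trained jointly with an autoencoder that reconstructs the short i-vector) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the single tanh encoder layer, the two linear output heads, the weighting factor `alpha`, the plain gradient-descent loop, and the synthetic vectors that stand in for real i-vectors.

```python
import numpy as np

def init_params(dim, hidden, rng):
    """Small random weights for a shared encoder and two linear heads."""
    s = 0.1
    return {
        "We": s * rng.standard_normal((dim, hidden)), "be": np.zeros(hidden),
        # Regression head: predicts the long-utterance vector.
        "Wr": s * rng.standard_normal((hidden, dim)), "br": np.zeros(dim),
        # Autoencoder head: reconstructs the short-utterance input.
        "Wa": s * rng.standard_normal((hidden, dim)), "ba": np.zeros(dim),
    }

def forward(p, x_short):
    """Return the hidden code, the predicted long vector, and the reconstruction."""
    h = np.tanh(x_short @ p["We"] + p["be"])
    return h, h @ p["Wr"] + p["br"], h @ p["Wa"] + p["ba"]

def loss_and_grads(p, x_short, x_long, alpha):
    """Joint loss = MSE(long prediction) + alpha * MSE(short reconstruction)."""
    h, y_long, y_short = forward(p, x_short)
    n = x_short.size
    loss = np.mean((y_long - x_long) ** 2) + alpha * np.mean((y_short - x_short) ** 2)
    d_long = 2.0 * (y_long - x_long) / n
    d_short = 2.0 * alpha * (y_short - x_short) / n
    # Both heads send gradient into the shared encoder.
    d_z = (d_long @ p["Wr"].T + d_short @ p["Wa"].T) * (1.0 - h ** 2)
    grads = {
        "Wr": h.T @ d_long, "br": d_long.sum(axis=0),
        "Wa": h.T @ d_short, "ba": d_short.sum(axis=0),
        "We": x_short.T @ d_z, "be": d_z.sum(axis=0),
    }
    return loss, grads

def train(x_short, x_long, hidden=16, alpha=0.5, lr=0.1, steps=300, seed=0):
    """Plain gradient descent on the joint loss; returns parameters and loss history."""
    rng = np.random.default_rng(seed)
    p = init_params(x_short.shape[1], hidden, rng)
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(p, x_short, x_long, alpha)
        losses.append(loss)
        for name in p:
            p[name] -= lr * grads[name]
    return p, losses

# Synthetic stand-ins (NOT real i-vectors): "short" = "long" plus extra noise.
rng = np.random.default_rng(0)
x_long = rng.standard_normal((64, 8))
x_short = x_long + 0.5 * rng.standard_normal((64, 8))
params, history = train(x_short, x_long)
mapped = forward(params, x_short)[1]  # mapped short-utterance vectors
```

In this sketch the reconstruction term acts only as a regularizer on the shared encoder; at test time just the regression head's output (`mapped`) would be passed on to the scoring back end.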

The exaggerated intonation and special rhythmic properties of infant-directed speech (IDS) have been hypothesized to attract infants’ attention to the speech stream. However, little work has actually connected the properties of IDS to models of attentional processing or perceptual learning. A number of such attention models suggest that surprising or novel perceptual inputs attract attention, where novelty can be operationalized as the statistical (un)predictability of the stimulus in the given context. Since prosodic patterns such as F0 contours are accessible to young infants, who are also known to be adept statistical learners, the present paper investigates the hypothesis that F0 contours in IDS are less predictable than those in adult-directed speech (ADS), given previous exposure to both speaking styles. Such lower predictability could tap into listeners’ basic attentional mechanisms, much as the relative probabilities of other linguistic patterns are known to modulate attentional processing in infants and adults. Computational modeling analyses with naturalistic IDS and ADS speech from matched speakers and contexts show that IDS intonation has lower overall temporal predictability even when the F0 contours of both speaking styles are normalized to have equal means and variances. A closer analysis reveals that IDS intonation tends to be less predictable at the end of short utterances, whereas ADS exhibits more stable average predictability across the full extent of the utterances. The difference between IDS and ADS persists even when the proportion of IDS and ADS exposure is varied substantially, simulating the different relative amounts of IDS heard in different family and cultural environments. Exposure to IDS is also found to be more efficient for predicting ADS intonation contours in new utterances than exposure to an equal amount of ADS. This indicates that the more variable prosodic contours of IDS also generalize to ADS, and may therefore enhance prosodic learning in infancy. Overall, the study suggests that one reason behind infants’ preference for IDS could be its higher information value at the prosodic level, as measured by the amount of surprisal in the F0 contours. This provides the first formal link between the properties of IDS and models of attentional processing and statistical learning in the brain. However, this finding does not rule out the possibility that other differences between IDS and ADS also play a role.
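To make "surprisal in the F0 contours" concrete, the sketch below scores a quantized contour under a simple first-order (bigram) model with add-one smoothing. This is a deliberately simplified stand-in and not the model used in the study: the quantization scheme, the bigram assumption, the smoothing, and all function names are assumptions introduced here, and the toy sequences are not real speech data.

```python
import math
from collections import Counter

def quantize(contour, n_bins, lo, hi):
    """Map each F0 value to an integer bin index in [0, n_bins - 1]."""
    width = (hi - lo) / n_bins
    return [min(n_bins - 1, max(0, int((v - lo) / width))) for v in contour]

def fit_bigram(sequences, n_bins):
    """Return prob(a, b) = P(next bin = b | current bin = a), add-one smoothed."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1

    def prob(a, b):
        return (pair_counts[(a, b)] + 1) / (context_counts[a] + n_bins)

    return prob

def mean_surprisal(seq, prob):
    """Average surprisal, in bits, over the transitions of one contour."""
    if len(seq) < 2:
        raise ValueError("need at least two samples to score a transition")
    bits = [-math.log2(prob(a, b)) for a, b in zip(seq, seq[1:])]
    return sum(bits) / len(bits)

# Toy "exposure" data: two flat contours, already quantized into 5 bins.
exposure = [[2] * 10, [3] * 10]
model = fit_bigram(exposure, n_bins=5)

flat = [2, 2, 2, 2]    # resembles the exposure data
jumpy = [2, 4, 2, 4]   # transitions never seen during exposure
flat_bits = mean_surprisal(flat, model)
jumpy_bits = mean_surprisal(jumpy, model)
```

Under this toy model the contour whose transitions were never seen during exposure receives the higher mean surprisal, which is the sense in which a less predictable contour carries more information.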

Articles published on Short Utterances

An Analysis of the Short Utterance Problem for Speaker Characterization

Back From the Future: Nonlinear Anticipation in Adults' and Children's Speech.

Self-attention based speaker recognition using Cluster-Range Loss

End-to-end DNN based text-independent speaker recognition for long and short utterances

Roar of a Champion: Loudness and Voice Pitch Predict Perceived Fighting Ability but Not Success in MMA Fighters

Is it easier to recognize an emotion from its specific facial expression or from prosody? [Czy łatwiej jest rozpoznać emocję na podstawie swoistej ekspresji mimicznej czy prozodii?]

Identity Vector Extraction Using Shared Mixture of PLDA for Short‐Time Speaker Recognition

Quality measures for speaker verification with short utterances

Children Probably Store Short Rather Than Frequent or Predictable Chunks: Quantitative Evidence From a Corpus Study.

Short Utterance Based Speech Language Identification in Intelligent Vehicles With Time-Scale Modifications and Deep Bottleneck Features

Management of velopharyngeal insufficiency: The evolution of care and the current state of the art

Ways of Forming Anti-Proverbs in the Turkish Language [СПОСОБЫ ОБРАЗОВАНИЯ АНТИПОСЛОВИЦ В ТУРЕЦКОМ ЯЗЫКЕ]

Generalized Variability Model for Speaker Verification

Deep neural network based i-vector mapping for speaker verification using short utterances

Judgements of a speaker's personality are correlated across differing content and stimulus type.

Can listeners hear the difference between children with normal hearing and children with a hearing impairment?

New approach for short utterance speaker identification

GMM and CNN Hybrid Method for Short Utterance Speaker Recognition

Towards understanding speaker discrimination abilities in humans and machines for text-independent short utterances of different speech styles.

Is infant-directed speech interesting because it is surprising? – Linking properties of IDS to statistical learning and attention at the prosodic level
