Vocal Emotion Research Articles

Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble), or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh), or whi (whisper). Linear discriminant analysis (LDA) was selected as the baseline classifier because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN) and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented by extra voice quality, phonetic, and prosodic features, each targeting perceptual features of one or more of the vocalization types. Three web data augmentation and transfer learning methods were tested: pre-training of network weights on a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores was also tested, as were weighted and unweighted samplers. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. On the CRIED dataset, the CNSA achieved the best unweighted-average recall (UAR) relative to previous studies. On our own dataset, both neural networks significantly outperformed LDA in classification accuracy, weighted F1, and macro F1; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining the features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type exhibiting the overlapping features were selected; their spectrograms are presented and discussed with respect to the type-discriminative acoustic features selected by the various algorithms. MFCC, log Mel frequency band energy, LSP frequency, and F1 are found to be the most important spectral envelope features; F0 is found to be the most important prosodic feature.
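To make two of the quantities named in this abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common definitions: UAR as the plain mean of per-class recalls, and the Fisher score of a single feature as between-class scatter divided by within-class scatter. The abstract does not say which Fisher score variant was used, and the function names and toy values below are illustrative only, not data from the study.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def accuracy(y_true, y_pred):
    """Fraction of samples labeled correctly (dominated by the larger classes)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fisher_score(values, labels):
    """Between-class scatter of one feature divided by its within-class scatter."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for vs in groups.values():
        class_mean = sum(vs) / len(vs)
        between += len(vs) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in vs)
    return between / within if within > 0 else float("inf")


# Toy, made-up labels: an imbalanced set and a majority-class predictor.
y_true = ["cry"] * 8 + ["lau"] * 2
y_pred = ["cry"] * 10

# Toy, made-up feature values for two classes "a" and "b".
labels = ["a", "a", "a", "b", "b", "b"]
separating_feature = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
mixed_feature = [1.0, 12.0, 3.0, 11.0, 2.0, 13.0]

if __name__ == "__main__":
    print("accuracy:", accuracy(y_true, y_pred))
    print("UAR:", unweighted_average_recall(y_true, y_pred))
    print("Fisher (separating):", fisher_score(separating_feature, labels))
    print("Fisher (mixed):", fisher_score(mixed_feature, labels))
```

In this toy case the majority-class predictor scores well on accuracy but poorly on UAR, which is why UAR is a common choice for imbalanced vocalization classes; likewise, a feature whose class means are well separated receives a much higher Fisher score than one whose values are interleaved across classes.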

Normally-hearing (NH) listeners rely more on prosodic cues than on lexical-semantic cues for emotion perception in speech. In everyday spoken communication, the ability to decipher conflicting information between prosodic and lexical-semantic cues to emotion can be important: for example, in identifying sarcasm or irony. Speech degradation in cochlear implants (CIs) can be sufficiently overcome to identify lexical-semantic cues, but the distortion of voice pitch cues makes it particularly challenging to hear prosody with CIs. The purpose of this study was to examine changes in relative reliance on prosodic and lexical-semantic cues in NH adults listening to spectrally degraded speech and in adult CI users. We hypothesized that, compared with NH counterparts, CI users would show increased reliance on lexical-semantic cues and reduced reliance on prosodic cues for emotion perception. We predicted that NH listeners would show a similar pattern when listening to CI-simulated versions of emotional speech. Sixteen NH adults and eight postlingually deafened adult CI users participated in the study. Sentences were created to convey five lexical-semantic emotions (angry, happy, neutral, sad, and scared), with five sentences expressing each category of emotion. Each of these 25 sentences was then recorded in each of the five prosodic emotions (angry, happy, neutral, sad, and scared) by two adult female talkers. The resulting stimulus set included 125 recordings (25 sentences × 5 prosodic emotions) per talker, of which 25 were congruent (consistent lexical-semantic and prosodic cues to emotion) and the remaining 100 were incongruent (conflicting lexical-semantic and prosodic cues to emotion). The recordings were processed to have three levels of spectral degradation: full-spectrum, and CI-simulated (noise-vocoded) with either 8 or 16 channels of spectral information. Twenty-five recordings (one sentence per lexical-semantic emotion recorded in all five prosodies) were used for a practice run in the full-spectrum condition. The remaining 100 recordings were used as test stimuli. For each talker and condition of spectral degradation, listeners indicated the emotion associated with each recording in a single-interval, five-alternative forced-choice task. The responses were scored as proportion correct, where "correct" responses corresponded to the lexical-semantic emotion. CI users heard only the full-spectrum condition. The results showed a significant interaction between hearing status (NH, CI) and congruency in identifying the lexical-semantic emotion associated with the stimuli. This interaction was as predicted: CI users showed increased reliance on lexical-semantic cues in the incongruent conditions, while NH listeners showed increased reliance on prosodic cues in the incongruent conditions. Also as predicted, NH listeners showed increased reliance on lexical-semantic cues to emotion when the stimuli were spectrally degraded. The present study confirmed previous findings of prosodic dominance for emotion perception by NH listeners in the full-spectrum condition. Further, novel findings with CI patients and with NH listeners in the CI-simulated conditions showed reduced reliance on prosodic cues and increased reliance on lexical-semantic cues to emotion. These results have implications for CI listeners' ability to perceive conflicts between prosodic and lexical-semantic cues, with repercussions for their identification of sarcasm and humor. Understanding instances of sarcasm or humor can affect a person's ability to develop relationships, follow conversations, understand a speaker's vocal emotion and intended message, follow jokes, and communicate in everyday life.
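The stimulus arithmetic and scoring rule described in this abstract can be checked with a short sketch. It is illustrative only and rests on stated assumptions: the tuple layout, the function name `proportion_lexical`, and the choice of sentence index 0 as the practice sentence are hypothetical, and the simulated listener is made up rather than taken from the study.

```python
from itertools import product

EMOTIONS = ["angry", "happy", "neutral", "sad", "scared"]
SENTENCES_PER_EMOTION = 5

# One talker's stimulus set as (lexical emotion, sentence index, prosodic emotion).
stimuli = list(product(EMOTIONS, range(SENTENCES_PER_EMOTION), EMOTIONS))

# Congruent items carry the same emotion in words and in prosody.
congruent = [s for s in stimuli if s[0] == s[2]]
incongruent = [s for s in stimuli if s[0] != s[2]]

# Practice run: one sentence per lexical emotion in all five prosodies.
# Using sentence index 0 for practice is an assumption made for illustration.
practice = [s for s in stimuli if s[1] == 0]
test_items = [s for s in stimuli if s[1] != 0]


def proportion_lexical(responses, items):
    """Proportion of responses matching the lexical-semantic emotion of each item."""
    return sum(r == item[0] for r, item in zip(responses, items)) / len(items)


# A hypothetical listener who always answers with the prosodic emotion.
prosody_follower = [item[2] for item in test_items]

if __name__ == "__main__":
    print(len(stimuli), len(congruent), len(incongruent))
    print(len(practice), len(test_items))
    print(proportion_lexical(prosody_follower, test_items))
```

Because responses are scored against the lexical-semantic emotion, a listener who relies purely on prosody is scored correct only on the congruent test items, so a lower score on incongruent items indicates stronger reliance on prosody.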

Articles published on Vocal Emotion

Parameter-Specific Morphing Reveals Contributions of Timbre to the Perception of Vocal Emotions in Cochlear Implant Users.

Functional patterns of neural activation during vocal emotion recognition in youth with and without refractory epilepsy

Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody

Vocal emotion adaptation aftereffects within and across speaker genders: Roles of timbre and fundamental frequency

Extraction and Utilization of Excitation Information of Speech: A Review

Vocal communication across cultures: theoretical and methodological issues.

Emotion Recognition of Foreign Language Teachers in College English Classroom Teaching.

Associations between vocal emotion recognition and socio-emotional adjustment in children.

Impaired emotion perception and categorization in semantic aphasia

Age and sex effects in emotional prosody processing revealed in infants' mismatch responses but not in preferential looking time

Attention to voices is increased in non-clinical auditory verbal hallucinations irrespective of salience

Carrots or sticks in debt collection services? A voice metrics and text analysis of debt collection calls

Auditory deviance detection and involuntary attention allocation in occupational burnout-A follow-up study.

Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations

Age-Related Changes in Voice Emotion Recognition by Postlingually Deafened Listeners With Cochlear Implants.

The Neural Processing of Vocal Emotion After Hearing Reconstruction in Prelingual Deaf Children: A Functional Near-Infrared Spectroscopy Brain Imaging Study.

Autism, music and Alexithymia: A musical intervention to enhance emotion recognition in adolescents with ASD

Weighting of Prosodic and Lexical-Semantic Cues for Emotion Identification in Spectrally Degraded Speech and With Cochlear Implants.

The perceived salience of vocal emotions is dampened in non-clinical auditory verbal hallucinations

Mothers' discourse during shared reading of books relating to ‘positive’ and ‘negative’ emotions in different genres

