Mapping acoustic characteristics of emotional prosody in Mandarin disyllabic words: A machine-learning approach

Abstract

This study conducted an acoustic-prosodic mapping analysis of emotional prosody in Mandarin Chinese. It used a validated audiometry corpus of 450 disyllabic words, produced by a single female speaker in five basic emotion categories: angry, sad, happy, fearful, and neutral. A machine-learning approach was adopted to map the key acoustic-prosodic features of Mandarin emotional vocalization. The results revealed a distinctive acoustic profile for each emotion, with variations in fundamental frequency (F0), intensity, speaking rate, and voice quality. Emotional utterances consistently exhibited higher mean F0 values than neutral ones. Fear showed the highest F0 peak. Angry and happy utterances showed greater vocal intensity and a faster speaking rate than fearful and sad ones. Anger was associated with a creaky voice quality, whereas sadness corresponded to a breathier voice quality. The current findings are limited by the use of a single-speaker corpus. Ongoing efforts aim to expand the corpus with more speakers in order to test the generalizability and scalability of the analysis approach in subsequent investigations.
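The abstract does not name the learning algorithm or report feature values, so the following is only a minimal sketch of what mapping acoustic features to emotion categories could look like. It assumes a nearest-centroid classifier on z-scored features, and every number in `TOY_ROWS` is made up for illustration rather than taken from the corpus. The per-emotion centroids play the role of the "acoustic profiles" described above.

```python
from statistics import mean, pstdev

# Toy, made-up rows (NOT corpus values):
# (emotion, mean F0 in Hz, intensity in dB, speaking rate in syllables/s)
TOY_ROWS = [
    ("neutral", 200.0, 62.0, 4.0),
    ("neutral", 205.0, 63.0, 4.1),
    ("angry", 260.0, 74.0, 5.0),
    ("angry", 255.0, 73.0, 5.2),
    ("sad", 225.0, 58.0, 3.2),
    ("sad", 220.0, 57.0, 3.0),
]


def zscore_columns(rows):
    """Standardize each feature column so that no unit dominates the distance."""
    columns = list(zip(*[row[1:] for row in rows]))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    scaled = [
        (row[0], tuple((v - m) / s for v, (m, s) in zip(row[1:], stats)))
        for row in rows
    ]
    return scaled, stats


def centroids(scaled_rows):
    """Average the standardized features per emotion (one profile per label)."""
    by_label = {}
    for label, feats in scaled_rows:
        by_label.setdefault(label, []).append(feats)
    return {
        label: tuple(mean(col) for col in zip(*feats_list))
        for label, feats_list in by_label.items()
    }


def predict(features, stats, cents):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    z = tuple((v - m) / s for v, (m, s) in zip(features, stats))
    return min(
        cents,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(z, cents[label])),
    )


scaled, stats = zscore_columns(TOY_ROWS)
cents = centroids(scaled)
print(predict((258.0, 73.5, 5.1), stats, cents))
```

With real data, voice-quality measures would be added as further columns, and a held-out set would be needed to estimate accuracy.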

Similar Papers
  • Research Article
  • Cite Count Icon 14
  • 10.1016/j.jvoice.2022.08.001
Bright Voice Quality and Fundamental Frequency Variation in Non-binary Speakers
  • Oct 6, 2022
  • Journal of voice : official journal of the Voice Foundation
  • Brown Leann + 1 more


  • Research Article
  • Cite Count Icon 2
  • 10.1016/j.jvoice.2023.01.008
The Relationship Between Pitch Discrimination and Fundamental Frequency Variation: Effects of Singing Status and Vocal Hyperfunction
  • Feb 6, 2023
  • Journal of voice : official journal of the Voice Foundation
  • Allison S Aaron + 5 more


  • Research Article
  • Cite Count Icon 85
  • 10.1176/appi.ajgp.13.11.926
Emotion-Discrimination Deficits in Mild Alzheimer Disease
  • Nov 1, 2005
  • American Journal of Geriatric Psychiatry
  • C G Kohler


  • Research Article
  • Cite Count Icon 7
  • 10.1016/j.jvoice.2022.05.005
Quality of Voice in Patients With Partial Deafness Before and After Cochlear Implantation
  • Jun 3, 2022
  • Journal of Voice
  • Karol Myszel + 1 more


  • Research Article
  • Cite Count Icon 9
  • 10.1016/j.specom.2016.01.002
Generating tonal distinctions in Mandarin Chinese using an electrolarynx with preprogrammed tone patterns
  • Jan 21, 2016
  • Speech Communication
  • Liana Guo + 2 more


  • Research Article
  • Cite Count Icon 1
  • 10.21518/ms2025-029
Assessment of the possibility of acoustic voice analysis
  • May 24, 2025
  • Meditsinskiy sovet = Medical Council
  • I S Timerbulatov + 4 more

Introduction. Voice disorders occur in approximately 30% of the country’s population. The most studied characteristics of the voice include fundamental frequency, pitch and amplitude, harmonic-to-noise ratio, cepstral peak prominence, acoustic voice quality index, maximum phonation time, variations in fundamental frequency and number of pauses in speech signals. Aim. Literature review assessing the possibility of acoustic voice analysis in patients with dysphonia. Materials and methods. The authors searched for publications in the electronic databases PubMed, Web of Science, Google Scholar and ELibrary. The search was carried out using the following keywords: “voice acoustic analysis”, “voice disorder”, “artificial neural network”, “dysphonia”, “standard deviation of fundamental frequency”, “voice quality”, “acoustic voice analysis”. Results and discussion. Fundamental frequency may be more sensitive to objective clinical assessment of voice than pitch and amplitude. Cepstral peak prominence is an integral part of the acoustic analysis of the voice, helping to determine the differences between dysphonic and normal voices. Cepstral analysis is more sensitive to subtle dysphonic changes than vowel analysis methods. Despite the high analytical accuracy and ease of use of machine learning, as well as the promise of this approach in the diagnosis of dysphonia, the clinical application of this technology requires further research. Conclusions. Acoustic analysis of voice offers numerous advantages such as non-invasiveness, cost-effectiveness, and ease of use, facilitating the acquisition of objective data for evaluating the severity of voice disorders and serving as an indispensable tool for identifying pathologies associated with phonation disturbances. According to the literature, the most informative parameters of acoustic voice analysis include fundamental frequency metrics, pitch and amplitude indices, cepstral peak prominence, voice quality index, maximum phonation time, and the relative noise level in the speech signal.

  • Research Article
  • 10.15584/sar.2023.20.10
On the role of pauses – a qualitative and quantitative analysis of selected political speeches in the European Parliament
  • Dec 29, 2023
  • Studia Anglica Resoviensia
  • Karin Semaník Miklóssiová

This paper presents an analysis of speech pauses occurring in selected political speeches, with a focus on both filled and silent types. The paper aims to highlight differences among pause categories in spoken language and assess potential gender differences. By analyzing speeches from male and female speakers in specific syntactic contexts, the paper reveals limited variations in fundamental frequency and pause duration. While certain subcategories exhibit slight differences in frequency, consistent patterns are lacking. Findings indicate that filled, hesitation, politeness, and perturbation pauses tend to be longer, whereas specification and personal stance pauses tend to have lower frequencies. Enumeration, opposition, and segmenting pauses strategically support a speaker's point, aligned with sentence structure. Investigating fundamental frequency trends before pauses demonstrates that a decrease in frequency signifies unit culmination, while an increase suggests non-final positions within units. The paper concludes that disparities exist among pause types, though differences between male and female speakers are generally minor. This emphasizes the need to consider multiple factors, including duration, frequency, and syntactic context, for comprehensive pause definitions. Overall, the paper provides insight into speech pause attributes, variations, and their significance in conveying meaning, thus enriching our understanding of speech patterns and communication strategies.

  • Research Article
  • Cite Count Icon 1
  • 10.1121/10.0035084
The neural representation of emotional cues investigated using the speech frequency following response: A potential tool to evaluate speech prosody
  • Oct 1, 2024
  • The Journal of the Acoustical Society of America
  • Maryam Karimi Boroujeni + 4 more

Background: The Speech-evoked Frequency Following Response (sFFR) provides spectro-temporal data on speech processing in the auditory system. Its effectiveness in extracting prosodic features such as variations in fundamental frequency (F0 contour) and intensity is uncertain. Objectives: This study examines how well the sFFR tracks the F0 contour in different emotions using a natural two-syllable word. It also explores the impact of talker gender on F0 contours and gender disparity in encoding prosodic cues. Method: The word “balloon”, spoken by male and female speakers with sad and happy emotions, elicited FFRs from 16 adults (8 males, aged 18–31). A pitch estimation algorithm calculated root mean squared error and 5% accuracy to evaluate the response’s fidelity to the F0 contour under different conditions. Results: The sFFR tracked prosodic speech features, influenced by emotion type and talker voice characteristics. Participants identified emotions most accurately from sad male voices. Lower F0 trajectories corresponded to more reliable FFR responses, showing better tracking of male voices and sad emotions. No significant gender-related differences were observed in emotional data processing. Conclusion: These findings highlight the sFFR’s utility in capturing dynamic speech properties and its potential in clinical assessments. Future research should explore prosody processing in hearing-impaired individuals and consider integrating the sFFR into diagnostic protocols.
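The two fidelity measures named in the Method can be sketched as follows. This is an illustration under assumptions, not the study's pipeline: "5% accuracy" is interpreted here as the fraction of frames whose estimated F0 lies within 5% of the stimulus F0, and the function names and sample values are hypothetical.

```python
import math


def rmse(reference, estimate):
    """Root mean squared error between two equal-length F0 contours (Hz)."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("contours must be non-empty and of equal length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_within(reference, estimate, tolerance=0.05):
    """Share of frames whose estimate is within `tolerance` of the reference.

    Assumed reading of "5% accuracy"; the abstract does not define it.
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("contours must be non-empty and of equal length")
    hits = sum(abs(e - r) <= tolerance * r for r, e in zip(reference, estimate))
    return hits / len(reference)


# Hypothetical frame-wise F0 values in Hz (stimulus vs. response estimate).
stimulus_f0 = [100.0, 200.0]
response_f0 = [103.0, 196.0]
print(rmse(stimulus_f0, response_f0))
print(fraction_within(stimulus_f0, response_f0))
```

A lower RMSE and a higher within-tolerance fraction would both indicate closer tracking of the stimulus contour.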

  • Research Article
  • Cite Count Icon 30
  • 10.1016/j.jvoice.2005.04.005
Multidimensional Scaling of Breathy Voice Quality: Individual Differences in Perception
  • Sep 13, 2005
  • Journal of Voice
  • Rahul Shrivastav


  • Research Article
  • Cite Count Icon 9
  • 10.1111/psyp.13944
Infants' neutral facial expressions elicit the strongest initial attentional bias in adults: Behavioral and electrophysiological evidence.
  • Sep 22, 2021
  • Psychophysiology
  • Yun Cheng Jia + 6 more

Recent studies that used adult faces as the baseline have revealed that attentional bias toward infant faces is stronger for neutral expressions than for happy and sad expressions. However, the time course of this strongest attentional bias toward infant neutral expressions is unclear. To clarify this time course, we combined a behavioral dot-probe task with electrophysiological event-related potentials (ERPs) to measure adults' responses to infant and adult faces with happy, neutral, and sad expressions derived from the same face. The results indicated that compared with the corresponding expressions in adult faces, attentional bias toward infant faces with various expressions resulted in different patterns during rapid and prolonged attention stages. In particular, first, neutral expressions in infant faces elicited greater behavioral attentional bias and P1 responses than happy and sad ones did. Second, sad expressions in infant faces elicited greater N170 responses than neutral and happy ones did; notably, sad expressions elicited greater N170 responses in the left hemisphere in women than in men. Third, late positive potential (LPP) responses were greater for infant faces than for adult faces under each expression condition. Thus, we propose a three-stage model of attentional allocation patterns that reveals the time course of attentional bias toward infant faces with various expressions. This model highlights the prominent role of neutral facial expressions in the attentional bias toward infant faces.

  • Research Article
  • Cite Count Icon 4
  • 10.1521/pedi_2021_35_514
Facial Emotion Perception in Families Affected With Borderline Personality Disorder.
  • Mar 1, 2021
  • Journal of personality disorders
  • Tahira Gulamani + 3 more

Emotion perception biases may precipitate problematic interpersonal interactions in families affected with borderline personality disorder (BPD) and lead to conflictual relationships. In the present study, the authors investigated the familial aggregation of facial emotion recognition biases for neutral, happy, sad, fearful, and angry expressions in probands with BPD (n = 89), first-degree biological relatives (n = 67), and healthy controls (n = 87). Relatives showed comparable accuracy and response times to controls in recognizing negative emotions in aggregate and most discrete emotions. For sad expressions, both probands and relatives displayed slower response latencies, and they were more likely than controls to perceive sad expressions as fearful. Nonpsychiatrically affected relatives were slower than controls in responding to negative emotional expressions in aggregate, and fearful and sad facial expressions more specifically. These findings uncover potential biases in perceiving sad and fearful facial expressions that may be transmitted in families affected with BPD.

  • Research Article
  • Cite Count Icon 10
  • 10.1016/j.applanim.2023.106146
Context effects on duration, fundamental frequency, and intonation in human-directed domestic cat meows
  • Dec 22, 2023
  • Applied Animal Behaviour Science
  • Susanne Schötz + 2 more

In this study, we investigated the prosody of domestic cat meows produced in different contexts. Prosodic cues – i.e., variation in intonation, duration, voice quality and fundamental frequency – in humans as well as in nonhuman animals carry information about idiosyncratic traits of the signaller, including sex, age, and physical and mental state. The duration, fundamental frequency (F0) and intonation in a sample of 969 meows recorded in seven different contexts (i.e., cuddle, door, food, greeting, lifting, play, cat carrier) were analysed using linear mixed effects regression and generalized additive models. In this, we controlled for cat age and sex, as meows produced by old cats had lower mean F0 than those produced by young cats, and female cats produced meows with higher mean F0 than male cats. We found significant effects of context on duration and mean F0, but not on F0 range. Furthermore, the results showed that the intonation of meows produced by cats in a cat carrier displayed a falling pattern, while that of meows produced in cuddle and door contexts was relatively level, and that of meows produced in the other contexts consisted of combinations of rising and falling. The average slope of meows produced in cat carrier and play contexts was negative, while that of meows produced in the other contexts was positive. We argue that this prosodic variation reflects the cats’ mental or emotional state, because of valence and arousal differences associated with the various contexts that were included in the study. Further studies will need to confirm this. In addition, we also plan additional analyses of spectral and voice quality parameters in meows and other cat vocalisation types.

  • Research Article
  • 10.1121/1.403789
Separation of trends and subharmonic structures from random variations in fundamental frequency and amplitude.
  • Apr 1, 1992
  • The Journal of the Acoustical Society of America
  • David A Berry + 1 more

An implicit assumption in voice perturbation analysis is that one is dealing with small random variations in fundamental frequency and amplitude. However, many irregularities in voiced speech are not necessarily small or random (e.g., subharmonics, amplitude modulations, frequency modulations, linear trends). In this study, various processing schemes are employed to detect/remove trends and subharmonics from fundamental frequency and amplitude contours. Voice perturbation measures are calculated before and after application of the detection/removal techniques. Separating trends and subharmonics from random perturbations in voice analysis may prove useful in classifying and identifying voice disorders. Results of analysis on a variety of subjects will be presented.
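The abstract does not specify its processing schemes, so the sketch below only illustrates the general idea with one assumed choice: a least-squares linear trend is removed from an F0 contour, and a simple relative perturbation measure (mean absolute cycle-to-cycle difference divided by the mean) is computed before and after. The function names and the synthetic contour are hypothetical.

```python
from statistics import mean


def linear_fit(values):
    """Least-squares slope and intercept of values against their index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def detrend(values):
    """Subtract the fitted line but keep the original mean level."""
    slope, intercept = linear_fit(values)
    y_bar = mean(values)
    return [y - (slope * x + intercept) + y_bar for x, y in enumerate(values)]


def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Synthetic contour in Hz: a pure linear rise with no random perturbation.
contour = [100.0 + 2.0 * i for i in range(10)]
print(relative_perturbation(contour))           # inflated by the trend alone
print(relative_perturbation(detrend(contour)))  # close to zero once removed
```

The example shows why a trend matters: a steadily rising contour yields a nonzero perturbation value even though it contains no random variation at all.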

  • Research Article
  • Cite Count Icon 7
  • 10.1016/j.joms.2017.11.017
A Quantitative Assessment of Lip Movements in Different Facial Expressions Through 3-Dimensional on 3-Dimensional Superimposition: A Cross-Sectional Study
  • Nov 23, 2017
  • Journal of Oral and Maxillofacial Surgery
  • Daniele Gibelli + 4 more


  • Conference Article
  • Cite Count Icon 22
  • 10.1109/icassp.2009.4960640
Modeling instantaneous intonation for speaker identification using the fundamental frequency variation spectrum
  • Apr 1, 2009
  • Kornel Laskowski + 1 more

In recent years, the field of automatic speaker identification has begun to exploit high-level sources of speaker-discriminative information, in addition to traditional models of spectral shape. These sources include pronunciation models, prosodic dynamics, pitch, pause, and duration features, phone streams, and conversational interaction. As part of this broader thrust, we explore a new frame-level vector representation of the instantaneous change in fundamental frequency, known as fundamental frequency variation (FFV). The FFV spectrum consists of 7 continuous coefficients, and can be directly modeled in a standard Gaussian mixture model (GMM) framework. Our experiments indicate that FFV features contain useful information for discriminating among speakers, and that model-space combination of FFV and cepstral features outperforms cepstral features alone. In particular, our results on 16kHz Wall Street Journal data show relative reductions in error rate of 54% and 40% for female and male speakers, respectively.
