Abstract: Emotions are the cornerstone of human communication, enabling speakers to convey their inner thoughts and feelings to others. Speech Emotion Recognition (SER) aims to decipher the emotional nuances embedded in a speaker's voice signal. Universal emotions such as neutrality, anger, happiness, and sadness form the basis of this recognition process, allowing fundamental emotional states to be identified. To achieve this, spectral and prosodic features are leveraged, each offering unique insights into the emotional content of speech. Spectral features, exemplified by Mel Frequency Cepstral Coefficients (MFCCs), characterize the frequency distribution of the speech signal, while prosodic features, including fundamental frequency, volume, pitch, speech intensity, and glottal parameters, capture the rhythmic and tonal variations indicative of different emotional states. By integrating these features, SER systems can model and classify a diverse range of emotional expressions, paving the way for enhanced human-computer interaction and communication technologies.
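To make the MFCC pipeline mentioned above concrete, the following is a minimal NumPy sketch of the standard extraction steps (framing and windowing, power spectrum, mel filterbank, log compression, DCT-II). All parameter values (frame length, hop size, filter count, etc.) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr, fmin=0.0, fmax=None):
    # Triangular filters spaced evenly on the mel scale
    # (assumption: the common HTK-style mel formula).
    fmax = fmax or sr / 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = inv_mel(np.linspace(mel(fmin), mel(fmax), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Split the signal into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Per-frame power spectrum via the FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies, log-compressed.
    log_e = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4. DCT-II decorrelates the log energies into cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * n + 1) / (2 * n_filters)))
    return log_e @ dct.T

# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
feats = mfcc(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 13): 98 frames, 13 coefficients each
```

In a full SER system, per-frame MFCCs like these would typically be summarized (e.g. mean and variance over an utterance) and combined with prosodic measurements before being fed to a classifier.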