This study proposes a new approach to Speech Emotion Recognition (SER) that combines a Mutual Information (MI)-based feature selection strategy with simple machine learning classifiers such as K-Nearest Neighbor (KNN), Gaussian Mixture Model (GMM), and Support Vector Machine (SVM), along with a voting rule method. The main contributions of this approach are twofold. First, it significantly reduces the complexity of the SER system by mitigating the curse of dimensionality through a focused feature selection process, yielding considerable savings in both computational time and memory usage. Second, it enhances classification accuracy by using only the selected features, demonstrating their effectiveness in improving the overall performance of the SER system. Experiments carried out on the EMODB dataset, using various feature descriptors, including Mel-frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Linear Prediction Cepstral Coefficients (LPCC), showed that the best performance was achieved by GMM, with an accuracy of 85.27% using 39 MFCC features, compared to an accuracy of 82.55% using a high-dimensional vector with 111 features. Furthermore, applying the Joint Mutual Information (JMI) selection technique to the extracted MFCC features reduces the vector size by 23.07% while improving the accuracy to 86.82%. These results highlight the effectiveness of combining the feature selection process with machine learning algorithms and the voting rule method for the SER task.
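The pipeline described above (an MI-based filter ranking acoustic features, followed by simple classifiers fused with a voting rule) can be sketched with scikit-learn. This is a minimal illustration, not the paper's implementation: the synthetic 111-dimensional data stands in for the EMODB MFCC/PLP/LPCC vectors, the number of kept features `k` is an arbitrary choice, plain MI ranking is used in place of JMI (which scikit-learn does not provide), and the GMM classifier is omitted from the ensemble for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 111-dimensional acoustic feature vector
# (concatenated MFCC/PLP/LPCC); the paper's experiments use EMODB.
X, y = make_classification(n_samples=400, n_features=111,
                           n_informative=30, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# MI-based filter selection: rank features by mutual information
# with the emotion label and keep the top k (k=30 is illustrative).
k = 30
mi_scores = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(mi_scores)[-k:]
X_sel = X[:, top_k]

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)

# Majority-vote fusion of simple classifiers (the paper also fuses a
# per-class GMM, left out here to keep the sketch short).
vote = VotingClassifier(
    estimators=[
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("svm", make_pipeline(StandardScaler(), SVC())),
    ],
    voting="hard",
)
vote.fit(X_tr, y_tr)
acc = vote.score(X_te, y_te)
```

The filter step runs once before training, so the classifiers only ever see the reduced `k`-dimensional vectors, which is where the computational and memory savings claimed in the abstract come from.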