Abstract

State-of-the-art Automatic Speech Recognition (ASR) systems convert spoken words into corresponding text. One of the challenges in ASR is that speakers pronounce words differently, and accents vary from one speaker to another due to age, gender, nationality, speaking rate, and the speaker's expressive style. This paper uses two data sets, the Surrey Audio-Visual Expressed Emotion (SAVEE) data set and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), to determine the effect of tone in the learning environment using ASR and to identify which classifier gives the best result. Features such as energy and Mel-Frequency Cepstral Coefficients were extracted using jAudio, and the Waikato Environment for Knowledge Analysis (WEKA) data mining tool was used for classification. Five classifiers, a Multilayer Perceptron (MLP) neural network, Support Vector Machines (SVM), Simple Logistic Regression (SLR), K-Nearest Neighbour (K-NN), and Random Forests (RF), were used to predict the emotional state for both data sets. The data sets used to train the classifiers are in ARFF format, and the performance of the classification models is evaluated in WEKA using 10-fold cross-validation. The results show that the SAVEE data set outperforms the RAVDESS data set in overall emotion classification performance, and that RF performs better than the other classifiers. The study examines seven emotions: anger, happiness, sadness, fear, surprise, disgust, and neutral.
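As a rough illustration of the evaluation workflow described above, the sketch below uses WEKA's Java API to load an ARFF file of extracted audio features and evaluate a Random Forest classifier with 10-fold cross-validation. The file name savee_features.arff and the assumption that the emotion label is the last attribute are illustrative, not taken from the paper.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class EmotionClassificationDemo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file of features extracted with jAudio (path is hypothetical)
        Instances data = new DataSource("savee_features.arff").getDataSet();

        // Assume the emotion label is stored as the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        // Random Forest, the classifier reported to perform best in the study
        RandomForest rf = new RandomForest();

        // Evaluate with 10-fold cross-validation, matching the paper's protocol
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));

        // Print overall accuracy and per-class statistics
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```

The same loop could be repeated with the other classifiers named in the abstract (MLP, SVM, SLR, K-NN) to compare their cross-validated performance on both data sets.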
