Abstract

A multimodal emotion recognition system using speech and facial images is proposed. For this purpose, a video database was developed containing emotions in three affective states, viz. anger, sadness, and happiness. The audio and the snapshots of facial expressions extracted from the videos constitute the bimodal input for recognizing emotions. The spoken sentences in the database include both text-dependent and text-independent sentences in the Malayalam language. The audio features, obtained by short-time processing of speech, are energy, zero-crossing count, pitch, and Mel-Frequency Cepstral Coefficients (MFCCs). For facial expressions, landmark features of the face (eyebrows, eyes, and mouth), located using the Viola-Jones algorithm, are used. The supervised learning methods K-Nearest Neighbor and Artificial Neural Network are used for emotion analysis. System performance is evaluated for three cases: audio features alone, facial features alone, and both feature sets taken together. Further, the effect of text-dependent versus text-independent audio is also analyzed. The results show that, for the database considered, K-Nearest Neighbor applied to text-independent videos utilizing both modalities is the most effective, achieving the highest accuracy of 82.78%.
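To make the described pipeline concrete, the sketch below shows one way the bimodal feature extraction and classification could be assembled in Python, using librosa for the short-time audio features, OpenCV's Haar-cascade implementation of Viola-Jones for the facial regions, and scikit-learn for K-Nearest Neighbor. The file paths, sampling and window settings, the geometric face descriptor, and the feature-level fusion are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the bimodal pipeline summarized in the abstract.
# Assumed configuration throughout; not the paper's exact setup.
import cv2
import librosa
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def audio_features(wav_path):
    """Short-time audio features (energy, ZCC, pitch, MFCCs),
    summarized by their means over the utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    energy = librosa.feature.rms(y=y)                  # short-time energy
    zcc = librosa.feature.zero_crossing_rate(y)        # zero-crossing count
    f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)      # pitch contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.hstack([energy.mean(), zcc.mean(),
                      np.mean(f0), mfcc.mean(axis=1)])

def face_features(image_path):
    """Viola-Jones (Haar cascade) detection of the face and eyes;
    bounding-box geometry stands in for the paper's landmark features
    (eyebrows and mouth would be handled analogously)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    face_cc = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye_cc = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_eye.xml")
    # Assumes exactly one frontal face is visible in the snapshot.
    x, y, w, h = face_cc.detectMultiScale(img, 1.1, 5)[0]
    roi = img[y:y + h, x:x + w]
    eyes = eye_cc.detectMultiScale(roi, 1.1, 5)
    # Crude geometric descriptor: eye boxes normalized by face width.
    feats = [e / w for box in eyes[:2] for e in box]
    feats += [0.0] * (8 - len(feats))                  # pad if < 2 eyes found
    return np.array(feats)

def bimodal_vector(wav_path, image_path):
    """Feature-level fusion of the two modalities."""
    return np.hstack([audio_features(wav_path), face_features(image_path)])

# Training on labeled samples (anger / sadness / happiness), then predicting:
# X = np.vstack([bimodal_vector(w, i) for w, i in samples])
# knn = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
# knn.predict(bimodal_vector("test.wav", "test.png").reshape(1, -1))
```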
