Abstract
Most emotion recognition systems cannot perform real-time emotion recognition because of latencies introduced by phrase segmentation, resource-intensive feature extraction, and similar processing steps. To address this issue, we present an emotion recognition approach that estimates speaker emotion with much lower latency. The proposed approach does not rely on phrase-level features to recognize speaker emotion; rather, it estimates the speaker's emotional state incrementally over the course of the utterance, using a shifting n-word window and easily computable features. These features are obtained from three information streams, i.e., cepstral, prosodic, and textual, at the word level and combined at the decision level using a statistical framework. Our results show that combining the three information streams yields higher emotion recognition accuracy than any single stream alone. Extracting features from n-word sequences rather than whole phrases gives the proposed system its low-latency capability without any loss in utterance-level emotion recognition accuracy. On a binary utterance-level emotion recognition task using an in-house database, the proposed system achieves a relative improvement of 41% over chance, compared to a relative improvement of 31.82% for the baseline phrase-level emotion recognition approach.
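The abstract does not specify the statistical fusion framework, so the following Python sketch shows only one plausible reading of the pipeline: a shifting n-word window over the transcript, with per-stream class posteriors combined at the decision level by a simple weighted linear opinion pool. The function names, the fusion rule, and the stubbed per-stream posteriors are all illustrative assumptions, not the authors' implementation.

    import numpy as np

    EMOTIONS = ["negative", "non-negative"]  # binary task, as in the abstract

    def fuse_decisions(stream_posteriors, weights=None):
        """Combine per-stream class posteriors with a weighted linear sum.

        stream_posteriors: one posterior vector per stream (cepstral,
        prosodic, textual), each of shape (num_classes,) and summing to 1.
        The uniform-weight default is an assumption; the paper's statistical
        framework may weight or combine streams differently.
        """
        posteriors = np.asarray(stream_posteriors, dtype=float)
        if weights is None:
            weights = np.full(len(posteriors), 1.0 / len(posteriors))
        fused = np.average(posteriors, axis=0, weights=weights)
        return fused / fused.sum()

    def sliding_windows(words, n):
        """Yield successive n-word windows, shifting by one word at a time."""
        for start in range(max(1, len(words) - n + 1)):
            yield words[start:start + n]

    # Toy usage: the three stream classifiers are stubbed with fixed
    # posteriors per window; a real system would compute them from
    # word-level cepstral, prosodic, and textual features.
    words = "i am really not happy with this service".split()
    for window in sliding_windows(words, n=3):
        cepstral = np.array([0.6, 0.4])  # [negative, non-negative]
        prosodic = np.array([0.7, 0.3])
        textual = np.array([0.8, 0.2])
        fused = fuse_decisions([cepstral, prosodic, textual])
        label = EMOTIONS[int(np.argmax(fused))]
        print(" ".join(window), "->", label, fused.round(2))

Because each window needs only the features of the current n words, an incremental estimate of this kind is available as soon as each word arrives, which is the source of the latency advantage the abstract claims over phrase-level processing.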