Abstract

The ability of machines to understand human subjective emotions is an essential step toward realizing artificial intelligence, yet extracting and utilizing emotional information from audio signals remains a challenging task. By transforming acoustic signals into time-frequency representations such as spectrograms, advanced algorithms from computer vision can be applied to acoustic problems. In this paper, we propose a Speech Emotion Recognition (SER) system based on the Swin Transformer (Swin). In addition to verifying the feasibility of Swin for the SER task, we compare the effectiveness of several spectrogram variants under the same model parameters. Our model is validated on the IEMOCAP dataset and achieves competitive performance.

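A minimal sketch of the spectrogram-to-Swin pipeline the abstract describes is shown below. The library choices (librosa for feature extraction, timm for the Swin backbone), the file path, the input resolution, and the 4-class output head are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: audio -> log-mel spectrogram -> Swin Transformer classifier.
# Library choices and parameters are assumptions for illustration only.
import librosa
import numpy as np
import timm
import torch

# Load a waveform and convert it to a log-mel spectrogram "image".
waveform, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Replicate to 3 channels and resize to the input size a vision backbone expects.
spec = torch.tensor(log_mel).unsqueeze(0).repeat(3, 1, 1).unsqueeze(0).float()
spec = torch.nn.functional.interpolate(spec, size=(224, 224), mode="bilinear")

# Swin Transformer backbone with a 4-way emotion head (IEMOCAP is commonly
# evaluated on angry / happy / neutral / sad).
model = timm.create_model("swin_tiny_patch4_window7_224",
                          pretrained=True, num_classes=4)
logits = model(spec)            # shape: (1, 4)
pred = logits.argmax(dim=-1)    # predicted emotion index
```

Treating the spectrogram as an image in this way is what lets pretrained vision backbones such as Swin be reused for SER; swapping the mel spectrogram for another time-frequency representation only changes the feature-extraction step.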