Abstract

Speech emotion recognition, like most audio machine learning tasks, relies on so-called framing: dividing the original audio signal into frames of a certain size, each of which is processed separately. This article compares the effect of frame size on emotion recognition results, using a CNN as an example. For the experiments, the CREMA-D dataset was used, augmented with noise addition, time stretching, and pitch shifting. We achieved a recognition accuracy of 98.8% using a dynamic frame size.
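The framing step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name, frame size, and hop size are illustrative assumptions:

```python
import numpy as np

def frame_signal(signal, frame_size, hop_size):
    """Split a 1-D audio signal into overlapping frames.

    frame_size and hop_size are in samples; the tail that does not
    fill a whole frame is dropped (a common convention).
    """
    num_frames = 1 + (len(signal) - frame_size) // hop_size
    # Build a (num_frames, frame_size) index matrix: each row is the
    # sample indices of one frame, shifted by hop_size per row.
    indices = (np.arange(frame_size)[None, :]
               + hop_size * np.arange(num_frames)[:, None])
    return signal[indices]

# Example: a 1-second signal at 16 kHz, 25 ms frames with a 10 ms hop.
sr = 16000
signal = np.random.randn(sr)
frames = frame_signal(signal, frame_size=int(0.025 * sr), hop_size=int(0.010 * sr))
print(frames.shape)  # (98, 400)
```

Each resulting frame can then be converted to a spectral feature (e.g. a mel spectrogram) and fed to the CNN; varying `frame_size` is what the article's comparison is about.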
