Abstract
Speech emotion recognition is an important branch of natural language processing that aims to automatically recognize and classify the emotional information in speech. Research on Tibetan speech emotion recognition remains relatively limited, and some existing studies rely on cumbersome, complex feature extraction steps. To address this, a new lightweight network model based on a capsule network is proposed. The model uses only Mel Frequency Cepstral Coefficients (MFCC) as input features, extracts the spatiotemporal information of the MFCC through multiple convolutional layers, and passes it to the capsule network for deeper analysis. The model achieved a recognition rate of 81.52% on the self-built Tibetan emotion corpus TBSEC001. It also achieved an unweighted accuracy (UA) of 85.63% and 95.54% on the public EMO-DB and RAVDESS corpora, respectively, demonstrating the method's effectiveness.
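The abstract does not define unweighted accuracy. In speech emotion recognition it is commonly taken to be the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The sketch below illustrates that common definition only; the function name `unweighted_accuracy` and the toy labels are assumptions for illustration, not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Return the mean of per-class recalls (assumed definition of UA)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of utterances per true class
    correct = defaultdict(int)  # correctly recognized utterances per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall is computed per class, then averaged with equal class weights.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: three "angry" utterances and one "happy" utterance.
toy_true = ["angry", "angry", "angry", "happy"]
toy_pred = ["angry", "angry", "sad", "happy"]
toy_ua = unweighted_accuracy(toy_true, toy_pred)
```

In this toy case the recall is 2/3 for "angry" and 1 for "happy", so UA is 5/6, whereas plain (weighted) accuracy would be 3/4. The gap shows why UA is often preferred for corpora with imbalanced emotion classes.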