Abstract

Previous speech emotion recognition (SER) methods typically handle variable-length utterance inputs by padding shorter utterances or clipping longer ones to a fixed length, which may introduce uninformative padding or discard useful emotional segments. To address this issue, in this paper, we cast SER as a graph classification task by transforming variable-length utterances into graphs, avoiding padding or cutting altogether. In our approach, the frames (short windowed segments) of an utterance are represented as nodes in a graph. Acoustic features extracted from the frames serve as node feature vectors, and nodes are connected according to their temporal relationships. Different graph convolutional networks (GCNs) are explored for node/frame embedding learning, and several graph pooling methods are compared for deriving a graph/utterance-level emotional representation from the node embeddings. Extensive experiments with different GCN components and pooling mechanisms are conducted on the IEMOCAP and MSP-IMPROV datasets. The experimental results show that the combination of GraphSAGE with multi-head attention pooling (MHAPool) achieves the best weighted accuracy (WA) and comparable unweighted accuracy (UA) on both datasets relative to other state-of-the-art SER models, demonstrating the effectiveness of the proposed graph-based network for the SER task.
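To make the pipeline concrete, the sketch below illustrates the three stages the abstract names: frames become graph nodes joined by temporal edges, a GraphSAGE-style layer learns node embeddings, and multi-head attention pooling collapses them into a single utterance-level vector. This is not the authors' implementation; the line-graph adjacency, mean aggregator, feature dimension (40), head count (4), and class count (4) are all illustrative assumptions.

```python
# Minimal sketch of the graph-based SER pipeline described in the abstract.
# All dimensions and the exact layer designs are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def temporal_adjacency(num_frames: int) -> torch.Tensor:
    """Line-graph adjacency: frame t is linked to frames t-1 and t+1."""
    adj = torch.zeros(num_frames, num_frames)
    idx = torch.arange(num_frames - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return adj

class SAGELayer(nn.Module):
    """GraphSAGE-style layer with a mean aggregator: concatenate each
    node's own features with the mean of its neighbors' features."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neigh_mean = adj @ x / deg               # mean over temporal neighbors
        return F.relu(self.lin(torch.cat([x, neigh_mean], dim=-1)))

class MHAPool(nn.Module):
    """Multi-head attention pooling: each head computes a softmax weighting
    over the nodes; the heads' pooled vectors are concatenated and projected."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.score = nn.Linear(dim, num_heads)   # one attention score per head
        self.out = nn.Linear(num_heads * dim, dim)

    def forward(self, x):                        # x: (num_nodes, dim)
        attn = torch.softmax(self.score(x), dim=0)  # (num_nodes, num_heads)
        pooled = attn.T @ x                      # (num_heads, dim)
        return self.out(pooled.flatten())        # utterance-level vector

# Toy forward pass: 120 frames, 40-dim acoustic features, 4 emotion classes.
x = torch.randn(120, 40)
adj = temporal_adjacency(120)
h = SAGELayer(40, 64)(x, adj)
utt = MHAPool(64)(h)
logits = nn.Linear(64, 4)(utt)
```

Because the graph is built per utterance, a 50-frame and a 500-frame input pass through the same layers unchanged, which is precisely how the approach sidesteps padding and clipping.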
