Speech emotion recognition (SER) has received increasing attention because of its wide range of applications, especially the analysis of teacher-student dialogue in classroom environments, where it can help teachers better understand students’ emotions and adjust their teaching activities accordingly. However, SER faces several challenges, such as the intrinsic ambiguity of emotions and the difficulty of interpreting emotions from speech in noisy environments. These issues can reduce recognition accuracy when a model focuses on less relevant or insignificant features. To address these challenges, this paper presents ESERNet, a Transformer-based model designed to extract crucial emotional information from speech by capturing both pivotal cues and long-range relationships in the speech signal. The major contribution of our approach is a two-pathway SER framework. By leveraging the Transformer architecture, ESERNet captures long-range dependencies within speech mel-spectrograms, enabling a refined understanding of the emotional cues embedded in speech signals. Extensive experiments were conducted on the IEMOCAP and EmoDB datasets. The results show that ESERNet achieves state-of-the-art performance in SER and outperforms existing methods by leveraging critical cues and capturing long-range dependencies in speech data. These results highlight the effectiveness of the model in addressing the challenges of SER.