Abstract

Emotions in music education affect learners' cognitive activity, and a teacher who fails to capture learners' emotional changes in time cannot adjust teaching strategies accordingly. In this paper, a convolutional neural network is used to extract speech and visual emotion features from students during music instruction. The speech and visual emotion modalities are fused by a spatial plane fusion method, and a cross-modal interactive attention mechanism is introduced to optimize the fusion of the multimodal emotion features. A support vector machine then identifies and classifies the emotion features. The study shows that the proposed multimodal emotion recognition model achieves an emotion recognition accuracy of 88.78%, accurately recognizes students' emotional states, and helps teachers intervene effectively in students' negative emotions. In music classrooms applying this technology, students' average test score in the music program is 93.70, and their willingness to learn music averages 95.09%. The multimodal emotion recognition model presented here helps teachers implement effective interventions in music education and lays a foundation for improving students' interest in music learning.
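To make the described pipeline concrete, the following is a minimal sketch (not the authors' code) of its three stages: CNN feature extractors for the speech and visual inputs, a cross-modal interactive attention step that fuses the two feature streams, and an SVM classifier on the fused features. All layer sizes, the specific attention formulation, the input shapes, and the four-class label set are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

class ModalityCNN(nn.Module):
    """Small CNN mapping a 2D input (e.g. a spectrogram or a face crop)
    to a sequence of feature vectors, one per spatial location."""
    def __init__(self, in_channels: int, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):                       # (B, C, H, W)
        f = self.net(x)                         # (B, dim, H/4, W/4)
        return f.flatten(2).transpose(1, 2)     # (B, N, dim) token sequence

class CrossModalAttention(nn.Module):
    """Each modality attends to the other; the two attended streams are
    pooled and concatenated into a single fused emotion feature."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):           # (B, Na, d), (B, Nv, d)
        a_att, _ = self.a2v(audio, visual, visual)  # audio queries visual
        v_att, _ = self.v2a(visual, audio, audio)   # visual queries audio
        return torch.cat([a_att.mean(1), v_att.mean(1)], dim=-1)  # (B, 2d)

# Usage: fit an SVM on fused features from (here, random) placeholder data.
audio_cnn, visual_cnn = ModalityCNN(1), ModalityCNN(3)
fusion = CrossModalAttention()
with torch.no_grad():
    spec = torch.randn(16, 1, 64, 64)   # stand-in log-mel spectrograms
    face = torch.randn(16, 3, 64, 64)   # stand-in face crops
    feats = fusion(audio_cnn(spec), visual_cnn(face)).numpy()
labels = [i % 4 for i in range(16)]     # 4 placeholder emotion classes
clf = SVC(kernel="rbf").fit(feats, labels)
print(clf.predict(feats[:4]))
```

In this reading, the bidirectional attention (speech attending to vision and vice versa) stands in for the paper's cross-modal interactive attention, and the final concatenation stands in for the spatial plane fusion; the SVM then classifies the fused representation, mirroring the stage order given in the abstract.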