In the fields of artificial intelligence and musicology, emotion recognition in music performance has emerged as a pivotal area of research. This paper introduces EmoTrackNet, an integrated deep learning framework that combines a sparse attention network, an enhanced one-dimensional residual Convolutional Neural Network (CNN) with an improved Inception module, and Gated Recurrent Units (GRUs). The combination of these components aims to decode the complex emotional cues embedded in music. Our methodology leverages the sparse attention network to process temporal sequences efficiently, capturing the intricate dynamics of musical pieces. The 1D residual CNN with the improved Inception module extracts nuanced features from audio signals, covering a broad spectrum of musical tones and textures. The GRU component further strengthens the model's ability to retain and process sequential information over longer timeframes, which is essential for tracking evolving emotional expressions in music. We evaluated EmoTrackNet on the Soundtrack dataset, a collection of music pieces annotated with emotional labels. The results show clear improvements in emotion recognition accuracy over existing models. This gain is attributable to the integrated design, which combines the strengths of each component into a more robust and sensitive emotion detection system. EmoTrackNet's architecture and promising results open new avenues in musicology, particularly for understanding and interpreting the emotional depth of musical performances. The framework contributes to the field of music emotion recognition and has potential applications in music therapy, entertainment, and interactive media, where emotional engagement is key.
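To make the described pipeline concrete, the following is a minimal PyTorch sketch of the component layout (1D residual Inception-style CNN, then a GRU over time, then attention pooling and a classification head). All names, channel counts, hidden sizes, and the number of emotion classes are illustrative assumptions, and a standard multi-head attention layer stands in for the paper's sparse attention mechanism, whose exact formulation is not given here.

```python
# Hedged sketch of the EmoTrackNet layout; hyperparameters are placeholders,
# and nn.MultiheadAttention is used as a stand-in for the sparse attention.
import torch
import torch.nn as nn


class Inception1D(nn.Module):
    """Inception-style block over 1D audio features (parallel kernel sizes)."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.b1 = nn.Conv1d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv1d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv1d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):                      # x: (batch, channels, time)
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)


class ResidualInception1D(nn.Module):
    """1D residual block whose main path is the Inception module above."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        out_ch = 3 * branch_ch
        self.inception = Inception1D(in_ch, branch_ch)
        self.skip = nn.Conv1d(in_ch, out_ch, kernel_size=1)  # match channels
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.inception(x) + self.skip(x))


class EmoTrackNetSketch(nn.Module):
    """CNN front end -> GRU over time -> attention pooling -> emotion logits."""
    def __init__(self, n_features=40, n_classes=4, branch_ch=32, hidden=64):
        super().__init__()
        self.cnn = ResidualInception1D(n_features, branch_ch)
        self.gru = nn.GRU(3 * branch_ch, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, n_features, time)
        h = self.cnn(x).transpose(1, 2)        # -> (batch, time, channels)
        h, _ = self.gru(h)                     # sequential modelling
        h, _ = self.attn(h, h, h)              # attention over time steps
        return self.head(h.mean(dim=1))        # pool over time and classify


# Example: a batch of 2 clips, 40 spectral features, 128 frames -> (2, 4) logits.
logits = EmoTrackNetSketch()(torch.randn(2, 40, 128))
```

The sketch only illustrates how the three components compose; the reported results depend on the paper's specific sparse attention design, feature extraction, and training setup, none of which are reproduced here.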