Speech emotion recognition (SER) systems have become essential in many fields, including intelligent healthcare, customer service, call centers, automatic translation, and human–computer interaction. However, current approaches rely predominantly on single frame-level or utterance-level features, offering only a shallow or a deep characterization of the signal, and fail to fully exploit the diverse types, levels, and scales of emotion features. The limited ability of any single feature to capture speech emotion information, together with the inability of simple fusion to combine the complementary advantages of different features, poses significant challenges. To address these issues, this paper presents a novel spatio-temporal representation learning enhanced speech emotion recognition method with multi-head attention mechanisms (STRL-SER). The proposed method integrates fine-grained frame-level features and coarse-grained utterance-level emotion features, employing separate modules to extract deep representations at each level. In the frame-level module, we introduce parallel networks, using a bidirectional long short-term memory network (BiLSTM) and an attention-based multi-scale convolutional neural network (CNN) to capture the spatio-temporal details of diverse frame-level signals. In parallel, we extract deep representations of utterance-level features to learn global speech emotion characteristics. To leverage the advantages of the different feature types, we introduce a multi-head attention mechanism that fuses the deep representations from the two levels while preserving the distinctive qualities of each. Finally, we employ segment-level multiplexed decision making to generate the final classification results. We evaluate the proposed method on two widely used benchmark datasets, IEMOCAP and RAVDESS. The results demonstrate notable performance improvements over previous studies: on IEMOCAP, our method achieves a weighted accuracy (WA) of 81.60% and an unweighted accuracy (UA) of 79.32%; on RAVDESS, it achieves a WA of 88.88% and a UA of 87.85%. These results confirm the substantial improvements achieved by our proposed method.
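To make the described pipeline concrete, the following is a minimal PyTorch sketch of a frame-level branch (parallel BiLSTM and multi-scale CNN with attention) fused with an utterance-level branch via multi-head attention. All module names, dimensions, and feature choices (e.g., 40-dimensional frame-level features, an 88-dimensional utterance-level vector, four emotion classes) are illustrative assumptions rather than the authors' implementation, and segment-level multiplexed decision making is omitted.

```python
# Hypothetical sketch of an STRL-SER-style architecture; all names, sizes,
# and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class FrameLevelBranch(nn.Module):
    """Parallel BiLSTM + multi-scale CNN over frame-level features (e.g. MFCCs)."""

    def __init__(self, n_feats=40, hidden=128, n_heads=4):
        super().__init__()
        # Temporal branch: bidirectional LSTM over the frame sequence.
        self.bilstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        # Spatial branch: 1-D convolutions with different kernel sizes (multi-scale).
        self.convs = nn.ModuleList(
            nn.Conv1d(n_feats, hidden, kernel_size=k, padding=k // 2) for k in (3, 5, 7)
        )
        self.conv_proj = nn.Linear(3 * hidden, 2 * hidden)
        # Attention over time to pool frame representations into one vector.
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, n_feats)
        t, _ = self.bilstm(x)                  # (batch, frames, 2*hidden)
        c = torch.cat([conv(x.transpose(1, 2)) for conv in self.convs], dim=1)
        c = self.conv_proj(c.transpose(1, 2))  # (batch, frames, 2*hidden)
        fused = t + c                          # combine temporal and spatial cues
        pooled, _ = self.attn(fused, fused, fused)
        return pooled.mean(dim=1)              # (batch, 2*hidden)


class STRLSER(nn.Module):
    """Fuses frame-level and utterance-level representations with multi-head attention."""

    def __init__(self, n_frame_feats=40, n_utt_feats=88, hidden=128, n_heads=4, n_classes=4):
        super().__init__()
        self.frame_branch = FrameLevelBranch(n_frame_feats, hidden, n_heads)
        # Utterance-level branch: an MLP over global statistical features.
        self.utt_branch = nn.Sequential(
            nn.Linear(n_utt_feats, 2 * hidden), nn.ReLU(), nn.Linear(2 * hidden, 2 * hidden)
        )
        # Multi-head attention fusion of the two level-specific representations.
        self.fusion = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, frame_x, utt_x):
        f = self.frame_branch(frame_x)         # (batch, 2*hidden)
        u = self.utt_branch(utt_x)             # (batch, 2*hidden)
        tokens = torch.stack([f, u], dim=1)    # treat each level as one token
        fused, _ = self.fusion(tokens, tokens, tokens)
        return self.classifier(fused.mean(dim=1))   # emotion logits


if __name__ == "__main__":
    model = STRLSER()
    frames = torch.randn(8, 300, 40)   # 8 segments, 300 frames, 40 frame-level features
    utts = torch.randn(8, 88)          # 8 utterance-level feature vectors
    print(model(frames, utts).shape)   # torch.Size([8, 4])
```

In this sketch the two level-specific vectors are stacked as tokens so that multi-head attention can weigh them against each other while keeping each level's representation distinct, loosely mirroring the fusion strategy described in the abstract.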