As fitness awareness grows, more people are taking part in fitness activities. Yoga, a form of exercise that benefits both physical and mental health, is becoming increasingly popular worldwide. Automated or semi-automated recognition systems can help yoga practitioners train more effectively, and can help professional athletes correct their movements and improve their performance. To address the low accuracy of current yoga pose recognition algorithms, this paper proposes a method that integrates a multi-head attention mechanism with ensemble learning. First, the Mixup algorithm is used to augment the yoga pose images. Next, convolutional features are extracted from the images using the ResNet101 and VGGNet19 transfer learning models. Finally, the extracted convolutional features are combined and stacked using a multi-head attention mechanism. Model training, validation, and testing are performed with the soft-target cross-entropy loss function. Experimental results show that the proposed method achieves a training accuracy of 100%, a validation accuracy of 89.94%, a testing accuracy of 93.79%, and a detection speed of 297 frames per second. Overall, the method demonstrates high stability and robustness, providing a technological foundation for intelligent recognition of yoga poses.
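
To make the first and last steps of the pipeline concrete, the following is a minimal NumPy sketch of Mixup augmentation and a soft-target cross-entropy loss. It is an illustration under stated assumptions, not the implementation used in this paper: the function names `mixup` and `soft_target_cross_entropy`, the default `alpha=0.2`, and the use of one mixing coefficient per batch are all hypothetical choices, and the framework and hyperparameters actually used are not specified here.

```python
import numpy as np


def mixup(images, labels_onehot, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch.

    alpha is an assumed Beta-distribution parameter (not taken from the paper).
    Returns the mixed images and the correspondingly mixed (soft) labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # one mixing coefficient for the batch
    perm = rng.permutation(len(images))   # partner index for every sample
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y


def soft_target_cross_entropy(logits, soft_targets):
    """Cross-entropy between softmax(logits) and soft label distributions."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Average over the batch of -sum_c target_c * log p_c.
    return float(-(soft_targets * log_probs).sum(axis=1).mean())
```

Because Mixup produces labels that are convex combinations of two one-hot vectors, a loss that accepts full probability distributions as targets is needed, which is why the two pieces are shown together.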