The expansion of virtual and augmented reality, intelligent assistance technology, and related fields has increased the demand for more natural and intuitive human–computer interaction, making gesture recognition an important research direction. Traditional gesture recognition methods rely mainly on image processing and pattern recognition techniques, but their accuracy and robustness suffer in complex backgrounds. In addition, the temporal correlation and spatial information in gesture sequences are not fully exploited, which limits the performance of gesture recognition systems. To address these issues, this study first incorporates the Ghost module for feature extraction into the You Only Look Once version 5 (YOLOv5) algorithm. Then, drawing on the idea of densely connected networks, feature maps are concatenated, and a human–machine interactive gesture recognition algorithm is designed by combining this structure with a hybrid attention mechanism. The experimental results showed that the accuracy of the algorithm converged after 160 iterations, with a final mean average precision (mAP) of 92.19%. Compared with the standard YOLOv5 algorithm, its iteration speed improved by 12.5% and its mAP improved by 4.63%. The proposed human–computer interaction gesture recognition algorithm achieves higher accuracy with smaller error and shows application potential in the field of machine vision.
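To illustrate the core idea behind the Ghost module mentioned above, the sketch below shows a minimal NumPy toy version: a small set of "intrinsic" features is produced by an ordinary (expensive) linear projection, additional "ghost" features are then generated from them by cheap per-channel linear operations, and the two sets are concatenated. This is only an illustrative sketch under assumed shapes and names (`ghost_features`, `intrinsic_w`, `cheap_w` are hypothetical), not the paper's actual implementation, which operates on convolutional feature maps inside YOLOv5.

```python
import numpy as np

def ghost_features(x, intrinsic_w, cheap_w):
    """Toy 1-D analogue of a Ghost module (illustrative, not the paper's code).

    x           : (d,)   input feature vector
    intrinsic_w : (m, d) weights of the expensive projection (analogue of a
                         standard convolution producing m intrinsic channels)
    cheap_w     : (m,)   per-channel scales (analogue of cheap depthwise ops)
    returns     : (2m,)  intrinsic and ghost features concatenated
    """
    intrinsic = intrinsic_w @ x        # expensive step: m intrinsic features
    ghost = cheap_w * intrinsic        # cheap step: one scalar op per channel
    return np.concatenate([intrinsic, ghost])

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W = rng.normal(size=(4, 8))           # 4 intrinsic channels from 8 inputs
s = rng.normal(size=4)
out = ghost_features(x, W, s)         # 8 output channels total
```

The point of the construction is that doubling the channel count costs only m extra multiplications instead of a second full m-by-d projection, which is why the Ghost module reduces computation relative to standard feature extraction.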