With growing government support for the virtual reality (VR) and augmented reality (AR) industry, the field has developed rapidly in recent years. Gesture recognition is an important human-computer interaction method in VR/AR and is widely used in virtual reality applications. Current static gesture recognition technology suffers from low recognition accuracy and slow recognition speed. To address these issues, a static gesture recognition algorithm based on an improved YOLOv5s is proposed. Content-aware reassembly of features (CARAFE) replaces the nearest-neighbor up-sampling in YOLOv5s to make full use of the semantic information in the feature map and improve the model's recognition accuracy for gesture regions. Adaptive spatial feature fusion (ASFF) is introduced to filter out useless information and retain useful information for efficient feature fusion. The bottleneck transformer is introduced into the gesture recognition task for the first time, reducing the number of model parameters and increasing accuracy while accelerating inference. The improved algorithm achieves a mean average precision (mAP) of 96.8%, a 3.1% improvement over the original YOLOv5s algorithm, and the confidence of its detection results is higher than that of the original algorithm.
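To make the ASFF fusion step concrete, the following is a minimal NumPy sketch of how per-pixel softmax weights combine multi-level feature maps that have already been resized to a common resolution. The function name and interface are illustrative assumptions; in the actual ASFF module the fusion logits are produced by learned 1×1 convolutions, which are omitted here.

```python
import numpy as np

def asff_fuse(feats, weight_logits):
    """Fuse same-resolution feature maps with per-pixel softmax weights.

    feats: list of n_levels arrays, each shaped (C, H, W), already
           resized to a common scale (as ASFF requires).
    weight_logits: array (n_levels, H, W) of raw fusion logits; in the
           real module these come from learned 1x1 convolutions.
    """
    # Numerically stable softmax over the level axis: at every spatial
    # position the weights of all levels sum to 1.
    logits = weight_logits - weight_logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)
    # Weighted sum of the feature maps, broadcasting each (H, W) weight
    # map across the channel dimension.
    return sum(wi[None, :, :] * f for wi, f in zip(w, feats))
```

Because the weights are a softmax over levels at each pixel, positions dominated by one scale can suppress the others, which is how ASFF filters out conflicting information across pyramid levels.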