Abstract

Within convolutional neural networks, convolution is effective at extracting local features but struggles to capture global representations. In Vision Transformers, multi-head self-attention captures long-range feature dependencies but can disrupt local feature details. Motivated by this, we propose HybridNet, a novel lightweight model built on MobileNet-v2 and the Vision Transformer that combines the advantages of CNNs and Transformers. In addition, to strengthen HybridNet's ability to model temporal interactions, we incorporate temporal-channel attention into the network. We conducted experiments on the Kinetics-400, Jester, and EgoGesture datasets to validate the effectiveness of HybridNet. The results show that the lightweight HybridNet achieves 96.3% and 93.9% accuracy on Jester and EgoGesture, respectively, performing close to, or on par with, state-of-the-art methods. Finally, we deploy HybridNet as a real-time gesture recognition model and use its predictions as commands to control robots in a simulation environment, realizing human–robot interaction. Gesture-based interaction between humans and robots improves communication, facilitates physical collaboration, enables non-verbal expression, enhances accessibility, and creates a more engaging user experience, adding intuitiveness and efficiency that make human–robot interaction more dynamic and interactive.
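To illustrate the general idea of coupling local convolutions with global self-attention and a temporal-channel attention gate, the sketch below is a minimal, hypothetical PyTorch example. The module names (InvertedResidual, HybridBlock, TemporalChannelAttention) and all hyperparameters are our own assumptions for illustration, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch (not the authors' code) of a hybrid block that
# fuses MobileNet-v2-style local convolutions with Transformer self-attention,
# plus a temporal-channel attention gate over video features.
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """MobileNet-v2-style inverted residual: expand -> depthwise -> project."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # local features with a residual connection


class HybridBlock(nn.Module):
    """Local convolution followed by global multi-head self-attention."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.local = InvertedResidual(channels)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.local(x)
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C) tokens
        attn_out, _ = self.attn(tokens, tokens, tokens)    # global dependencies
        return x + attn_out.transpose(1, 2).reshape(b, c, h, w)


class TemporalChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate over the joint time-channel axes."""
    def __init__(self, channels, frames, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(frames * channels, frames * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(frames * channels // reduction, frames * channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(3, 4)).flatten(1))   # squeeze spatial dims
        return x * weights.view(b, t, c, 1, 1)             # reweight each (frame, channel)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 32, 56, 56)       # (batch, frames, channels, H, W)
    block = HybridBlock(32)
    tca = TemporalChannelAttention(32, frames=8)
    feats = torch.stack([block(f) for f in clip.unbind(1)], dim=1)
    print(tca(feats).shape)                    # torch.Size([2, 8, 32, 56, 56])
```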
