SqueezeNet and Fusion Network-Based Accurate Fast Fully Convolutional Network for Hand Detection and Gesture Recognition

Baohua Qiang, Yijie Zhai, Yufeng Wang, Mingliang Zhou, Yuanchao Pang, Bo Peng, Xianyi Yang

doi:10.1109/access.2021.3079337

Abstract

Accurate and fast hand detection and gesture recognition remain challenging because hands vary widely in appearance and scenes in color images are complex. To address this problem, we propose a novel SqueezeNet and fusion network-based fully convolutional network (SF-FCNet) that performs hand detection and gesture recognition in color images accurately and quickly. First, we adopt the first 17 layers of the lightweight SqueezeNet as the hand feature extraction network, which greatly compresses the network parameters and thereby accelerates detection and recognition. Second, we design a precise hand prediction fusion network by adding a residual structure to the deconvolutional network to integrate high- and low-level hand features; hand detection and gesture recognition are then performed on a single convolutional layer at multiple scales to improve precision and reduce computational cost. Verification on the Oxford hand dataset shows that SF-FCNet reaches a precision of 84.1% at a speed of 32 FPS. The experimental results show that SF-FCNet substantially improves the precision and speed of hand detection and gesture recognition on three benchmark datasets and generalizes well to a homemade test set.
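To give a rough sense of why a SqueezeNet backbone compresses parameters, the sketch below counts the weights of one Fire module (a 1x1 squeeze convolution followed by parallel 1x1 and 3x3 expand convolutions) against a plain 3x3 convolution with the same input and output channels. The channel sizes are taken from the fire2 module of the original SqueezeNet v1.0 and are an assumption here; the abstract does not state the exact layer configuration used in SF-FCNet, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution layer."""
    return c_in * c_out * k * k + c_out


def fire_params(c_in, squeeze, expand1x1, expand3x3):
    """Parameter count of a Fire module: squeeze 1x1, then expand 1x1 and 3x3."""
    s = conv_params(c_in, squeeze, 1)
    e1 = conv_params(squeeze, expand1x1, 1)
    e3 = conv_params(squeeze, expand3x3, 3)
    return s + e1 + e3


# Assumed sizes (SqueezeNet v1.0 fire2): 96 input channels, 16 squeeze
# filters, 64 + 64 expand filters, giving 128 output channels.
fire = fire_params(96, 16, 64, 64)

# A plain 3x3 convolution mapping the same 96 channels to 128 channels.
plain = conv_params(96, 128, 3)

print("Fire module parameters:", fire)
print("Plain 3x3 conv parameters:", plain)
print("Reduction factor:", round(plain / fire, 1))
```

Under these assumed sizes the Fire module needs roughly one ninth of the parameters of the plain convolution, which is the kind of saving the feature extraction network relies on.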

Highlights

Human hand detection and recognition are regarded as a way for computers to understand human language, enabling people to communicate with machines and interact naturally without any mechanical equipment
To address the above problems, in this study, we investigated hand detection and gesture recognition on the Oxford hand dataset [7], EgoHands dataset [26], and National University of Singapore (NUS) hand posture dataset [8] and proposed a new method named the SqueezeNet and fusion network-based fully convolutional network (SF-FCNet) to accurately and quickly perform hand detection and gesture recognition on images
The feature map with gradually decreasing resolution is obtained by the precise hand prediction fusion network, and the feature map is expanded by the convolutional layer composed of the deconvolution layer and the residual structure
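The expansion step in the last highlight can be illustrated with a minimal single-channel sketch: a stride-2 transposed convolution (deconvolution) doubles the resolution of a high-level feature map so that it can be combined with a lower-level map of matching size. The 2x2 kernel, the all-ones weights, and fusion by element-wise addition are assumptions made for illustration only; the text above does not specify the kernel sizes or the exact fusion operation of SF-FCNet, and all function names are hypothetical.

```python
def deconv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def deconv2x2(fmap, weight):
    """Transposed convolution with a 2x2 kernel and stride 2 (single channel)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            # Each input value is spread over a 2x2 output block.
            for dy in range(2):
                for dx in range(2):
                    out[2 * y + dy][2 * x + dx] += fmap[y][x] * weight[dy][dx]
    return out


def fuse(upsampled, skip):
    """Assumed fusion: element-wise sum of two maps with the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(upsampled, skip)]


high = [[1.0, 2.0], [3.0, 4.0]]            # low-resolution, high-level map
low = [[0.5] * 4 for _ in range(4)]        # higher-resolution, low-level map

up = deconv2x2(high, [[1.0, 1.0], [1.0, 1.0]])
fused = fuse(up, low)

print(deconv_out(2, kernel=2, stride=2))   # side length after upsampling
for row in fused:
    print(row)
```

With kernel size equal to stride, the output blocks do not overlap, so the upsampled map lines up exactly with a feature map of twice the resolution.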

Summary

INTRODUCTION

Human hand detection and recognition are regarded as a way for computers to understand human language, enabling people to communicate with machines and interact naturally without any mechanical equipment. In some conventional methods [7]–[9], handcrafted features such as skin color and image shape are extracted, and hands and gestures are detected and recognized through modeling and a support vector machine (SVM) classifier. These methods usually have great limitations due to the complexity of the hand, the challenge of modeling, and the inability to perform end-to-end training. To address these problems, in this study, we investigated hand detection and gesture recognition on the Oxford hand dataset [7], EgoHands dataset [26], and National University of Singapore (NUS) hand posture dataset [8] and proposed a new method named the SqueezeNet and fusion network-based fully convolutional network (SF-FCNet) to accurately and quickly perform hand detection and gesture recognition on images.

RELATED WORK

LOSS FUNCTION

PROPOSED HAND DETECTION AND GESTURE RECOGNITION ALGORITHM

EXPERIMENTS

Findings

CONCLUSION