Abstract

To address the excessive parameter counts, heavy computation, and difficulty of extracting effective features in isolated-word sign language video recognition, this paper proposes an improved recognition model based on a selective kernel residual network and a temporal convolutional network (SKResNet-TCN). SKResNet uses grouped convolution to save computational cost while dynamically selecting feature information from different receptive fields, improving the model's ability to extract features from video frames; the TCN introduces causal and dilated convolutions, which take full advantage of parallel computation, reduce memory overhead, and capture feature information across consecutive frames. Building on these two networks, we design a hybrid SKResNet-TCN model: we adopt hybrid dilated convolution to mitigate the loss of information between adjacent features caused by dilated convolution, replace adaptive average pooling with adaptive maximum pooling to preserve the salient features of sign language, and use the Mish activation function to improve the generalization ability and accuracy of the model. The model achieves 100% accuracy on the Argentine LSA64 dataset, and the experimental results show that, compared with traditional 3D convolutional networks and long short-term memory (LSTM) networks, it has fewer parameters, lower computational cost, and higher accuracy for sign language recognition, effectively saving both computation and time.
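
To make the temporal components named above concrete, the following is a minimal sketch (not the authors' code) of a causal, hybrid-dilated temporal convolution stack with Mish activations and adaptive maximum pooling. The channel sizes, the dilation schedule (1, 2, 5), and the use of per-frame backbone features as input are illustrative assumptions.

```python
# Minimal sketch of a causal, hybrid-dilated TCN head with Mish and
# adaptive max pooling; sizes and the dilation schedule are assumptions.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1D convolution that left-pads so each output depends only on past frames."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left padding keeps causality
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time); pad only the past (left) side
        return self.conv(nn.functional.pad(x, (self.pad, 0)))


class HybridDilatedTCN(nn.Module):
    """Causal convs with mixed dilation rates (1, 2, 5), so the stacked
    receptive field has no gaps between the frames it covers."""

    def __init__(self, in_ch=512, hidden=256, num_classes=64):
        super().__init__()
        layers, ch = [], in_ch
        for d in (1, 2, 5):  # hybrid dilation schedule (assumed)
            layers += [CausalConv1d(ch, hidden, kernel_size=3, dilation=d),
                       nn.Mish()]  # Mish in place of ReLU, per the abstract
            ch = hidden
        self.tcn = nn.Sequential(*layers)
        # Adaptive *max* pooling keeps the most salient temporal response
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: per-frame features from the CNN backbone, (batch, in_ch, time)
        h = self.tcn(x)
        return self.fc(self.pool(h).squeeze(-1))


# Example: features for a 32-frame clip from a 512-dim frame encoder,
# classified into the 64 signs of LSA64.
logits = HybridDilatedTCN()(torch.randn(2, 512, 32))
print(logits.shape)  # torch.Size([2, 64])
```

The co-prime dilation rates are what distinguishes the hybrid scheme from a fixed exponential schedule: successive layers fill in the frames that a single large dilation would skip over.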
