Abstract

Video-level sign language recognition remains challenging due to factors unrelated to the sign itself and the demands of temporal modeling. This paper constructs a sign language recognition framework based on global-local feature description, proposing a three-dimensional residual global network with an attention layer and a local network based on object detection. The global branch models the temporal dynamics of the whole video: an improved temporal conversion layer explores timing information over different periods and learns video representations at multiple temporal scales. In the local branch, the hands are located by the object detection network to highlight their key role in the overall sign language behavior, which strengthens inter-class differences and complements the global network. Experiments on two well-known Chinese sign language datasets (SLR_Dataset and DEVSIGN_D) show that the proposed method achieves higher recognition accuracy (89.2% and 91%, respectively) and better generalization performance.
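To make the global-local design concrete, below is a minimal sketch of a two-branch model of this kind, assuming PyTorch. All module names, layer sizes, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation; a multi-head attention layer stands in for the paper's improved temporal conversion layer, and hand crops are assumed to come from an external object detector.

```python
# Hypothetical sketch of a global-local sign language recognition model.
# Shapes, layers, and fusion strategy are assumptions for illustration only.
import torch
import torch.nn as nn


class GlobalBranch(nn.Module):
    """3D residual-style global encoder with temporal attention
    (a stand-in for the paper's improved temporal conversion layer)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.BatchNorm3d(feat_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep temporal axis
        )
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, video):  # video: (B, 3, T, H, W)
        x = self.backbone(video).squeeze(-1).squeeze(-1)  # (B, C, T')
        x = x.transpose(1, 2)                             # (B, T', C)
        x, _ = self.attn(x, x, x)                         # attend over time steps
        return x.mean(dim=1)                              # (B, C)


class LocalBranch(nn.Module):
    """Encodes per-frame hand crops (assumed output of a hand detector)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, hand_crops):  # hand_crops: (B, T, 3, h, w)
        b, t = hand_crops.shape[:2]
        x = self.encoder(hand_crops.flatten(0, 1)).flatten(1)  # (B*T, C)
        return x.view(b, t, -1).mean(dim=1)                    # (B, C)


class GlobalLocalSLR(nn.Module):
    """Fuses global video features with local hand features
    for video-level sign classification."""

    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        self.global_branch = GlobalBranch(feat_dim)
        self.local_branch = LocalBranch(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, video, hand_crops):
        fused = torch.cat(
            [self.global_branch(video), self.local_branch(hand_crops)], dim=1
        )
        return self.classifier(fused)


if __name__ == "__main__":
    model = GlobalLocalSLR(num_classes=500)
    video = torch.randn(2, 3, 16, 112, 112)  # full-frame clips
    hands = torch.randn(2, 16, 3, 64, 64)    # detector-cropped hand regions
    print(model(video, hands).shape)         # torch.Size([2, 500])
```

The key design point the abstract describes is the complementarity of the two branches: the global branch captures whole-body temporal context, while the local branch emphasizes hand appearance, and fusing both sharpens category boundaries that the global view alone can blur.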
