Abstract
Noise and fixed, empirically set motion constraints hamper the extraction of distinctive spatiotemporal features when only one or a few samples per gesture class are available. To tackle these problems, an adaptive local spatiotemporal feature (ALSTF) built on fused RGB-D data is proposed. First, motion regions of interest (MRoIs) are adaptively extracted using grayscale and depth velocity variance information, which greatly reduces the impact of noise. Then, within each MRoI, corners are retained as keypoints only if their depth and their grayscale and depth velocities satisfy several adaptive local constraints. This further filtering of noise yields an accurate and sufficient set of keypoints located on the desired moving body parts (MBPs). Finally, four kinds of descriptors are computed and combined in extended gradient and motion spaces to represent the appearance and motion of each gesture. Experimental results on the ChaLearn gesture, CAD-60 and MSRDailyActivity3D datasets demonstrate that the proposed feature outperforms published state-of-the-art approaches under the one-shot learning setting and achieves comparable accuracy under leave-one-out cross-validation.
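The keypoint-selection stage is the core of the pipeline: a corner survives only if the grayscale and depth velocities at its position exceed thresholds derived from the statistics of its surrounding MRoI. The sketch below illustrates that idea with standard OpenCV primitives; the Farneback flow, the Shi-Tomasi corner detector, and the scaling factors `tau_g`/`tau_d` are illustrative stand-ins under my own assumptions, not the adaptive constraints actually derived in the paper.

```python
import cv2
import numpy as np

def detect_gesture_keypoints(gray_prev, gray_curr, depth_prev, depth_curr,
                             mroi_mask, tau_g=0.5, tau_d=0.5):
    """Keep corners inside a motion region of interest (MRoI) only when
    their grayscale and depth velocities exceed thresholds derived from
    the MRoI's own velocity statistics.

    gray_*   : uint8 single-channel frames
    depth_*  : depth frames (e.g. uint16), registered to the gray frames
    mroi_mask: uint8 mask, nonzero inside the MRoI
    tau_g, tau_d: hypothetical scaling factors standing in for the
                  paper's adaptive local constraints
    """
    # Grayscale velocity from dense optical flow (Farneback as a stand-in).
    flow = cv2.calcOpticalFlowFarneback(gray_prev, gray_curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed_g = np.linalg.norm(flow, axis=2)

    # Depth "velocity" approximated by the temporal depth difference.
    speed_d = np.abs(depth_curr.astype(np.float32)
                     - depth_prev.astype(np.float32))

    # Adaptive thresholds from velocity statistics inside this MRoI.
    inside = mroi_mask > 0
    thr_g = tau_g * speed_g[inside].std()
    thr_d = tau_d * speed_d[inside].std()

    # Shi-Tomasi corners restricted to the MRoI.
    corners = cv2.goodFeaturesToTrack(gray_curr, maxCorners=500,
                                      qualityLevel=0.01, minDistance=5,
                                      mask=mroi_mask)
    keypoints = []
    if corners is not None:
        for x, y in corners.reshape(-1, 2).astype(int):
            # A corner survives only if both velocity constraints hold;
            # the paper additionally constrains the depth value itself,
            # which is omitted in this sketch.
            if speed_g[y, x] > thr_g and speed_d[y, x] > thr_d:
                keypoints.append((x, y))
    return keypoints
```

Deriving the thresholds from per-MRoI statistics rather than fixed global constants is what makes the constraints "adaptive": fast, large motions and slow, subtle ones are filtered against their own local baseline.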
Highlights
The proposed method is compared with current state-of-the-art approaches on the 2011/2012 ChaLearn One-shot-learning Gesture Dataset (CGD) [21], the Cornell Activity Dataset (CAD-60), and the MSRDailyActivity3D dataset.
The Mean Levenshtein Distance (MLD) scores are calculated with different values of γ ∈ {1, 2, 3, 4} and η ∈ {2, 4, 8, 16} on the development batches of the CGD.
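For reference, the MLD score used in the ChaLearn challenge is the Levenshtein (edit) distance between the predicted and the true gesture-label sequences of each test video, summed over a batch and divided by the total number of true gestures, so lower is better. A minimal sketch of that metric follows; the function names are mine, not taken from the CGD evaluation toolkit.

```python
def levenshtein(pred, truth):
    """Edit distance between two gesture-label sequences (dynamic programming)."""
    m, n = len(pred), len(truth)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of pred[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of truth[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def mean_levenshtein_distance(predictions, truths):
    """Sum of per-video edit distances divided by the total number of
    true gesture labels in the batch (0 means perfect recognition)."""
    total = sum(levenshtein(p, t) for p, t in zip(predictions, truths))
    return total / sum(len(t) for t in truths)
```

For example, `mean_levenshtein_distance([[1, 2], [3]], [[1, 4], [3]])` returns 1/3: one substitution error against three true gestures.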
Summary
Vision-based gesture recognition is a critical interface for non-intrusive human-robot interaction (HRI) systems, for which many natural and convenient recognition methods have been proposed [1,2,3]. In a traditional HRI system, a user usually needs to memorize a predefined gesture language or a set of instructions before interaction. When a user wants to define and use arbitrary interaction gestures, existing HRI systems require labour-intensive and time-consuming procedures to collect and label all the training samples, because most traditional methods based on supervised learning need many training samples and long training times. All of these factors make a typical HRI system ill-suited to adding user-defined gestures and motivate the current research on one-shot learning gesture recognition to solve the problems mentioned above.