Abstract
Human pose tracking is a challenging task that involves estimating human poses and associating them across frames in a video sequence. In recent years, deep learning-based methods have made significant progress in this field, achieving state-of-the-art performance. However, missed detections and incorrect association matching remain challenging problems due to complex backgrounds and occlusion among people. To address these issues, we adopt a top-down framework for human pose tracking in this paper. We propose a human detection prediction recovery module (HDP module) to recover missed detections, and a dual-stream fusion Siamese network for human matching (DFSTrack). Specifically, we design a residual graph convolutional block (RGCN block) for spatial position encoding of human keypoints, and use spatial self-attention and temporal cross-attention to build a dual-stream spatial–temporal fusion transformer (DST Transformer). The graph convolutional block and the transformer are cascaded to jointly capture the spatial and temporal positions of human keypoints, allowing the Siamese network to reduce erroneous human matching. Experimental results on the PoseTrack17, PoseTrack18, and PoseTrack21 datasets demonstrate that our proposed method outperforms state-of-the-art methods on human pose tracking tasks. Our code and pretrained models are available at https://github.com/yhtian2023/DFSTrack.
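To make the cascaded design concrete, below is a minimal PyTorch sketch of the two building blocks named in the abstract: a residual graph convolutional block over the keypoint skeleton, followed by a dual-stream block combining spatial self-attention with temporal cross-attention. All class names, dimensions, and the skeleton adjacency are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

```python
# Hedged sketch of the RGCN block + DST Transformer cascade described in the
# abstract. Module names, feature sizes, and the adjacency matrix below are
# assumptions for illustration only.
import torch
import torch.nn as nn


class ResidualGCNBlock(nn.Module):
    """One graph convolution over the keypoint skeleton with a residual path."""

    def __init__(self, dim, adj):
        super().__init__()
        # Normalized adjacency of the keypoint graph (assumed fixed).
        self.register_buffer("adj", adj)
        self.linear = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, joints, dim)
        h = self.adj @ self.linear(x)         # aggregate neighboring joints
        return self.norm(x + torch.relu(h))   # residual connection


class DSTBlock(nn.Module):
    """Dual-stream block: spatial self-attention + temporal cross-attention."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, cur, ref):
        # Spatial stream: self-attention among joints of the current pose.
        s, _ = self.spatial_attn(cur, cur, cur)
        cur = self.norm1(cur + s)
        # Temporal stream: current joints attend to a reference-frame pose.
        t, _ = self.temporal_attn(cur, ref, ref)
        return self.norm2(cur + t)


if __name__ == "__main__":
    J, D = 15, 64                       # 15 keypoints, 64-d features (assumed)
    adj = torch.eye(J)                  # placeholder skeleton adjacency
    rgcn, dst = ResidualGCNBlock(D, adj), DSTBlock(D)
    cur = rgcn(torch.randn(2, J, D))    # current detection
    ref = rgcn(torch.randn(2, J, D))    # tracked pose from a previous frame
    fused = dst(cur, ref)               # fused embedding for Siamese matching
    print(fused.shape)                  # torch.Size([2, 15, 64])
```

In this reading, the Siamese network would compare such fused embeddings between a current detection and each existing track to score association candidates.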