Abstract
This paper presents a new method for multi-person pose estimation from a single image that combines the traditional bottom-up and top-down approaches. Specifically, we extract features from the input image with a residual network and use a multi-stage CNN to learn both the confidence maps of joints and the connection relationships between joints. During testing, we run the network forward pass in a bottom-up manner, and then use the predicted confidence maps, the connection relationships, and the corresponding bounding boxes to parse the poses of all people in a top-down manner. In contrast to previous top-down methods, our method is robust to bounding box shift and tightness, works well for heavily overlapping people, and runs faster. In contrast to bottom-up methods, our method avoids propagating errors across different people and handles disconnected joints effectively. To estimate human pose from videos, we impose a weight-sharing scheme on the multi-stage CNN and rewrite it as a recurrent neural network. We can thus reuse the prediction results from previous frames to reduce the total number of stages, which makes invoking the network on videos significantly faster. We also adopt LSTM units between frames to capture the temporal correlation among video frames. We find that the LSTM handles input-quality degradation in videos well and successfully stabilizes the sequential outputs.
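The video extension described above can be illustrated with a toy sketch: because every stage shares the same weights, running several stages amounts to unrolling one recurrent cell, and a later frame can start from the previous frame's prediction and use fewer stages. Everything in the block is an assumption made for illustration. The `refine` update is a simple blend that stands in for a CNN stage, `alpha` stands in for the shared weights, and the stage counts are arbitrary. It is not the network from the paper, and the LSTM units are omitted.

```python
from typing import List, Optional

Map = List[float]  # a flattened toy "confidence map" (hypothetical representation)


def refine(features: Map, belief: Map, alpha: float = 0.5) -> Map:
    """One refinement stage. `alpha` plays the role of the shared stage weights."""
    return [(1 - alpha) * b + alpha * f for f, b in zip(features, belief)]


def run_stages(features: Map, belief: Map, n_stages: int, alpha: float = 0.5) -> Map:
    """Apply the same stage n_stages times, i.e. unroll one recurrent cell."""
    for _ in range(n_stages):
        belief = refine(features, belief, alpha)
    return belief


def process_video(frames: List[Map], first_stages: int = 6, later_stages: int = 2) -> List[Map]:
    """Full stage count on the first frame, then warm-start from the previous output."""
    outputs: List[Map] = []
    prev: Optional[Map] = None
    for features in frames:
        if prev is None:
            # No earlier prediction: start from an empty belief and run all stages.
            belief = run_stages(features, [0.0] * len(features), first_stages)
        else:
            # Reuse the previous frame's prediction, so fewer stages are run.
            belief = run_stages(features, prev, later_stages)
        outputs.append(belief)
        prev = belief
    return outputs


def max_error(pred: Map, target: Map) -> float:
    """Largest absolute difference between a prediction and the toy target."""
    return max(abs(p - t) for p, t in zip(pred, target))
```

In this toy setting, when consecutive frames are similar, two warm-started stages on the second frame land closer to the frame's evidence than two stages started from an empty belief, which is the intuition behind reducing the stage count on videos.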