Abstract

This paper presents a new method for single-image, multi-person pose estimation that combines the traditional bottom-up and top-down approaches. Specifically, we extract features from the input image with a residual network and use a multi-stage CNN to learn both the confidence maps of joints and the connection relationships between joints. During testing, we run the network forward pass in a bottom-up manner, and then use the predicted confidence maps, the connection relationships, and the corresponding bounding boxes to parse the poses of all people in a top-down manner. In contrast to previous top-down methods, our method is robust to bounding box shift and tightness, works well for heavily overlapping people, and runs faster. In contrast to bottom-up methods, our method avoids error propagation across different people and handles disconnected joints effectively. To estimate human pose from videos, we impose a weight-sharing scheme on the multi-stage CNN and rewrite it as a recurrent neural network. We can thus reuse the prediction results from previous frames to reduce the total number of stages, which makes invoking the network on videos significantly faster. We also adopt LSTM units between frames to capture the temporal correlation among video frames. We find that the LSTM handles input-quality degradation in videos well and successfully stabilizes the sequential outputs.
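
The weight-sharing idea for video can be illustrated with a small sketch. This is not the paper's network: it assumes a toy per-pixel linear stage in place of convolutions, omits the LSTM units, and uses hypothetical names (`make_shared_stage`, `refine`, `run_video`) and hypothetical stage counts (6 for the first frame, 2 for later frames). It only shows how reusing one set of stage weights lets later frames start from the previous frame's confidence maps and run fewer stages.

```python
import numpy as np


def make_shared_stage(num_feat, num_joints, seed=0):
    """Build one refinement stage whose weights are reused at every step.

    Assumption: a per-pixel linear map stands in for the convolutional stage.
    """
    rng = np.random.default_rng(seed)
    w_feat = rng.normal(scale=0.1, size=(num_joints, num_feat))
    w_prev = rng.normal(scale=0.1, size=(num_joints, num_joints))

    def stage(features, prev_maps):
        # features: (num_feat, H, W); prev_maps: (num_joints, H, W)
        logits = (np.einsum("jf,fhw->jhw", w_feat, features)
                  + np.einsum("jk,khw->jhw", w_prev, prev_maps))
        return 1.0 / (1.0 + np.exp(-logits))  # confidence values in (0, 1)

    return stage


def refine(stage, features, init_maps, num_stages):
    """Apply the same stage repeatedly, i.e. unroll it as a recurrence."""
    maps = init_maps
    for _ in range(num_stages):
        maps = stage(features, maps)
    return maps


def run_video(stage, frame_features, num_joints, first_stages=6, later_stages=2):
    """Process frames in order, seeding each frame with the previous output.

    The stage counts are hypothetical defaults chosen for illustration.
    """
    outputs, total_calls, prev = [], 0, None
    for feats in frame_features:
        if prev is None:
            # First frame: no prior prediction, so start from empty maps.
            init = np.zeros((num_joints,) + feats.shape[1:])
            n = first_stages
        else:
            # Later frames: reuse the previous frame's maps, run fewer stages.
            init = prev
            n = later_stages
        prev = refine(stage, feats, init, n)
        total_calls += n
        outputs.append(prev)
    return outputs, total_calls
```

Under these assumptions, a three-frame clip costs 6 + 2 + 2 stage evaluations instead of the 18 that would be needed if every frame were refined from scratch.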
