Abstract
A novel algorithm to estimate instance-level future motion (FM) in a single image is proposed in this paper. First, the FM of an instance is defined with its direction, speed, and action classes. Then, a deep neural network, called FM-Net, is developed to determine the FM of the instance. More specifically, the multi-context pooling layer is proposed to exploit both object and global context features, and the cyclic ordinal regression scheme is developed using binary classifiers for effective FM classification. Also, the proposed FM-Net is trained in a semi-supervised domain adaptation setting to obtain reliable FM estimation results, even when a source domain in the training process and a target domain in the inference process are different. Extensive experimental results demonstrate that the proposed algorithm provides remarkable performance and thus can be used effectively for computer vision applications, including single object tracking, multiple object tracking, and crowd analysis. Furthermore, the FM dataset, collected from diverse sources and annotated manually, is released as a benchmark for single-image FM estimation.
Highlights
Human perception can forecast motion accurately, even from a single static image
The proposed semi-supervised domain adaptation learning is shown to improve future motion (FM) estimation accuracy when only a limited amount of labeled data is available for a new domain
FM-Net is proposed by incorporating the multi-context pooling (MCP) layer into DenseNet-121 and developing the cyclic ordinal regression (COR) scheme for future direction classification
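The text states only that COR is built from binary classifiers for direction classification; it does not give the encoding. The sketch below shows one common way to set up ordinal regression over cyclic labels and should not be read as FM-Net's actual scheme. The number of direction classes (`K = 8`), the half-circle encoding, the agreement-based decoding, and the example probabilities are all assumptions made for illustration.

```python
K = 8  # assumed number of direction classes, evenly spaced on a circle


def encode(direction, k=K):
    """Cyclic ordinal code (illustrative): bit i is 1 when `direction`
    lies in the half-circle of k // 2 classes starting at class i."""
    return [1 if (direction - i) % k < k // 2 else 0 for i in range(k)]


def decode(probs, k=K):
    """Pick the class whose code agrees best with the binary outputs."""
    def agreement(candidate):
        code = encode(candidate, k)
        # A confident "1" supports bits set to 1; a confident "0" supports bits set to 0.
        return sum(p if bit else 1.0 - p for p, bit in zip(probs, code))
    return max(range(k), key=agreement)


# Hypothetical outputs of the K binary classifiers (a noisy code for class 2).
noisy = [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.3, 0.6]
predicted = decode(noisy)
```

With this kind of encoding, neighboring directions share most of their bits, so a single unreliable binary classifier tends to shift the prediction to an adjacent direction at most. The encoding also has no artificial break between the last class and the first.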
Summary
Human perception can forecast motion accurately, even from a single static image. Ma et al. [13] develop another trajectory estimation method for multiple pedestrians based on game theory. These instance-level algorithms [11]–[13], [30] estimate long-term FM, but they require additional information, such as past frames [11] or starting and end points [12], [13], [30]. Various computer vision applications have exploited semi-supervised learning methods to reduce expensive labeling efforts. They include 3D human pose estimation [50], 3D hand pose estimation [51], deraining [52], scene parsing [53], multi-view keypoint detection [54], object detection [55], and skin detection in a single human portrait image [56]. The 3-way classification of the action is performed using an FC layer and a softmax layer, since there is no ordinal relation among the action classes of ‘sidewalk,’ ‘crosswalk,’ and ‘jaywalk.’
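The action head described above (an FC layer followed by softmax over three unordered classes) can be illustrated with a minimal sketch. The feature vector, weights, and bias below are hypothetical toy values, and the helper names are made up for this example. In FM-Net the input features would come from the network backbone, which is not reproduced here.

```python
import math

ACTIONS = ["sidewalk", "crosswalk", "jaywalk"]  # class names from the text


def fc_layer(features, weights, bias):
    """Fully connected layer: one logit per action class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over the class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_action(features, weights, bias):
    """Return the most probable action and the full distribution."""
    probs = softmax(fc_layer(features, weights, bias))
    best = max(range(len(ACTIONS)), key=probs.__getitem__)
    return ACTIONS[best], probs


# Hypothetical toy inputs (4-dimensional feature, 3 x 4 weight matrix).
features = [0.5, -1.0, 2.0, 0.1]
weights = [[0.2, 0.1, -0.3, 0.0],
           [-0.1, 0.4, 0.5, 0.2],
           [0.3, -0.2, 0.1, -0.4]]
bias = [0.0, 0.1, -0.1]
label, probs = classify_action(features, weights, bias)
```

Softmax treats the three classes symmetrically, which fits the text's point that the action classes have no ordinal relation. The direction classes do have an order, which is why an ordinal scheme is used for them instead.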