Abstract
Human actions can be recognized in still images because the whole image conveys an action through spatial clues such as human poses, action-specific parts, and global surroundings. To represent these spatial clues, recent methods rely on labor-intensive annotations to locate the human body and objects, and they are computationally intensive. To eliminate this strong supervision, a Multiple Spatial Clues Network (MSCNet) is proposed that represents the spatial clues using only image-level action labels. Neither accurate manually annotated bounding boxes nor extra labeled datasets are required as additional supervision. First, the proposed MSCNet uses a spatial-attention module to generate spatial attention regions and detects the spatial clues with minimal supervision. Then, a spatial clues exploitation stage uses the learned spatial clues through three modules: the context module, the body + context module, and the body + semantics module. Experiments on three benchmark datasets demonstrate the effectiveness of the proposed MSCNet.
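To make the idea of deriving a spatial attention region without box annotations concrete, the following is a minimal sketch and not the MSCNet spatial-attention module itself, whose design the abstract does not specify. It assumes a generic heuristic: average a convolutional feature map over channels, normalize the result to [0, 1], and take the tightest box around cells above a threshold. The names `attention_map` and `attention_region`, the nested-list feature layout, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: channels x height x width, as nested lists.
FeatureMap = List[List[List[float]]]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def attention_map(features: FeatureMap) -> List[List[float]]:
    """Average over channels, then min-max normalize to [0, 1].

    Assumed heuristic for illustration only; not the paper's module.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    avg = [
        [sum(features[c][y][x] for c in range(channels)) / channels
         for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in avg)
    hi = max(max(row) for row in avg)
    span = hi - lo
    if span == 0:
        # A constant map carries no spatial preference.
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / span for v in row] for row in avg]


def attention_region(attn: List[List[float]],
                     threshold: float = 0.5) -> Optional[Box]:
    """Return the tightest box around cells whose attention >= threshold."""
    hits = [(x, y)
            for y, row in enumerate(attn)
            for x, v in enumerate(row)
            if v >= threshold]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # Toy 2-channel, 3x4 feature map with activations in the middle row.
    toy = [
        [[0.0, 0.0, 0.0, 0.0], [0.0, 4.0, 4.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
        [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    ]
    print(attention_region(attention_map(toy)))
```

In a trained network the feature map would come from a backbone supervised only by the image-level action label, so the region is a by-product of classification rather than of box annotations; the threshold controls how tightly the region hugs the strongest responses.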