Abstract

How do humans localize an unintentional action such as "a boy falls down while skateboarding"? Cognitive science shows that an 18-month-old infant can understand intention by observing actions and comparing the resulting feedback. Motivated by this evidence, we propose a causal inference approach that constructs a video pool containing intentional knowledge, conducts counterfactual intervention to observe intentional action, and compares the unintentional action with the intentional one to achieve localization. Specifically, we first build a video pool in which each video contains the same action content as an original unintentional-action video. We then conduct counterfactual intervention to generate counterfactual examples, and train the model by maximizing the difference between the predictions for the factual unintentional action and the counterfactual intentional action. By disentangling the effects of different cues on the model prediction, we encourage the model to highlight the intention cue and alleviate the negative effect brought by the training bias of the action-content cue. We evaluate our approach on a public unintentional-action dataset and achieve consistent improvements on both unintentional action recognition and localization tasks.
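
To make the training idea concrete, below is a minimal PyTorch-style sketch of a counterfactual contrast objective: it supervises the prediction on the factual (unintentional) clip and pushes it apart from the prediction on a counterfactual (intentional) clip drawn from the video pool. All names (`model`, `factual_clip`, `counterfactual_clip`, `labels`, `lambda_contrast`) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def counterfactual_contrast_loss(model, factual_clip, counterfactual_clip,
                                 labels, lambda_contrast=1.0):
    """Hedged sketch of a counterfactual training objective.

    factual_clip:        original unintentional-action video clip.
    counterfactual_clip: clip from the video pool with the same action
                         content but performed intentionally (the
                         counterfactual intervention).
    The loss (i) supervises the factual prediction with the ground-truth
    labels and (ii) pushes the factual and counterfactual predictions
    apart, encouraging the model to rely on the intention cue rather
    than the shared action content.
    """
    factual_logits = model(factual_clip)               # prediction on unintentional action
    counterfactual_logits = model(counterfactual_clip) # prediction after intervention

    # Standard supervised loss on the factual (unintentional) sample.
    cls_loss = F.cross_entropy(factual_logits, labels)

    # Penalize overlap between the two predictive distributions so that
    # factual and counterfactual predictions are driven apart.
    p_fact = F.softmax(factual_logits, dim=-1)
    p_cf = F.softmax(counterfactual_logits, dim=-1)
    contrast_loss = (p_fact * p_cf).sum(dim=-1).mean()

    return cls_loss + lambda_contrast * contrast_loss
```

One reasonable design choice here is weighting the contrast term with `lambda_contrast`, so the supervised signal on the factual clip remains dominant while the counterfactual comparison acts as a debiasing regularizer.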
