Abstract
This thesis addresses video-based multi-person, multi-label, spatiotemporal action detection and recognition. This is a challenging problem because each person can be performing several actions at the same time (e.g. talking and walking), while other actors simultaneously perform different actions. We claim that these are problems where the use of contextual information (e.g. semantic descriptions of the scene) may lead to significant performance improvements. In this work, we develop several approaches to tackle this problem and validate them on challenging datasets. We propose a framework to integrate and test multiple sources of contextual information in video-based multi-person, multi-label, spatiotemporal action detection and recognition.

We highlight six contributions, collected in three publications (at different stages of publication at the time of this writing). The first contribution is a Multisource Video Classification (MVC) framework that allows the combination of several sources of context information, of which we consider four types: actor-centric input filtering (a way to focus attention on the actor under analysis while still gathering appearance information from the neighborhood), semantic neighbor context (a way to inform the model of the actions performed by nearby agents), object detection (how objects interacting with the actor can inform about its action), and pose data (how high-level features extracted from the actor can help the classification process). The second contribution is a foveated approach to actor-centric filtering for input selection that weights the appearance information in a decreasing way from the center to the periphery of the actor bounding box. The third contribution is a novel encoding for the semantic neighbor context and its custom classifier with spatial and temporal dependence. The fourth is a custom Hybrid Sigmoid-Softmax loss function for the multi-class/multi-label case, which combines the cross-entropy loss typical of multi-class problems with the sum-of-sigmoids loss used in the multi-label case. The fifth is the application of the developed methods to a challenging dataset with a large number of videos with multiple agents performing multiple actions, with 80 heterogeneous and highly unbalanced classes. To enable research with reasonable computing power, we created mini-AVA, a partition of AVA that maintains temporal continuity and class distribution with only one-tenth of the dataset size. The sixth contribution is a collection of ablation studies on alternative actor-centric filters and semantic neighbor context classifiers.

With this research we achieve a relative mAP improvement of 18.8% using our foveated actor-centric filtering, a relative mAP improvement of 5% using our semantic neighbor context embedding and models, and a relative mAP improvement of 12.6% using our custom Hybrid Sigmoid-Softmax loss.
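To make the foveated actor-centric filtering concrete, the following is a minimal sketch of one way such a center-to-periphery weighting could be implemented, assuming a 2-D Gaussian falloff centered on the actor bounding box. The function name `foveated_filter`, the Gaussian form, and the `sigma_scale` parameter are illustrative assumptions, not the exact formulation used in the thesis.

```python
import torch

def foveated_filter(frame, box, sigma_scale=0.5):
    """Weight pixel intensities with a 2-D Gaussian centered on the
    actor bounding box, so appearance fades from the actor toward the
    periphery while neighborhood context is still partially visible.

    frame: (C, H, W) float tensor; box: (x1, y1, x2, y2) in pixels.
    The Gaussian falloff and sigma_scale are illustrative choices,
    not necessarily the weighting used in the thesis.
    """
    _, H, W = frame.shape
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Scale the fovea spread with the box size (assumption).
    sx = sigma_scale * max(x2 - x1, 1.0)
    sy = sigma_scale * max(y2 - y1, 1.0)
    ys = torch.arange(H, dtype=frame.dtype).view(H, 1)
    xs = torch.arange(W, dtype=frame.dtype).view(1, W)
    # Separable squared distances from the box center, normalized per axis.
    weights = torch.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
    return frame * weights  # (H, W) mask broadcasts over channels
```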
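The Hybrid Sigmoid-Softmax loss can be illustrated with a short sketch. In datasets such as AVA, some classes (e.g. pose actions) are mutually exclusive while others are genuinely multi-label, so one plausible combination applies softmax cross-entropy to the exclusive group and sum-of-sigmoids binary cross-entropy to the rest. The class partition (`softmax_idx`, `sigmoid_idx`) and the weighting factor `alpha` are assumptions for illustration; the thesis defines its own combination.

```python
import torch
import torch.nn.functional as F

def hybrid_sigmoid_softmax_loss(logits, targets, softmax_idx, sigmoid_idx, alpha=1.0):
    """Combine softmax cross-entropy over a mutually exclusive class
    group with sum-of-sigmoids BCE over the remaining labels.

    logits, targets: (N, num_classes); targets are multi-hot in {0, 1}.
    softmax_idx / sigmoid_idx: long tensors partitioning the classes.
    """
    # Cross-entropy branch: exactly one of the softmax classes is active,
    # so the multi-hot rows reduce to a single class index via argmax.
    sm_logits = logits[:, softmax_idx]
    sm_targets = targets[:, softmax_idx].argmax(dim=1)
    ce = F.cross_entropy(sm_logits, sm_targets)

    # Sum-of-sigmoids branch: independent binary decision per label.
    sg_logits = logits[:, sigmoid_idx]
    sg_targets = targets[:, sigmoid_idx].float()
    bce = F.binary_cross_entropy_with_logits(sg_logits, sg_targets)

    # Relative weighting of the two branches is an assumption here.
    return ce + alpha * bce
```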