Abstract
Automatically understanding and describing the visual content of videos in natural language is a challenging task in computer vision. Existing approaches are often designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video. This work presents OSVidCap, a novel open-set video captioning framework that recognizes and describes, in natural language, concurrent known actions and deals with unknown ones. OSVidCap is based on the encoder-decoder framework and uses an object-detection-and-tracking mechanism followed by a background-blurring method to focus on specific targets in a video. Additionally, we employ the TI3D Network with the Extreme Value Machine (EVM), which learns representations and recognizes unknown actions. We evaluate the proposed approach on the benchmark ActivityNet Captions dataset. We also propose an enhanced version of the LIRIS human activity dataset, providing descriptions for each action, as well as spatial, temporal, and caption annotations for previously unlabeled actions in the dataset - considered unknown actions in our experiments. Experimental results show our method's effectiveness in recognizing and describing concurrent actions in natural language and its strong ability to deal with detected unknown activities. Based on these results, we believe the proposed approach can be helpful for many real-world applications, including human behavior analysis, safety monitoring, and surveillance.
Highlights
Video understanding is a challenging issue in computer vision
THEORETICAL ASPECTS: This section presents the fundamentals of the methods used in our open-set recognition module: the Extreme Value Machine and the Triplet Inflated 3D Neural Network
METHODS: We present the OSVidCap framework for video captioning
Summary
Video understanding is a challenging issue in computer vision. It requires sophisticated techniques to process the diversity of human and object appearances across different environments and their relationships over time. Unlike closed-set classifiers, which assign infinite regions of the feature space to the training classes, open-set classifiers enclose each class in the feature space and reserve space for new classes to emerge. This strategy allows data from previously unknown classes to be rejected, instead of being wrongly assigned the class label with the highest probability value [10]. Following this idea, a video captioning approach in an open-set scenario can adequately describe known actions and deal with unknown ones. This work presents a novel open-set video captioning framework that describes, in natural language, single and concurrent events occurring in a video. We propose this framework, OSVidCap, to recognize and describe concurrent actions/activities performed by humans in an open-set scenario.
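The rejection idea behind open-set classifiers can be illustrated with a minimal distance-threshold sketch: each known class is enclosed by a bounded region around its centroid, and samples falling outside every region are rejected as "unknown". This is a simplified stand-in for the EVM used in the paper (which fits extreme-value models to margin distances rather than using a fixed radius); the function names and the threshold value here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fit_centroids(features, labels):
    """Compute one centroid per known class from training features."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def open_set_predict(x, centroids, threshold):
    """Assign the nearest known class, or reject the sample as 'unknown'
    if it falls outside every class's enclosed region (approximated here
    by a fixed distance threshold around each centroid)."""
    dists = {c: np.linalg.norm(x - mu) for c, mu in centroids.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= threshold else "unknown"

# Toy 2-D feature space with two known action classes.
feats = np.array([[0.0, 0.0], [0.1, -0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array(["walk", "walk", "wave", "wave"])
cents = fit_centroids(feats, labels)

print(open_set_predict(np.array([0.05, 0.0]), cents, threshold=1.0))  # near 'walk'
print(open_set_predict(np.array([-3.0, 8.0]), cents, threshold=1.0))  # far from both: 'unknown'
```

A closed-set classifier would be forced to label the second sample "walk" or "wave"; the bounded regions are what make rejection possible.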