Abstract

Automatically understanding and describing the visual content of videos in natural language is a challenging task in computer vision. Existing approaches are often designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video. This work presents OSVidCap, a novel open-set video captioning framework that recognizes and describes, in natural language, concurrent known actions and deals with unknown ones. OSVidCap is based on the encoder-decoder framework and uses an object detection-and-tracking mechanism followed by a background-blurring method to focus on specific targets in a video. Additionally, we employ the TI3D Network with the Extreme Value Machine (EVM) to learn representations and recognize unknown actions. We evaluate the proposed approach on the benchmark ActivityNet Captions dataset. We also propose an enhanced version of the LIRIS human activity dataset, providing a description for each action, along with spatial, temporal, and caption annotations for previously unlabeled actions, which are treated as unknown actions in our experiments. Experimental results show the effectiveness of our method in recognizing and describing concurrent actions in natural language, as well as its strong ability to deal with detected unknown activities. Based on these results, we believe the proposed approach can be helpful for many real-world applications, including human behavior analysis, safety monitoring, and surveillance.
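To make the pipeline description above concrete, the sketch below mirrors its stages as plain Python placeholders: object detection and tracking, background blurring, TI3D feature extraction, EVM-based open-set recognition, and caption decoding. It is a structural illustration only, not the authors' implementation; every name (Tracklet, blur_background, TI3DEncoder, EVMRecognizer, CaptionDecoder, osvidcap_like_pipeline) and all numeric logic are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Tracklet:
    """One detected-and-tracked target (a spatio-temporal tube) in the video."""
    frames: np.ndarray  # (T, H, W, 3) clip focused on the target
    mask: np.ndarray    # (T, H, W) foreground mask for the target


def blur_background(frames: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the target region and suppress the rest of the scene
    (toy stand-in for the background-blurring step)."""
    return frames * mask[..., None]


class TI3DEncoder:
    """Stand-in for the Triplet Inflated 3D network that embeds an action clip."""
    def encode(self, frames: np.ndarray) -> np.ndarray:
        return frames.mean(axis=(0, 1, 2))  # toy feature; the real model is a 3D CNN


class EVMRecognizer:
    """Stand-in for the Extreme Value Machine: a known label is accepted only if
    its probability of inclusion exceeds a threshold; otherwise the sample is
    rejected as unknown."""
    def __init__(self, num_known: int = 5, threshold: float = 0.5):
        self.num_known = num_known
        self.threshold = threshold

    def recognize(self, feature: np.ndarray) -> str:
        psi = np.random.rand(self.num_known)  # placeholder for fitted Weibull models
        if psi.max() < self.threshold:
            return "unknown"
        return f"known_action_{int(psi.argmax())}"


class CaptionDecoder:
    """Stand-in for the encoder-decoder captioning head."""
    def describe(self, feature: np.ndarray, label: str) -> str:
        if label == "unknown":
            return "A person performs an unknown activity."
        return f"A person performs the activity '{label}'."


def osvidcap_like_pipeline(tracklets: List[Tracklet]) -> List[str]:
    """Produce one caption per concurrent tracked target."""
    encoder, recognizer, decoder = TI3DEncoder(), EVMRecognizer(), CaptionDecoder()
    captions = []
    for t in tracklets:
        focused = blur_background(t.frames, t.mask)
        feature = encoder.encode(focused)
        label = recognizer.recognize(feature)
        captions.append(decoder.describe(feature, label))
    return captions


if __name__ == "__main__":
    clip = Tracklet(frames=np.zeros((16, 64, 64, 3)), mask=np.ones((16, 64, 64)))
    print(osvidcap_like_pipeline([clip]))
```

Processing each tracked target independently is what allows concurrent actions in the same video to receive separate captions.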

Highlights

  • Video understanding is a challenging issue in computer vision

  • THEORETICAL ASPECTS: the open-set recognition module builds on the Extreme Value Machine (EVM) and the Triplet Inflated 3D (TI3D) Neural Network

  • METHODS: the OSVidCap framework for open-set video captioning is presented


Summary

INTRODUCTION

Video understanding is a challenging issue in computer vision. It requires sophisticated techniques to process the diversity of human and object appearances in different environments and their relationships over time. Open-set classifiers perform classification by enclosing each class in a bounded region of the feature space and reserving room for new classes to emerge, unlike closed-set classifiers, which assign unbounded regions to the training classes. This strategy allows data from previously unknown classes to be rejected instead of being wrongly assigned the known label with the highest probability value [10]. Following this idea, a video captioning approach in an open-set scenario can adequately describe known actions and deal with unknown ones. This work presents a novel open-set video captioning framework that aims to describe, in natural language, single and concurrent events occurring in a video. Specifically, we propose OSVidCap, a framework that recognizes and describes concurrent actions/activities performed by humans in an open-set scenario.
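As a concrete illustration of this rejection strategy, the following minimal Python sketch implements the core idea behind an EVM-style open-set decision rule: each known class keeps a few exemplars, each with a Weibull-shaped radial probability of inclusion, and a query whose best inclusion probability falls below a threshold is labeled unknown. This is a didactic sketch, not the authors' EVM implementation; the function names, the fixed (lam, k) parameters, and the 2-D toy features are assumptions introduced here for illustration (the real EVM fits these parameters with extreme value theory on margin distances between classes).

```python
import numpy as np


def psi(query: np.ndarray, exemplar: np.ndarray, lam: float = 1.0, k: float = 2.0) -> float:
    """Weibull-shaped probability that `query` belongs to the exemplar's class."""
    d = np.linalg.norm(query - exemplar)
    return float(np.exp(-((d / lam) ** k)))


def open_set_predict(query: np.ndarray, class_exemplars: dict, threshold: float = 0.5) -> str:
    """Return the best-matching known label, or 'unknown' if no class encloses the query."""
    best_label, best_score = "unknown", 0.0
    for label, exemplars in class_exemplars.items():
        score = max(psi(query, e) for e in exemplars)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else "unknown"


if __name__ == "__main__":
    known = {
        "handshake": [np.array([0.0, 0.0]), np.array([0.2, 0.1])],
        "enter_room": [np.array([3.0, 3.0])],
    }
    print(open_set_predict(np.array([0.1, 0.1]), known))    # near 'handshake' exemplars
    print(open_set_predict(np.array([10.0, -8.0]), known))  # far from everything -> 'unknown'
```

In the proposed framework, this accept-or-reject decision is applied to the features of each tracked target, so known actions proceed to the captioning stage while rejected ones are reported as unknown.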

RELATED WORKS
THE EXTREME VALUE MACHINE
METHODS
CAPTION GENERATION
IMPLEMENTATION DETAILS
1) Experiments on the LIRIS Dataset
2) Experiments on the ActivityNet Captions Dataset
EVALUATION METRICS
QUANTITATIVE RESULTS
CONCLUSION AND FUTURE WORKS
