Abstract
Automatically understanding and describing the visual content of videos in natural language is a challenging task in computer vision. Existing approaches are often designed to describe single events in a closed-set setting. However, in real-world scenarios, concurrent activities and previously unseen actions may appear in a video. This work presents OSVidCap, a novel open-set video captioning framework that recognizes and describes, in natural language, concurrent known actions and deals with unknown ones. OSVidCap is based on the encoder-decoder framework and uses an object-detection-and-tracking mechanism followed by a background-blurring method to focus on specific targets in a video. Additionally, we employ the TI3D Network with the Extreme Value Machine (EVM), which learns representations and recognizes unknown actions. We evaluate the proposed approach on the benchmark ActivityNet Captions dataset. We also propose an enhanced version of the LIRIS human activity dataset, providing descriptions for each action, as well as spatial, temporal, and caption annotations for previously unlabeled actions in the dataset - considered unknown actions in our experiments. Experimental results show our method's effectiveness in recognizing and describing concurrent actions in natural language and its strong ability to deal with detected unknown activities. Based on these results, we believe the proposed approach can be helpful for many real-world applications, including human behavior analysis, safety monitoring, and surveillance.
Highlights
Video understanding is a challenging issue in computer vision
THEORETICAL ASPECTS: This section presents the fundamentals of the methods used in our open-set recognition module: the Extreme Value Machine and the Triplet Inflated 3D Neural Network
METHODS: We present the OSVidCap framework for video captioning
Summary
Video understanding is a challenging issue in computer vision. It requires sophisticated techniques to process the diversity of human and object appearances across different environments and their relationships over time. Unlike closed-set classifiers, which assign infinite regions of the feature space to the training classes, open-set classifiers enclose each class in the feature space and reserve space for new classes to emerge. This strategy allows data from previously unknown classes to be rejected, instead of being wrongly assigned the class label with the highest probability value [10]. Following this idea, a video captioning approach in an open-set scenario can adequately describe known actions and deal with unknown ones. This work presents a novel open-set video captioning framework that describes, in natural language, single and concurrent events occurring in a video. We propose this framework, OSVidCap, to recognize and describe concurrent actions/activities performed by humans in an open-set scenario.
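The rejection idea behind open-set classifiers can be illustrated with a minimal distance-threshold sketch: each known class is enclosed by a bounded region around its centroid, and samples falling outside every region are rejected as "unknown". This is a simplified stand-in for the EVM used in the paper (which fits extreme-value models to margin distances rather than using a fixed radius); the function names and the threshold value here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fit_centroids(features, labels):
    """Compute one centroid per known class from training features."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def open_set_predict(x, centroids, threshold):
    """Assign the nearest known class, or reject the sample as 'unknown'
    if it falls outside every class's enclosed region (approximated here
    by a fixed distance threshold around each centroid)."""
    dists = {c: np.linalg.norm(x - mu) for c, mu in centroids.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= threshold else "unknown"

# Toy 2-D feature space with two known action classes.
feats = np.array([[0.0, 0.0], [0.1, -0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array(["walk", "walk", "wave", "wave"])
cents = fit_centroids(feats, labels)

print(open_set_predict(np.array([0.05, 0.0]), cents, threshold=1.0))  # near 'walk'
print(open_set_predict(np.array([-3.0, 8.0]), cents, threshold=1.0))  # far from both: 'unknown'
```

A closed-set classifier would be forced to label the second sample "walk" or "wave"; the bounded regions are what make rejection possible.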