Abstract

Temporal action detection in long, untrimmed videos is an important yet challenging task that requires not only recognizing the categories of actions in videos, but also localizing the start and end times of each action. In recent years, artificial neural networks such as the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) have significantly improved performance on various computer vision tasks, including action detection. In this paper, we make the most of classifiers at different granularities and propose to detect actions from fine to coarse granularity, which is also in line with how people detect actions. Our action detection method is built in the 'proposal then classification' framework. We employ several neural network architectures as deep feature extractors and as segment-level (fine-granular) and window-level (coarse-granular) classifiers. Both the proposal and classification steps are executed from the segment level to the window level. The experimental results show that our method not only achieves detection performance comparable to that of state-of-the-art methods, but also performs in a relatively balanced way across different action categories.

Highlights

  • Video analysis is important for applications ranging from robotics, human-computer interaction to intelligent surveillance

  • The second framework, 'proposal then classification' [3,4], draws inspiration from Region-based Convolutional Neural Networks (R-CNN) for object detection [5] and its upgraded versions [6,7]. It is implemented in two steps: (1) temporal action proposal, which produces a set of windows that are likely to contain an action instance; and (2) action classification, which assigns a specific category to each action proposal

  • We propose to detect actions in video from fine to coarse granularity, which is in line with how people detect actions


Summary

Introduction

Video analysis is important for applications ranging from robotics and human-computer interaction to intelligent surveillance. The second framework, 'proposal then classification' [3,4], draws inspiration from Region-based Convolutional Neural Networks (R-CNN) for object detection [5] and its upgraded versions [6,7]. It is implemented in two steps: (1) temporal action proposal, which produces a set of windows that are likely to contain an action instance; and (2) action classification, which assigns a specific category to each action proposal. Most of the action detection methods mentioned above design fine-granular classifiers; for example, [4] trained a 3D CNN classifier via a multi-task learning method for segment-level proposal, classification, and localization. Both methods [3,4] used post-processing to obtain the final detection results.
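The two-step pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual networks: `score_window` stands in for a learned proposal (actionness) scorer, and the sliding-window parameters and NMS threshold are hypothetical choices.

```python
def generate_proposals(num_frames, window_sizes, stride, score_window, threshold=0.5):
    """Step 1 (temporal action proposal): slide windows of several sizes
    over the video and keep those whose actionness score passes a threshold.
    score_window(start, end) is a placeholder for a learned scorer."""
    proposals = []
    for size in window_sizes:
        for start in range(0, num_frames - size + 1, stride):
            score = score_window(start, start + size)
            if score >= threshold:
                proposals.append((start, start + size, score))
    return proposals

def temporal_iou(a, b):
    """Intersection-over-union of two temporal windows (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(proposals, iou_threshold=0.5):
    """Post-processing: standard non-maximum suppression, keeping the
    highest-scoring window among heavily overlapping proposals."""
    kept = []
    for p in sorted(proposals, key=lambda x: x[2], reverse=True):
        if all(temporal_iou(p[:2], k[:2]) < iou_threshold for k in kept):
            kept.append(p)
    return kept
```

Step 2 (action classification) would then run a category classifier on each surviving window; here it is omitted since the highlight above only fixes the proposal/classification split, not the classifier architecture.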

Overview
Previous
Our Method
Overview of Our Method
Res3D Architecture
Discriminative temporal search
Regression Network
Datasets and Evaluation Metrics
Experiments
Implementation Details
Exploratory Study
Comparison with State-of-the-Art Methods
Limitation
Findings
Conclusions