Cross-Task Relation-Aware Consistency for Weakly Supervised Temporal Action Detection

Abstract
Temporal action detection aims to predict the temporal boundaries and category labels of actions in untrimmed videos. In recent years, many weakly supervised temporal action detection methods have been proposed to relieve the annotation cost of fully supervised methods. Due to the discrepancy between action localization and action classification, a two-branch structure is widely adopted by existing weakly supervised methods, where the classification branch predicts category-wise scores and the localization branch predicts a foreground score for each segment. Under the weakly supervised setting, model training is guided mainly by video-level or sparse segment-level annotations. As a result, the classification branch tends to focus on the most discriminative segments while ignoring less discriminative ones so as to minimize the classification cost, and the localization branch may assign high foreground scores to some negative segments. This phenomenon can severely damage action detection performance, because the foreground scores and classification scores are combined at test time for action detection. To deal with this problem, several methods have been proposed to encourage consistency between the classification branch and the localization branch. However, these methods only consider video-level or segment-level consistency, without requiring the relations among different segments to be consistent. In this paper, we propose a Cross-Task Relation-Aware Consistency (CRC) strategy for weakly supervised temporal action detection, comprising an intra-video consistency module and an inter-video consistency module. The intra-video consistency module keeps the relationships among segments from the same video consistent across the two branches, and the inter-video consistency module does the same for segments from different videos.
The two modules are complementary, combining intra-video and inter-video consistency. Experimental results show that the proposed CRC strategy consistently improves the performance of existing weakly supervised methods, including click-level supervised methods (e.g., LACP, Lee et al., 2021), video-level supervised methods (e.g., DELU, Chen et al., 2022) and unsupervised methods (e.g., BaS-Net, Lee et al., 2020), verifying the generality and effectiveness of the proposed method.
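To make the "relation-aware" idea concrete, a cross-task relation-consistency loss can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the use of per-segment feature vectors, the cosine-similarity relation matrix, and the squared-difference penalty are all assumptions.

```python
import numpy as np

def relation_matrix(feats):
    """Pairwise cosine-similarity matrix over segment features (T x D)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T  # (T, T): how each segment relates to every other

def relation_consistency_loss(cls_feats, loc_feats):
    """Penalize disagreement between the two branches' segment-relation
    matrices, rather than matching individual scores segment by segment."""
    r_cls = relation_matrix(cls_feats)
    r_loc = relation_matrix(loc_feats)
    return np.mean((r_cls - r_loc) ** 2)

# Toy example: 4 segments with 8-dim features from each branch.
rng = np.random.default_rng(0)
cls_feats = rng.standard_normal((4, 8))
loss_self = relation_consistency_loss(cls_feats, cls_feats)  # identical branches
```

The point of matching relation matrices instead of raw scores is that the two branches need not produce the same values, only the same structure among segments.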

Similar Papers
  • Research Article
  • Cited by 14
  • 10.1109/tcsvt.2021.3104226
Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection
  • May 1, 2022
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Yaosen Chen + 5 more

Temporal action detection is a challenging task in video understanding, because complex backgrounds and rich action content make it difficult to generate high-quality temporal proposals in untrimmed videos. Capsule networks can avoid some limitations of convolutional neural networks, such as the invariance introduced by pooling, and can better capture the temporal relations needed for temporal action detection. However, because of its extremely expensive computation, the capsule network is difficult to apply to temporal action detection. To address this issue, this paper proposes a novel U-shaped capsule network framework with a k-Nearest Neighbor (k-NN) mechanism for 3D convolutional dynamic routing, which we name U-BlockConvCaps. Furthermore, we build a Capsule Boundary Network (CapsBoundNet) based on U-BlockConvCaps for dense temporal action proposal generation. Specifically, the first module is a 1D convolutional layer that fuses the two-stream RGB and optical-flow video features. A sampling module further processes the fused features to generate 2D start-end action proposal feature maps. Then, a multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing processes the proposal feature map. Finally, the feature maps generated by CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help capture boundary and intersection-over-union features. Our work improves the dynamic routing algorithm of capsule networks and, to the best of our knowledge, extends capsule networks to temporal action detection for the first time. Experimental results on the THUMOS14 benchmark show that CapsBoundNet clearly surpasses state-of-the-art methods: mAP at tIoU = 0.3, 0.4, and 0.5 improves from 63.6% to 70.0%, 57.8% to 63.1%, and 51.3% to 52.9%, respectively. We also obtain competitive results on the ActivityNet1.3 action detection dataset.

  • Research Article
  • Cited by 7
  • 10.3390/app8101924
Temporal Action Detection in Untrimmed Videos from Fine to Coarse Granularity
  • Oct 15, 2018
  • Applied Sciences
  • Guangle Yao + 3 more

Temporal action detection in long, untrimmed videos is an important yet challenging task that requires not only recognizing the categories of actions in videos but also localizing the start and end time of each action. In recent years, artificial neural networks such as the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) have significantly improved performance in various computer vision tasks, including action detection. In this paper, we make the most of classifiers at different granularities and propose to detect actions from fine to coarse granularity, which is also in line with human detection habits. Our action detection method is built on the 'proposal then classification' framework. We employ several neural network architectures as deep information extractors and as segment-level (fine-grained) and window-level (coarse-grained) classifiers. Both the proposal and classification steps are executed from the segment to the window level. The experimental results show that our method not only achieves detection performance comparable to that of state-of-the-art methods, but also performs relatively evenly across different action categories.

  • Research Article
  • Cited by 3
  • 10.1109/tmm.2022.3163459
Superframe-Based Temporal Proposals for Weakly Supervised Temporal Action Detection
  • Jan 1, 2023
  • IEEE Transactions on Multimedia
  • Bairong Li + 4 more

Weakly supervised Temporal Action Detection (TAD) using video-level annotations can lighten the burden of labor-intensive labeling. However, current weakly supervised TAD methods do not take full advantage of the short-term consistency between consecutive frames and the long-term continuity within an action, resulting in less accurate action boundaries in untrimmed videos. In this paper, the SuperFrame-based Temporal Proposal (SFTP) is proposed, in which superframes represent series of consecutive frames with high temporal consistency, and their features are pooled from frame features through an integration function. Temporal proposals are then built from multiple consecutive superframes, and the features of all proposals are generated from a pyramidal feature hierarchy. This hierarchy consists of the designed Structured Outer-Inner Context (SOIC) features formed from superframe features and explicitly characterizes the temporal continuity inside a proposal. Furthermore, a novel Scale-Wise Normalization Strategy (SWNS) is proposed to identify proposals, which can effectively detect multiple actions of different durations in one untrimmed video. Extensive experiments are conducted on two public datasets, THUMOS14 and ActivityNet1.2, for performance evaluation. The experimental results demonstrate that the proposed approach detects action boundaries more effectively and obtains competitive mAP (mean average precision) compared with other approaches.
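The superframe-pooling step can be illustrated with a toy sketch. Mean pooling is an assumed stand-in for the paper's integration function, and the function name and boundary format are hypothetical:

```python
import numpy as np

def superframe_features(frame_feats, boundaries):
    """Pool per-frame features (T x D) into one feature per superframe by
    averaging over each [start, end) span of consecutive frames.
    Mean pooling here is an illustrative assumption."""
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in boundaries])
```

Each superframe thus summarizes a run of temporally consistent frames, and proposals are then assembled from consecutive superframes rather than raw frames.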

  • Conference Article
  • Cited by 3
  • 10.1109/icpr.2018.8545487
Temporal Action Detection by Joint Identification-Verification
  • Aug 1, 2018
  • Wen Wang + 4 more

Temporal action detection aims at not only recognizing the action category but also detecting the start and end time of each action instance in an untrimmed video. The key challenge of this task is to accurately classify actions and determine the temporal boundaries of each action instance. In the THUMOS 2014 temporal action detection benchmark, large variations exist within the same action category while many similarities exist across different action categories, which limits detection performance. To address this problem, we propose a joint Identification-Verification network to reduce intra-action variations and enlarge inter-action differences. The joint Identification-Verification network is a Siamese network based on 3D ConvNets, which simultaneously predicts the action categories and similarity scores for input pairs of video proposal segments. Extensive experimental results on the challenging THUMOS 2014 dataset demonstrate the effectiveness of our proposed method compared with existing state-of-the-art methods for temporal action detection in untrimmed videos.

  • Conference Article
  • Cited by 6
  • 10.1145/3460426.3463643
Few-Shot Action Localization without Knowing Boundaries
  • Aug 24, 2021
  • Ting-Ting Xie + 3 more

Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that, in the literature, has typically been addressed assuming the availability of large amounts of annotated training samples for each class -- either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one/few trimmed examples of the target action are available at test time, and b) a large collection of videos with only class-label annotation (some trimmed, and some weakly annotated untrimmed ones) is available for training, with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos, and to temporally localize actions at test time. To the best of our knowledge, we are the first to propose a weakly-supervised, one/few-shot action localization network that can be trained in an end-to-end fashion. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our method achieves performance comparable to or better than that of state-of-the-art fully-supervised, few-shot learning methods.
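The temporal-similarity idea can be sketched with a toy example. The names, shapes, use of cosine similarity, and the max-over-exemplar step are illustrative assumptions, not the paper's learned network:

```python
import numpy as np

def temporal_similarity_matrix(untrimmed, exemplar):
    """Cosine similarity between every untrimmed-video frame (T1 x D)
    and every trimmed-exemplar frame (T2 x D); returns a T1 x T2 matrix."""
    u = untrimmed / np.linalg.norm(untrimmed, axis=1, keepdims=True)
    e = exemplar / np.linalg.norm(exemplar, axis=1, keepdims=True)
    return u @ e.T

def temporal_activation(untrimmed, exemplar):
    """Score each untrimmed frame by its best match to any exemplar frame,
    a crude stand-in for a learned temporal class activation map."""
    return temporal_similarity_matrix(untrimmed, exemplar).max(axis=1)

rng = np.random.default_rng(1)
video = rng.standard_normal((10, 16))  # 10 frames of an untrimmed video
support = video[3:6]                   # pretend frames 3-5 form the trimmed exemplar
scores = temporal_activation(video, support)
```

Frames that resemble the exemplar receive high activation, giving a rough temporal localization without any boundary annotation.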

  • Research Article
  • Cited by 12
  • 10.1109/lsp.2018.2888758
End-to-End Temporal Action Detection Using Bag of Discriminant Snippets
  • Feb 1, 2019
  • IEEE Signal Processing Letters
  • Fiza Murtaza + 3 more

Detecting human actions in long untrimmed videos is a challenging problem. Existing temporal-action detection methods have difficulties in finding the precise starting and ending times of the actions in untrimmed videos. In this letter, we propose a temporal-action detection framework that can detect multiple actions in an end-to-end manner, based on a Bag of Discriminant Snippets (BoDS). BoDS is based on the observation that multiple actions and the background classes have similar snippets, which cause incorrect classification of action regions and imprecise boundaries. We solve this issue by finding the key-snippets from the training data of each class and compute their discriminative power, which is used in BoDS encoding. During testing of an untrimmed video, we find the BoDS representation for multiple candidate proposals and find their class label based on a majority voting scheme. We test BoDS on the Thumos14 and ActivityNet datasets and obtain state-of-the-art results. For the sports subset of ActivityNet dataset, we obtain a mean Average Precision (mAP) value of 29% at 0.7 temporal Intersection over Union (tIoU) threshold. For the Thumos14 dataset, we obtain a significant gain in terms of mAP, i.e., improving from 20.8% to 31.6% at tIoU = 0.7.
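The final voting step described above can be sketched in a simplified form. The function name and the plain majority vote are hypothetical simplifications; the actual BoDS encoding weights snippets by their learned discriminative power:

```python
from collections import Counter

def classify_proposal(snippet_labels):
    """Assign a candidate proposal the class label predicted by the
    majority of its snippets (an unweighted stand-in for BoDS voting)."""
    return Counter(snippet_labels).most_common(1)[0][0]
```

A weighted variant would count each snippet's vote in proportion to its discriminative score rather than equally.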

  • Book Chapter
  • 10.1007/978-3-030-72610-2_25
Human Action Recognition for Boxing Training Simulator
  • Jan 1, 2021
  • Anton Broilovskiy + 1 more

Computer vision technologies are widely used in sports to monitor the quality of training. However, there are only a few approaches to recognizing the punches of a person engaged in boxing training, and existing approaches rely on manual feature selection and are trained on insufficient datasets. We introduce a new approach for recognizing actions in an untrimmed video based on three stages: removing frames without actions, action localization, and action classification. Furthermore, we collected a sufficient dataset containing five classes represented by more than 1000 punches in total. At each stage, we compared existing approaches and found the optimal model, which allowed us to recognize actions in untrimmed videos with an accuracy of 87%.

  • Research Article
  • Cited by 103
  • 10.1007/s11263-019-01211-2
Temporal Action Detection with Structured Segment Networks
  • Aug 28, 2019
  • International Journal of Computer Vision
  • Yue Zhao + 5 more

This paper addresses an important and challenging task, namely detecting the temporal intervals of actions in untrimmed videos. Specifically, we present a framework called the structured segment network (SSN). It is built on temporal proposals of actions. SSN models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and precise localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end manner. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping, is devised to generate high-quality action proposals. We further study the importance of the decomposed discriminative model and discover a way to achieve similar accuracy using a single classifier, which is also complementary to the original SSN design. On two challenging benchmarks, THUMOS'14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
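The decomposed two-classifier idea can be illustrated with a small sketch. The function name and the product combination are illustrative assumptions, not necessarily SSN's exact scoring function:

```python
import numpy as np

def proposal_scores(activity_logits, completeness_prob):
    """Rank a proposal by both 'which action is it' (softmax over classes)
    and 'is it a complete instance' (a separate completeness probability).
    The product combination here is an illustrative assumption."""
    z = np.exp(activity_logits - activity_logits.max())  # stable softmax
    class_prob = z / z.sum()
    return class_prob * completeness_prob

# An incomplete proposal is down-weighted even when the class is confident.
complete = proposal_scores(np.array([4.0, 0.5, 0.1]), completeness_prob=0.9)
truncated = proposal_scores(np.array([4.0, 0.5, 0.1]), completeness_prob=0.2)
```

Separating the two decisions lets a confident class prediction be suppressed when the proposal covers only part of an action instance.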

  • Conference Article
  • Cited by 864
  • 10.1109/iccv.2017.317
Temporal Action Detection with Structured Segment Networks
  • Oct 1, 2017
  • Yue Zhao + 5 more

Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG) is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.

  • Research Article
  • Cited by 14
  • 10.1109/access.2021.3110973
ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation
  • Jan 1, 2021
  • IEEE Access
  • Khoa Vo + 5 more

Temporal action proposal generation (TAPG) aims to estimate the temporal intervals of actions in untrimmed videos; it is challenging yet plays an important role in many video analysis and understanding tasks. Despite the great achievements in TAPG, most existing works ignore the human perception of interaction between agents and the surrounding environment, applying a deep learning model as a black box to untrimmed videos to extract visual representations. It is therefore beneficial, and can potentially improve TAPG performance, to capture these interactions between agents and the environment. In this paper, we propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks: (i) an Agent-Aware Representation Network to capture both agent-agent and agent-environment relationships in the video representation, and (ii) a Boundary Generation Network to estimate the confidence scores of temporal intervals. In the Agent-Aware Representation Network, the interactions between agents are expressed through a local pathway, which operates at a local level to focus on the motions of agents, whereas the overall perception of the surroundings is expressed through a global pathway, which operates at a global level to perceive agent-environment effects. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks (i.e., C3D, SlowFast, and Two-Stream) show that our proposed ABN robustly outperforms state-of-the-art TAPG methods regardless of the employed backbone network. We further examine proposal quality by feeding proposals generated by our method into temporal action detection (TAD) frameworks and evaluating their detection performance. The source code can be found at https://github.com/vhvkhoa/TAPG-AgentEnvNetwork.git.

  • Conference Article
  • Cited by 10
  • 10.1109/icassp39728.2021.9414253
SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection
  • Jun 6, 2021
  • Ranyu Ning + 2 more

Temporal action detection (TAD) is a challenging task that aims to temporally localize and recognize human actions in untrimmed videos. Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale of action instances are set by designers. Such anchor-based TAD methods limit generalization capability and lead to performance degradation when videos contain rich action variation. In this study, we explore removing the requirement of pre-defined anchors from TAD methods. A novel TAD model termed the Selective Receptive Field Network (SRF-Net) is developed, in which the location offsets and classification scores at each temporal location are estimated directly in the feature map, and SRF-Net is trained in an end-to-end manner. A building block called Selective Receptive Field Convolution (SRFC) is specifically designed to adaptively adjust its receptive field size according to the multiple scales of input information at each temporal location in the feature map. Extensive experiments are conducted on the THUMOS14 dataset, and superior results are reported compared with state-of-the-art TAD approaches.

  • Research Article
  • Cited by 53
  • 10.1109/tpami.2022.3193611
Deep Learning-Based Action Detection in Untrimmed Videos: A Survey.
  • Apr 1, 2023
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Elahe Vahdani + 1 more

Understanding human behavior and activity facilitates advancement of numerous real-world applications, and is critical for video analysis. Despite the progress of action recognition algorithms on trimmed videos, the majority of real-world videos are lengthy and untrimmed, with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundaries of actions and classify the action categories. The temporal activity detection task has been investigated under full and limited supervision settings depending on the availability of action annotations. This article provides an extensive overview of deep learning-based algorithms that tackle temporal action detection in untrimmed videos with different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this article reviews advances in spatio-temporal action detection, where actions are localized in both the temporal and spatial dimensions. Action detection in the online setting is also reviewed, where the goal is to detect actions in each frame without considering any future context in a live video stream. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of the state-of-the-art methods is compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.

  • Conference Article
  • 10.1145/3278198.3278224
Temporal Action Detection with Long Action Seam Mechanism
  • Sep 19, 2018
  • Yiheng Cai + 2 more

Temporal action detection has recently become a hot topic in the action recognition field. In this paper, we propose a novel framework that extracts action segments from untrimmed videos while predicting the action category. We introduce a cascaded pipeline that first addresses temporal boundaries, including feature extraction and a temporal proposal model; all video clips obtained in this way are then sent to an action category classifier. Furthermore, since varying action lengths lead to inaccurate results, especially on long action clips, we present a novel long action seam mechanism to deal with the inaccurate localization of long actions. Our method is therefore more sensitive to long action boundaries, and the long action seam mechanism noticeably improves the performance of our algorithm. Our algorithm improves accuracy by increasing mAP from 25.6 to 25.9 at threshold 0.5 on the standard temporal action detection dataset THUMOS14, indicating particularly strong performance on long action detection.

  • Conference Article
  • Cited by 54
  • 10.1109/wacv48630.2021.00301
PDAN: Pyramid Dilated Attention Network for Action Detection
  • Jan 1, 2021
  • Rui Dai + 5 more

Handling long and complex temporal information is an important challenge for action detection tasks. This challenge is further aggravated by densely distributed actions in untrimmed videos. Previous action detection methods fail to select the key temporal information in long videos. To this end, we introduce the Dilated Attention Layer (DAL). Compared to a standard temporal convolution layer, DAL allocates attentional weights to the local frames in the kernel, which enables it to learn better local representations across time. Furthermore, we introduce the Pyramid Dilated Attention Network (PDAN), which is built upon DAL. With the help of multiple DALs with different dilation rates, PDAN can model short-term and long-term temporal relations simultaneously by focusing on local segments at low and high temporal receptive fields. This property enables PDAN to handle complex temporal relations between different action instances in long untrimmed videos. To corroborate the effectiveness and robustness of our method, we evaluate it on three densely annotated, multi-label datasets: MultiTHUMOS, Charades, and the Toyota Smarthome Untrimmed (TSU) dataset. PDAN is able to outperform previous state-of-the-art methods on all these datasets.

  • Abstract
  • 10.1016/s0021-9290(07)70004-0
Human locomotion
  • Jan 1, 2007
  • Journal of Biomechanics
  • Alberto Minetti

