Abstract

Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to their low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting holistic features with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies, which are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, to model not only holistic features but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multi-scale temporal relations in a novel fusion module, yielding more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway to learn a general feature of both frame appearances and short-term temporal variations, and (2) a temporal relation pathway to capture multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterwards, a novel fusion module is proposed to spatiotemporally integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net.
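As a rough illustration of the two-pathway design described in the abstract, the sketch below shows how a holistic pathway and a multi-scale temporal relation pathway could be wired together in PyTorch. It is a hypothetical reconstruction, not the released code: all module names, layer choices, and dimensions (FuTHNetSketch, feat_dim, the single-convolution backbones) are placeholder assumptions.

import torch
import torch.nn as nn

class FuTHNetSketch(nn.Module):
    """Minimal two-pathway sketch (hypothetical, not the official code)."""

    def __init__(self, num_classes, feat_dim=512, num_frames=8):
        super().__init__()
        # (1) Holistic pathway: a 3D convolution standing in for a
        #     backbone that captures appearance and short-term motion.
        self.holistic = nn.Conv3d(3, feat_dim, kernel_size=(3, 7, 7),
                                  stride=(1, 2, 2), padding=(1, 3, 3))
        # (2) Temporal relation pathway: per-frame features, then one
        #     MLP per subset size k (multi-scale relations).
        self.frame_encoder = nn.Conv2d(3, feat_dim, kernel_size=7,
                                       stride=2, padding=3)
        self.relation_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(k * feat_dim, feat_dim), nn.ReLU())
            for k in range(2, num_frames + 1)
        ])
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video):  # video: (B, 3, T, H, W)
        b, c, t, h, w = video.shape
        # Holistic feature, globally pooled to one vector per clip.
        holistic = self.holistic(video).mean(dim=(2, 3, 4))        # (B, D)
        # Per-frame features for the relation pathway.
        frames = self.frame_encoder(
            video.transpose(1, 2).reshape(b * t, c, h, w)
        ).mean(dim=(2, 3)).view(b, t, -1)                          # (B, T, D)
        # Multi-scale relations: an MLP over the first k frames per
        # scale (a real relation module would sample k-frame subsets).
        relations = sum(
            mlp(frames[:, :k].reshape(b, -1))
            for k, mlp in zip(range(2, t + 1), self.relation_mlps)
        )
        # Placeholder fusion by addition; the paper proposes a dedicated
        # fusion module (sketched separately under Highlights).
        return self.classifier(holistic + relations)

The relation pathway here follows the spirit of temporal relational reasoning: small MLPs over frame subsets of different sizes approximate relations at multiple temporal scales, complementing the short temporal receptive field of the holistic pathway.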

Highlights

  • To spatiotemporally fuse the two features from the two pathways, we further present a novel fusion module in which the multi-scale temporal relations are leveraged to refine the temporal features in the holistic representation.

  • Motivated by conditional normalization [57], [58], we present a novel fusion module where the two features are spatiotemporally registered by modulating the holistic features according to the temporal relations (a hypothetical sketch of this idea follows this list).
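Conditional normalization of this kind can be read in the spirit of FiLM-style feature modulation: the temporal relation vector predicts a per-channel scale and shift that are applied to the normalized holistic feature map. The following is a minimal, hypothetical PyTorch sketch of that idea; the class and parameter names (RelationGuidedFusion, relation_dim) are our own placeholders, not the paper's exact formulation.

import torch
import torch.nn as nn

class RelationGuidedFusion(nn.Module):
    """Hypothetical FiLM-style fusion sketch (not the official module).

    The multi-scale temporal relation vector conditions a per-channel
    scale (gamma) and shift (beta) that modulate the normalized
    holistic feature map.
    """

    def __init__(self, channels, relation_dim):
        super().__init__()
        # Normalize the holistic features without learned affine
        # parameters; the affine transform is predicted from the
        # relation vector instead.
        self.norm = nn.InstanceNorm3d(channels, affine=False)
        self.to_gamma = nn.Linear(relation_dim, channels)
        self.to_beta = nn.Linear(relation_dim, channels)

    def forward(self, holistic, relation):
        # holistic: (B, C, T, H, W); relation: (B, relation_dim)
        gamma = self.to_gamma(relation)[:, :, None, None, None]
        beta = self.to_beta(relation)[:, :, None, None, None]
        # Scale and shift every channel conditioned on the temporal
        # relations, refining the holistic map at all positions.
        return (1 + gamma) * self.norm(holistic) + beta

A typical call would be fused = fusion(holistic_map, relation_vec) with holistic_map of shape (B, C, T, H, W), so the relation-conditioned refinement acts at every spatiotemporal location.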


Introduction

By virtue of their low-cost, real-time, and high-resolution data acquisition capacity, unmanned aerial vehicles (UAVs) can be exploited for a wide range of applications [1]–[17] in the field of remote sensing, such as object tracking and surveillance [5]–[10], traffic flow monitoring [11]–[14], and precision agriculture [15]–[17]. There is an escalating demand for automatically parsing aerial videos, because it is unrealistic for humans to screen such big data and understand their contents. Feature learning and representation from videos is crucial for this task. Compared to a sequence of remote sensing images, in which the temporal information is limited due to relatively long satellite revisit periods, an overhead video is able to deliver more fine-grained temporal dynamics that are essential for describing complex events. Moving from image recognition to video classification, much effort has been made to learn spatiotemporal feature representations.
