Abstract
Language-vision integration has become an increasingly popular research direction within the computer vision field. In recent years, there has been growing recognition of the importance of incorporating linguistic information into visual tasks, particularly in domains such as action anticipation. This integration allows anticipation models to leverage textual descriptions to gain deeper contextual understanding, leading to more accurate predictions. In this work, we focus on pedestrian action anticipation, where the objective is the early prediction of pedestrians’ future actions in urban environments. Our method relies on a multi-modal transformer model that encodes past observations and produces predictions at different anticipation times, employing a learned mask technique to filter out redundancy in the observed frames. Instead of relying solely on visual cues extracted from images or videos, we explore the impact of integrating textual information to enrich the input modalities of our pedestrian action anticipation model. We investigate various techniques for generating descriptive captions corresponding to input images, aiming to enhance anticipation performance. Evaluation results on available public benchmarks demonstrate the effectiveness of our method in improving prediction performance at different anticipation times compared to previous works. Additionally, incorporating the language modality into our anticipation model yielded significant improvements, reaching a 29.5% increase in the F1 score at 1-second anticipation and a 16.66% increase at 4-second anticipation. These results underscore the potential of language-vision integration in advancing pedestrian action anticipation in complex urban environments.
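The abstract describes fusing per-frame visual features with caption (text) embeddings and down-weighting redundant frames via a learned mask. A minimal sketch of that idea is below; all names, shapes, and the sigmoid gating are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_mask(visual_feats, text_feats, mask_logits):
    """Hypothetical sketch: concatenate per-frame visual features with
    caption embeddings, then soft-mask redundant frames.

    visual_feats: (T, Dv) per-frame visual features
    text_feats:   (T, Dt) per-frame caption embeddings
    mask_logits:  (T, 1)  learned per-frame mask logits (assumed form)
    """
    # Fuse modalities by concatenation along the feature dimension.
    tokens = np.concatenate([visual_feats, text_feats], axis=-1)  # (T, Dv+Dt)
    # Sigmoid gate in (0, 1): low values suppress redundant frames.
    gate = 1.0 / (1.0 + np.exp(-mask_logits))
    return tokens * gate  # masked multi-modal tokens, (T, Dv+Dt)

# Toy dimensions for illustration only.
T, Dv, Dt = 8, 16, 16
visual = rng.normal(size=(T, Dv))
text = rng.normal(size=(T, Dt))
mask_logits = rng.normal(size=(T, 1))

fused = fuse_and_mask(visual, text, mask_logits)
print(fused.shape)  # (8, 32)
```

In a full model, the masked tokens would then feed a transformer encoder that produces predictions at each anticipation horizon; this sketch only covers the fusion-and-masking step.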