Abstract
Language-vision integration has become an increasingly popular research direction within the computer vision field. In recent years, there has been growing recognition of the importance of incorporating linguistic information into visual tasks, particularly in domains such as action anticipation. This integration allows anticipation models to leverage textual descriptions to gain deeper contextual understanding, leading to more accurate predictions. In this work, we focus on pedestrian action anticipation, where the objective is the early prediction of pedestrians’ future actions in urban environments. Our method relies on a multi-modal transformer model that encodes past observations and produces predictions at different anticipation times, employing a learned mask technique to filter out redundancy in the observed frames. Instead of relying solely on visual cues extracted from images or videos, we explore the impact of integrating textual information to enrich the input modalities of our pedestrian action anticipation model. We investigate various techniques for generating descriptive captions corresponding to input images, aiming to enhance anticipation performance. Evaluation results on available public benchmarks demonstrate the effectiveness of our method in improving prediction performance at different anticipation times compared to previous works. Additionally, incorporating the language modality into our anticipation model yielded significant improvements, reaching a 29.5% increase in the F1 score at 1-second anticipation and a 16.66% increase at 4-second anticipation. These results underscore the potential of language-vision integration in advancing pedestrian action anticipation in complex urban environments.
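The abstract describes fusing per-frame visual features with caption (text) embeddings and down-weighting redundant frames via a learned mask. A minimal sketch of that idea is below; all names, shapes, and the sigmoid gating are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_mask(visual_feats, text_feats, mask_logits):
    """Hypothetical sketch: concatenate per-frame visual features with
    caption embeddings, then soft-mask redundant frames.

    visual_feats: (T, Dv) per-frame visual features
    text_feats:   (T, Dt) per-frame caption embeddings
    mask_logits:  (T, 1)  learned per-frame mask logits (assumed form)
    """
    # Fuse modalities by concatenation along the feature dimension.
    tokens = np.concatenate([visual_feats, text_feats], axis=-1)  # (T, Dv+Dt)
    # Sigmoid gate in (0, 1): low values suppress redundant frames.
    gate = 1.0 / (1.0 + np.exp(-mask_logits))
    return tokens * gate  # masked multi-modal tokens, (T, Dv+Dt)

# Toy dimensions for illustration only.
T, Dv, Dt = 8, 16, 16
visual = rng.normal(size=(T, Dv))
text = rng.normal(size=(T, Dt))
mask_logits = rng.normal(size=(T, 1))

fused = fuse_and_mask(visual, text, mask_logits)
print(fused.shape)  # (8, 32)
```

In a full model, the masked tokens would then feed a transformer encoder that produces predictions at each anticipation horizon; this sketch only covers the fusion-and-masking step.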