End-to-End Referring Video Object Segmentation with Multimodal Transformers

Abstract

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR.

Similar Papers
  • Research Article
  • Cited by 60
  • 10.1145/3009906
Computer Vision and Natural Language Processing
  • Dec 12, 2016
  • ACM Computing Surveys
  • Peratham Wiriyathammabhum + 3 more

Integrating computer vision and natural language processing is a novel interdisciplinary field that has received a lot of attention recently. In this survey, we provide a comprehensive introduction of the integration of computer vision and natural language processing in multimedia and robotics applications with more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies to integrate computer vision and natural language processing models as a unified theme of distributional semantics. We make an analog of distributional semantics in computer vision and natural language processing as image embedding and word embedding, respectively. We also present a unified view for the field and propose possible future directions.

  • Research Article
  • Cited by 11
  • 10.31590/ejosat.1013329
A Benchmark for Feature-injection Architectures in Image Captioning
  • Dec 6, 2021
  • European Journal of Science and Technology
  • Rumeysa Keski̇n + 4 more

Describing an image with a grammatically and semantically correct sentence, known as image captioning, has been improved significantly with recent advances in computer vision (CV) and natural language processing (NLP) communities. The integration of these communities leads to the development of feature-injection architectures, which define how extracted features are used in captioning. In this paper, a benchmark of feature-injection architectures that utilize CV and NLP techniques is reported for encoder-decoder based captioning. Benchmark evaluations include Inception-v3 convolutional neural network to extract image features in the encoder while the feature-injection architectures such as init-inject, pre-inject, par-inject and merge are applied with a multi-layer gated recurrent unit (GRU) to generate captions in the decoder. Architectures have been evaluated extensively on the MSCOCO dataset across eight performance metrics. It has been concluded that the init-inject architecture with 3-layer GRU outperforms the other architectures in terms of captioning accuracy.

  • Research Article
  • Cited by 4
  • 10.1038/s41597-023-02653-7
DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding
  • Nov 7, 2023
  • Scientific Data
  • Kehinde Ajayi + 7 more

Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data on practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks, such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions on sketched images often included in scientific and technical documents. The advancement of other tasks such as 3D reconstruction from 2D images requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset, providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further provide the potential usefulness of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.

  • Research Article
  • 10.55041/ijsrem39932
Smart Vision Assistant Glasses for Visually Impaired Persons
  • Dec 23, 2024
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Thaseen Bhashith + 4 more

In recent years, advancements in computer vision and natural language processing (NLP) have led to the development of highly accessible and assistive technologies. This technology leverages these advancements to create a system that provides real-time object detection and text recognition capabilities, integrated with speech synthesis for audio feedback. The system employs the YOLO (You Only Look Once) algorithm for fast and accurate object detection and an Optical Character Recognition (OCR) module for extracting text from captured images. Text-to-speech (TTS) technology is incorporated to deliver audio outputs, ensuring accessibility for users, especially those with visual impairments. This decentralized system operates on user commands and does not rely on cloud processing, ensuring faster response times and data privacy. By combining computer vision and NLP, this paper offers a cost-effective and portable solution for real-time assistive applications, empowering users to interact effectively with their surroundings through visual data processing and auditory feedback. Key Words: Computer Vision, YOLOv8, OCR (Optical Character Recognition),

  • Conference Article
  • Cited by 3
  • 10.1109/ijcnn48605.2020.9206679
Component Analysis for Visual Question Answering Architectures
  • Jul 1, 2020
  • Camila Kolling + 2 more

Recent research advances in Computer Vision and Natural Language Processing have introduced novel tasks that are paving the way for solving AI-complete problems. One of those tasks is called Visual Question Answering (VQA). This system takes an image and a free-form, open-ended natural-language question about the image, and produces a natural language answer as the output. Such a task has drawn great attention from the scientific community, which generated a plethora of approaches that aim to improve the VQA predictive accuracy. Most of them comprise three major components: (i) independent representation learning of images and questions; (ii) feature fusion so the model can use information from both sources to answer visual questions; and (iii) the generation of the correct answer in natural language. With so many approaches being recently introduced, the real contribution of each component to the ultimate performance of the model became unclear. The main goal of this paper is to provide a comprehensive analysis regarding the impact of each component in VQA models. Our extensive set of experiments covers both visual and textual elements, as well as the combination of these representations in the form of fusion and attention mechanisms. Our major contribution is to identify core components for training VQA models so as to maximize their predictive performance.

  • Research Article
  • 10.47363/jaicc/2023(2)131
Literature Review: Recent Advances in Computer Vision and Language AI
  • Sep 30, 2023
  • Journal of Artificial Intelligence & Cloud Computing
  • Suresh Babu Rajasekaran

This comprehensive literature review examines the latest breakthroughs in computer vision and natural language processing (NLP), two rapidly evolving fields with applications across search, human-computer interaction, robotics, and more. It synthesizes key findings, trends, limitations, and open challenges from cutting-edge research at their intersection. The dramatic progress driven by deep neural networks is analysed in depth, along with issues like generalization, context handling, reasoning, uncertainty, and human-centric evaluation. Although remarkable advances have been made, especially in computer vision, core problems remain to be addressed. This review provides a thorough overview of the state-of-the-art, reflecting the most recent innovations, and promising future directions in this dynamic research domain.

  • Conference Article
  • Cited by 39
  • 10.1145/3341105.3373906
Lightweight network architecture for real-time action recognition
  • Mar 30, 2020
  • Alexander Kozlov + 2 more

In this work we present a new efficient approach to Human Action Recognition called Video Transformer Network (VTN). It leverages the latest advances in Computer Vision and Natural Language Processing and applies them to video understanding. The proposed method allows us to create lightweight CNN models that achieve high accuracy and real-time speed using just an RGB mono camera and general purpose CPU. Furthermore, we explain how to improve accuracy by distilling from multiple models with different modalities into a single model. We conduct a comparison with state-of-the-art methods and show that our approach performs on par with most of them on famous Action Recognition datasets. We benchmark the inference time of the models using the modern inference framework and argue that our approach compares favorably with other methods in terms of speed/accuracy trade-off, running at 56 frames per second (FPS) on CPU. The models and the training code are available.

  • Research Article
  • 10.30574/wjarr.2025.26.2.1705
A survey on image captioning methods
  • May 30, 2025
  • World Journal of Advanced Research and Reviews
  • Kavitha Soppari + 3 more

Image captioning is a task that involves natural language processing concepts to recognize the context of an image and describe it in a natural language such as English. It requires good knowledge of deep learning, Python, Jupyter notebooks, the Keras library, NumPy, and natural language processing. It is a Python-based project in which we use the deep learning techniques of Convolutional Neural Networks and a type of Recurrent Neural Network (LSTM) together. The biggest challenge is creating a description that not only captures the objects contained in an image but also expresses how these objects relate to each other. Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. Here, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. It could have great impact, for instance by helping visually impaired people better understand the content of images on the web.

  • Conference Article
  • 10.1145/3332167.3357104
Say and Find it
  • Oct 14, 2019
  • Taeyong Kim + 4 more

Recent advances in computer vision and natural language processing using deep neural networks (DNNs) have enabled rich and intuitive multimodal interfaces. However, research on intelligent assistance systems for persons with visual impairment has not been well explored. In this work, we present an interactive object recognition and guidance interface based on multimodal interaction for blind and partially sighted people using an embedded mobile device. We demonstrate that the proposed solution using DNNs can effectively assist visually impaired people. We believe that this work will provide new and helpful insights for designing intelligent assistance systems in the future.

  • Conference Article
  • Cited by 7240
  • 10.1109/cvpr.2015.7298935
Show and tell: A neural image caption generator
  • Jun 1, 2015
  • Oriol Vinyals + 3 more

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
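The BLEU-1 scores quoted in this abstract can be illustrated with a minimal sketch of the metric. It assumes a single reference caption and whitespace tokenization, whereas the reported scores are corpus-level, typically use several references per image, and are given on a 0 to 100 scale. The function name `bleu1` is hypothetical and this is not the authors' evaluation code.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against one reference (simplifying assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    # Clipped unigram precision: each candidate word is credited at most
    # as many times as it appears in the reference.
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))

    return brevity_penalty * precision


if __name__ == "__main__":
    # Made-up captions, for illustration only.
    print(bleu1("a dog runs on the grass", "a dog is running on the grass"))
```

Multiplying the result by 100 gives a value on the same scale as the scores quoted in the abstract, although a single-sentence score is not directly comparable to a corpus-level one.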

  • Research Article
  • Cited by 1031
  • 10.1109/tpami.2016.2587640
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge.
  • Jul 7, 2016
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Oriol Vinyals + 3 more

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research.

  • Book Chapter
  • Cited by 5
  • 10.1002/9781394219230.ch2
AI Applications – Computer Vision and Natural Language Processing
  • Nov 22, 2024
  • Balakrishnan Chinnaiyan + 3 more

Artificial intelligence (AI) applications in computer vision and natural language processing (NLP) have made major advances in recent years, challenging a number of sectors and areas. This multidisciplinary topic combines NLP, which examines the study of human language, and computer vision, which concentrates on the understanding of visual data. This study examines the wide range of applications that are included within this convergence, highlighting the revolutionary potential of AI technology. AI has made it possible to make significant advances in autonomous systems, object identification, and image recognition in the field of computer vision. These developments have stimulated innovation and increased efficiency, revolutionizing sectors including healthcare, autonomous vehicles, and security. Meanwhile, AI-driven advances in NLP have produced strong language models that can produce, comprehend, and translate text. These approaches have been utilized to improve accessibility and efficiency of communication in chatbots, sentiment analysis, and language translation services. This chapter explores the basic ideas and advancements in these two fields, emphasizing the opportunities and novel challenges that arise from integrating computer vision and NLP. Additionally covered are data privacy, ethical issues, and the possibility of prejudice in AI applications. The study also highlights the ongoing need for these fields' advancement and investigation in order to solve real-world problems and fully utilize AI's potential in the computer vision and NLP industries.

  • Research Article
  • Cited by 1
  • 10.5194/isprs-archives-xlviii-2-w10-2025-101-2025
Evaluation of Depth Anything Models for Satellite-Derived Bathymetry
  • Jul 7, 2025
  • The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
  • Esra Günaydın + 3 more

Abstract. The emergence of foundation models has driven major advancements in computer vision and natural language processing, primarily due to their strong zero-shot and few-shot capabilities powered by large-scale, diverse datasets. While earlier approaches used supervised datasets, their limited scene diversity meant that they did not perform well in unseen environments. To overcome these limitations, recent works have leveraged unlabeled monocular images, which can be automatically labeled using pre-trained models. One such model is Depth Anything, which demonstrated robust zero-shot performance across diverse scenarios, with Depth Anything V2 further improving accuracy. In this study, the performance of Depth Anything V1 and V2 models was evaluated in satellite-derived bathymetry using Sentinel-2 satellite imagery. The accuracy of these predicted depth maps was evaluated by comparing them with bathymetric data obtained from the National Oceanic and Atmospheric Administration’s (NOAA) National Centers for Environmental Information (NCEI) as the ground truth. The results show that the correlation between Depth Anything V1 predictions and NOAA NCEI data was 56.69%, while the correlation for Depth Anything V2 reached 84.54%. The predicted depth maps were also scaled to obtain Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). The RMSE and MAE values for Depth Anything V1 are 0.4135 m and 0.34 m, respectively, while the RMSE and MAE values for V2 are 0.2681 m and 0.2089 m, respectively. This improvement shows the capability of Depth Anything V2 in estimating underwater terrain from monocular satellite imagery, which also demonstrates its potential for cost-effective bathymetric mapping in remote sensing applications. In addition to deep learning-based approaches applied in the test area, a satellite-derived depth map was also generated using the classical band ratio method. Compared with reference bathymetric data, the correlation coefficient, RMSE, and MAE were found to be 38.20%, 0.4639 m, and 0.3746 m, respectively.
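For readers unfamiliar with the two error metrics reported in this abstract, the following is a minimal sketch of how RMSE and MAE are computed from paired depth values. The function names and the sample depths are hypothetical and chosen only for illustration; the abstract does not describe how the predicted maps were scaled or sampled, so that step is left out.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between paired values (same units as the inputs)."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error between paired values (same units as the inputs)."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


if __name__ == "__main__":
    # Made-up depths in metres, not data from the study.
    predicted_depths = [1.2, 2.5, 3.1, 4.0]
    reference_depths = [1.0, 2.7, 3.4, 3.8]
    print(f"RMSE: {rmse(predicted_depths, reference_depths):.4f} m")
    print(f"MAE:  {mae(predicted_depths, reference_depths):.4f} m")
```

Because RMSE squares each residual before averaging, it weights large errors more heavily than MAE, which is why the two values in the abstract differ for the same model.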

  • Research Article
  • 10.32913/mic-ict-research.v.n0.1352
Distillation-Centric Approaches in Visual Question Answering with Mixture of Experts
  • May 20, 2025
  • ICT Research
  • Huy Huynh Hoang + 2 more

Recent advancements in computer vision and natural language processing were applied to the Visual Question Answering task. Nonetheless, a significant proportion of models exhibiting high accuracy possess extensive architectural components. This has a significant impact on the process of bringing the technology to practical applications such as assistive devices for the blind and visually impaired, and other related fields. Our research focuses on compressing the Visual Question Answering model on the Vietnamese dataset by utilizing the knowledge distillation method. Furthermore, in order to enhance precision, we have also developed a Mixture of ViVQA Experts system that will adapt to each type of question for improving accuracy while increasing only a few parameters and not wasting time retraining the entire system from scratch. With a total of 204M parameters, this approach has reduced the size by 24.51% compared to the original model while only reducing accuracy by 6.59% on the overall test set. More specifically, we have made accuracy improvements on each question type: "number" increased by 1.35% and "color" increased by 0.48% compared to our distillation model. The code and pretrained models are available at: anonymous.
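The abstract names knowledge distillation without giving its formulation. As a rough illustration only, the sketch below assumes the common temperature-softened variant, in which a student is trained to match the teacher's softened output distribution through a KL-divergence term scaled by the squared temperature. The function names, the temperature value, and the logits are all hypothetical, and the paper's actual loss may differ.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is an assumed, generic formulation, not the loss from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / max(ps, 1e-12))  # guard against a zero student probability
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl


if __name__ == "__main__":
    # Made-up logits over three answer classes.
    teacher = [4.0, 1.0, 0.5]
    student = [2.5, 1.5, 0.2]
    print(distillation_loss(student, teacher))
```

In practice this term is usually combined with an ordinary cross-entropy loss on the ground-truth answers; how the two are weighted is a design choice that the abstract does not specify.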

  • Conference Article
  • 10.1145/3701716.3717738
Explainable Vision-Language Model for Personalized Medicine
  • May 8, 2025
  • Md Sarwar Kamal + 2 more

Recent advancements in computer vision (CV) and natural language processing (NLP) have led to the emergence of Vision-Language Models (VLMs), which excel in interpreting complex multimodal information by seamlessly integrating visual and textual data. This paper proposes a novel, interpretable framework that combines VLMs with specific mathematical transforms, namely the Fast Fourier Transform (FFT) for efficient computation of frequency domains and the Bilateral Laplace Transform for enhanced stability analysis in nonlinear systems, to enhance drug discovery and personalized medicine. The interpretable application of FFT identifies periodic patterns in temporal gene expression data from genes such as TP53 and EGFR, crucial for understanding circadian influences on drug metabolism. The Bilateral Laplace Transform, also applied in an interpretable manner, assesses system stability and response under various therapeutic interventions, focusing on genes like BRCA1 and PTEN for short-term treatment outcomes. This integrated model leverages the strengths of VLMs to synthesize and contextualize the transformed data, providing a robust and interpretable analytical tool for predicting individual drug responses and optimizing treatment strategies. Validation of the proposed framework on multimodal datasets comprising clinical imaging, genomic data, and textual descriptions confirms its potential in significantly improving the precision of personalized treatment plans. The outcomes of this research advance our understanding of complex drug interactions within the human body and also pave the way for developing a user-friendly and interpretable tool that assists clinicians in real-time decision-making, ultimately enhancing patient outcomes in clinical settings.
