Food image generation plays a crucial role in evaluating food ingredients, predicting dietary preferences, recommending food, and computing dietary nutrition. However, the task is challenging due to the large variation in the appearance of recipe components, the difficulty of aligning multi-modal features, and the limited diversity of generated data. To address these challenges, we propose a novel RecipeCLIP-Diffusion Food Generation Model (RD-FGM) that generates high-quality, diverse food images while aligning multi-modal features. Specifically, the RecipeCLIP model learns a multi-ingredient embedding of image-text pairs to align contextual features. In addition, we devise a multi-conditional guided diffusion model that learns the data distribution and provides control over generation. We evaluate RD-FGM on the large-scale Recipe1M dataset and the Chinese VIREO Food-172 dataset. Extensive experiments, including quantitative analysis, qualitative analysis, and ablation studies, demonstrate the effectiveness of RD-FGM. We further conduct model-migration experiments to evaluate its scalability to other downstream tasks such as ingredient classification. The ability to generate realistic food images from textual recipes opens new avenues for exploring culinary creations and for food and ingredient classification, with promising applications in the food industry and beyond.