Image-text Pairs Research Articles

Background: Generating radiologic findings from chest radiographs is pivotal in medical image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with vision (GPT-4V), has opened new perspectives on the potential for automated image-text pair generation. However, the application of GPT-4V to real-world chest radiography is yet to be thoroughly examined.

Purpose: To investigate the capability of GPT-4V to generate radiologic findings from real-world chest radiographs.

Materials and Methods: In this retrospective study, 100 chest radiographs with free-text radiology reports were annotated by a cohort of radiologists, two attending physicians and three residents, to establish a reference standard. Of the 100 chest radiographs, 50 were randomly selected from the National Institutes of Health (NIH) chest radiographic data set, and 50 were randomly selected from the Medical Imaging and Data Resource Center (MIDRC). The performance of GPT-4V at detecting imaging findings from each chest radiograph was assessed in the zero-shot setting (where it operates without prior examples) and the few-shot setting (where it operates with two examples). Its outcomes were compared with the reference standard with regard to clinical conditions and their corresponding codes in the International Statistical Classification of Diseases, Tenth Revision (ICD-10), including the anatomic location (hereafter, laterality).

Results: In the zero-shot setting, in the task of detecting ICD-10 codes alone, GPT-4V attained an average positive predictive value (PPV) of 12.3%, average true-positive rate (TPR) of 5.8%, and average F1 score of 7.3% on the NIH data set, and an average PPV of 25.0%, average TPR of 16.8%, and average F1 score of 18.2% on the MIDRC data set. When both the ICD-10 codes and their corresponding laterality were considered, GPT-4V produced an average PPV of 7.8%, average TPR of 3.5%, and average F1 score of 4.5% on the NIH data set, and an average PPV of 10.9%, average TPR of 4.9%, and average F1 score of 6.4% on the MIDRC data set. With few-shot learning, GPT-4V showed improved performance on both data sets. Compared with the zero-shot setting, the few-shot setting improved the average TPRs and F1 scores but did not substantially increase the average PPV.

Conclusion: Although GPT-4V has shown promise in understanding natural images, it had limited effectiveness in interpreting real-world chest radiographs.

© RSNA, 2024. Supplemental material is available for this article.
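The averaged PPV, TPR, and F1 scores reported above can be illustrated with a minimal Python sketch. It assumes, since the abstract does not say so, that each radiograph's predicted findings are compared with the reference findings as sets, and that the three scores are computed per radiograph and then averaged. The function names `prf` and `average_scores` and the example codes are hypothetical illustrations, not the study's evaluation code.

```python
from typing import Hashable, Iterable, Set, Tuple

Scores = Tuple[float, float, float]  # (PPV, TPR, F1)


def prf(predicted: Set[Hashable], reference: Set[Hashable]) -> Scores:
    """Score one radiograph: predicted findings vs. reference findings."""
    tp = len(predicted & reference)  # findings present in both sets
    # Assumption: an empty prediction or reference set scores 0, not undefined.
    ppv = tp / len(predicted) if predicted else 0.0
    tpr = tp / len(reference) if reference else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) > 0 else 0.0
    return ppv, tpr, f1


def average_scores(
    pairs: Iterable[Tuple[Set[Hashable], Set[Hashable]]]
) -> Scores:
    """Average the per-radiograph PPV, TPR, and F1 over a data set."""
    scores = [prf(pred, ref) for pred, ref in pairs]
    if not scores:
        raise ValueError("at least one radiograph is required")
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


if __name__ == "__main__":
    # Illustrative ICD-10 codes only; these are not data from the study.
    codes_only = [
        ({"J90", "J18.9"}, {"J90", "I51.7", "J93.9"}),
        (set(), {"J90"}),
    ]
    print(average_scores(codes_only))

    # Stricter task: a finding counts only if code AND laterality match.
    with_laterality = [
        ({("J90", "left")}, {("J90", "right")}),
    ]
    print(average_scores(with_laterality))
```

Passing `(code, laterality)` tuples instead of bare codes gives the stricter second evaluation with no change to the scoring logic. One hint that the scores may have been averaged per radiograph: the harmonic mean of the reported NIH averages (PPV 12.3%, TPR 5.8%) is about 7.9%, not the reported average F1 of 7.3%, although the abstract does not confirm the averaging scheme.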

Introduction: Several authors have demonstrated the relevance of the therapist's sensitivity to the client's affective expression (Merten & Schwab, 2005; 150-158), as well as to the therapist's own emotional experience (Haynal-Raymond et al., 2005; 142-148), in building a more effective therapeutic relationship and achieving more effective results. An important source of information for decoding emotional cues is the face and its expression (Ekman & Friesen, 1975; Russel & Fernández-Dolls, 1997; 275-294). Although common sense suggests that context is relevant to understanding the meaning of an emotional facial expression, the literature shows inconsistent results.

Objectives: The main goal of this study was to evaluate the impact of clinical context on the perception of emotional facial expressions.

Methods: This study followed a within-subjects design, and its sample consisted of 60 clinical psychologists. Twenty-one combinations of facial expression images (prototypical expressions and mixed emotional signals) and clinical information texts were presented to the participants, who were then asked to judge the type of emotion displayed. The presentation of the text-image pairs was randomized across three conditions: consistent, non-consistent, and neutral.

Results: The results suggest that emotions are more easily recognized in the presence of a concordant context than a non-concordant or neutral one, and that the greater the similarity between the facial expression in the image presented and the face prototypically associated with the context, the greater the influence of the context. However, in the recognition of mixed emotional signals, anger, as a non-dominant emotion, was recognized more often in the presence of the neutral story than in the presence of the story that agreed with the dominant emotion (sadness). Sadness, as a non-dominant emotion, was also recognized more often in the presence of a story in agreement with fear than in the presence of a neutral story. There was a statistically significant increase in the attribution of anger to images in which it is not present and whose dominant emotion is fear when they were associated with a context of aggression rather than a neutral context. There was also a significant decrease in the attribution of fear to the sadness-anger image (25%-75%) in the presence of the aggression context compared with the neutral and panic contexts. Finally, there was a statistically significant decrease in the attribution of sadness to an image of fear in the neutral context compared with the other contexts (panic and aggression).

Conclusions: Our study has shown that context leads to the overvaluation or undervaluation of emotional facial expressions, for both prototypical expressions and mixed emotional signals, with respect to sadness, fear, and anger. Thus, mental health clinicians should consider the influence of these contexts.

Disclosure of Interest: None declared.

Articles published on Image-text Pairs

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

A multimodal transfer learning framework for the classification of disaster-related social media images

Evaluating GPT-V4 (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs.

Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching

A self-supervised framework for cross-modal search in histopathology archives using scale harmonization

High-Accuracy Tomato Leaf Disease Image-Text Retrieval Method Utilizing LAFANet.

Prompt-Enhanced Generation for Multimodal Open Question Answering

Attr4Vis: Revisiting Importance of Attribute Classification in Vision-Language Models for Video Recognition

The impact of clinical context on the recognition of facial expressions

Heterogeneous Graph Fusion Network for cross-modal image-text retrieval

A face retrieval technique combining large models and artificial neural networks

Image Captioning with Multi-Context Synthetic Data

GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

CLIM: Contrastive Language-Image Mosaic for Region Representation

Expediting Contrastive Language-Image Pretraining via Self-Distilled Encoders

Noise-Aware Image Captioning with Progressively Exploring Mismatched Words

DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding

SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
