Fine-grained Alignment Research Articles

Abstract: The Multimodal Named Entity Recognition (MNER) task enhances text representations and improves the accuracy and robustness of named entity recognition by leveraging visual information from images. However, previous methods have two limitations. (i) The semantic mismatch between the text and image modalities makes it difficult to establish accurate connections between words and visual representations. The limited number of characters in social media posts also causes semantic and contextual ambiguity, which further exacerbates this mismatch. (ii) Existing methods rely on cross-modal attention mechanisms for interaction and fusion between modalities, and so overlook fine-grained correspondences between the semantic units of text and images. To alleviate these issues, we propose a graph neural network approach for MNER (GNN-MNER), which promotes fine-grained alignment and interaction between semantic units of different modalities. Specifically, to mitigate the semantic mismatch between modalities, we construct graph structures for text and images and use graph convolutional networks to augment the text and visual representations. To address the second issue, we propose a multimodal interaction graph that explicitly represents the fine-grained semantic correspondences between text and visual objects. Based on this graph, we implement deep-level feature fusion between modalities using graph attention networks. Compared with existing methods, our approach is the first to extend graph deep learning throughout the MNER task. Extensive experiments on the Twitter multimodal datasets validate the effectiveness of GNN-MNER.
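As a rough illustration of the graph-convolution step mentioned in the abstract above, the sketch below applies one standard GCN layer, ReLU(D^{-1/2}(A + I)D^{-1/2} H W), to a tiny word graph. The abstract does not give the layer definition, so this commonly used normalized form is an assumption, and the graph, features, weights, and function names are all hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own feature vector.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    out = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Hypothetical 3-word chain graph: word0 - word1 - word2
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d word embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]                # identity, for readability

updated = gcn_layer(adj, features, weight)
for row in updated:
    print([round(v, 3) for v in row])
```

With the identity weight matrix, each output row is simply a degree-normalized mix of a word's own features and those of its neighbours, which is the sense in which the graph "augments" each representation.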

Chest radiology imaging plays a crucial role in the early screening, diagnosis, and treatment of chest diseases. Accurate interpretation of radiological images and automatic generation of radiology reports not only save doctors' time but also reduce the risk of diagnostic errors. The core objective of automatic radiology report generation is a precise mapping between visual features and lesion descriptions at multi-scale and fine-grained levels. Existing methods typically combine global visual features and textual features to generate radiology reports. However, these approaches may ignore key lesion areas and lack sensitivity to crucial lesion location information. Furthermore, multi-scale characterization and fine-grained alignment of medical visual features and report text features remain challenging, which reduces the quality of the generated reports. To address these issues, we propose a method for chest radiology report generation based on cross-modal multi-scale feature fusion. First, an auxiliary labeling module is designed to guide the model to focus on the lesion region of the radiological image. Second, a channel attention network is employed to enhance the characterization of location information and disease features. Finally, a cross-modal feature fusion module is constructed by combining memory matrices, facilitating fine-grained alignment between multi-scale visual features and report text features at corresponding scales. The proposed method is evaluated experimentally on two publicly available radiological image datasets. The results demonstrate superior performance on BLEU and ROUGE metrics compared with existing methods. In particular, there are improvements of 4.8% in the ROUGE metric and 9.4% in the METEOR metric on the IU X-Ray dataset, and improvements of 7.4% in BLEU-1 and 7.6% in BLEU-2 on the MIMIC-CXR dataset.
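The channel attention step named in the abstract above can be pictured with a minimal sketch. The abstract does not specify the network's form, so the code assumes a squeeze-and-excitation style gate (global average pooling, a small bottleneck, then a sigmoid weight per channel); the feature map, the weights, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_map, w_reduce, w_expand):
    """Reweight channels of a feature map given as C grids of size H x W."""
    # Squeeze: one global average per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: bottleneck layer with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
             for row in w_expand]
    # Scale every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return gates, scaled

# Hypothetical 2-channel, 2x2 feature map.
feature_map = [
    [[4.0, 4.0], [4.0, 4.0]],  # strongly activated channel
    [[0.0, 1.0], [1.0, 0.0]],  # weakly activated channel
]
w_reduce = [[1.0, -1.0]]       # 2 channels -> 1 hidden unit (toy weights)
w_expand = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates (toy weights)

gates, scaled = channel_attention(feature_map, w_reduce, w_expand)
print([round(g, 3) for g in gates])
```

In this toy setup the first channel receives a gate close to 1 and the second a gate close to 0, so the strongly activated channel is emphasized and the other is suppressed. In a trained model the two weight matrices would be learned rather than fixed by hand.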

Articles published on Fine-grained Alignment

Fine-grained semantic oriented embedding set alignment for text-based person search

Fine-Grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection.

SDDA: A progressive self-distillation with decoupled alignment for multimodal image–text classification

Visual primitives as words: Alignment and interaction for compositional zero-shot learning

GNN-Based Multimodal Named Entity Recognition

Direction-Aware Video Demoiréing with Temporal-Guided Bilateral Learning

MGSGA: Multi-grained and Semantic-Guided Alignment for Text-Video Retrieval

Chest radiology report generation based on cross-modal multi-scale feature fusion

Self-Supervised Multimodal Learning: A Survey.

PhraseAug: An Augmented Medical Report Generation Model with Phrasebook.

Automatic Seabed Target Segmentation of AUV via Multilevel Adversarial Network and Marginal Distribution Adaptation

Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching.

Towards Fast and Accurate Image-Text Retrieval With Self-Supervised Fine-Grained Alignment

Eye Gaze Guided Cross-Modal Alignment Network for Radiology Report Generation.

An Image-Text Matching Method for Multi-Modal Robots

Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval

Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis

Critical Classes and Samples Discovering for Partial Domain Adaptation.

Unsupervised Adversarial Domain Adaptation Regression for Rate of Penetration Prediction

Text-based person search via local-relational-global fine grained alignment
