Cross-modal Relationship Research Articles

Multimodal neuroimaging has gained traction in Alzheimer's Disease (AD) diagnosis by integrating information from multiple imaging modalities to enhance classification accuracy. However, effectively handling heterogeneous data sources and overcoming the challenges posed by multiscale transform methods remain significant hurdles. This article proposes a novel approach to address these challenges. To harness the power of diverse neuroimaging data, we employ a strategy that leverages optimized convolution techniques. These optimizations include varying kernel sizes and the incorporation of instance normalization, both of which play crucial roles in feature extraction from magnetic resonance imaging (MRI) and positron emission tomography (PET) images. Specifically, varying kernel sizes allow us to adapt the receptive field to different image characteristics, enhancing the model's ability to capture relevant information. Furthermore, we employ transposed convolution, which increases the spatial resolution of feature maps and is likewise optimized with varying kernel sizes and instance normalization. This heightened resolution facilitates the alignment and integration of disparate MRI and PET data. The use of larger kernels and strides in transposed convolution expands the receptive field, enabling the model to capture essential cross-modal relationships. Instance normalization, applied to each modality during the fusion process, mitigates potential biases stemming from differences in intensity, contrast, or scale between modalities. This enhancement contributes to improved model performance by reducing complexity and ensuring robust fusion.
The performance of the proposed fusion method is assessed on three distinct neuroimaging datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), consisting of 50 participants each at various stages of AD for both MRI and PET (Cognitive Normal, AD, and Early Mild Cognitive); the Open Access Series of Imaging Studies (OASIS), consisting of 50 participants each at various stages of AD for both MRI and PET (Cognitive Normal, Mild Dementia, Very Mild Dementia); and whole-brain atlas neuroimaging (AANLIB), consisting of 50 participants each at various stages of AD for both MRI and PET (Cognitive Normal, AD). To evaluate the quality of the fused images generated via our method, we employ a comprehensive set of evaluation metrics, including Structural Similarity Index Measurement (SSIM), which assesses the structural similarity between two images; Peak Signal-to-Noise Ratio (PSNR), which measures how closely the generated image resembles the ground truth; Entropy (E), which assesses the amount of information preserved or lost during fusion; the Feature Similarity Indexing Method (FSIM), which assesses the structural and feature similarities between two images; and Edge-Based Similarity (EBS), which measures the similarity of edges between the fused and ground truth images. The obtained fused image is further evaluated using a Mobile Vision Transformer. In the classification of AD vs. Cognitive Normal, the model achieved an accuracy of 99.00%, specificity of 99.00%, and sensitivity of 98.44% on the AANLIB dataset.
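
Two of the metrics named above, PSNR and entropy, have compact textbook definitions that can be sketched in a few lines. The snippet below is only an illustration under stated assumptions, not the authors' evaluation code: the function names `psnr` and `entropy`, the 8-bit peak value of 255, and the toy 2x2 images are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def psnr(reference, fused, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized 2-D images.

    max_value is the assumed peak intensity (255 for 8-bit images).
    """
    flat_ref = [p for row in reference for p in row]
    flat_fus = [p for row in fused for p in row]
    if len(flat_ref) != len(flat_fus):
        raise ValueError("images must contain the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_fus)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def entropy(image):
    """Shannon entropy (bits) of the image's intensity histogram."""
    flat = [p for row in image for p in row]
    counts = Counter(flat)
    n = len(flat)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy example: one pixel of the "fused" image differs by 10 grey levels.
reference = [[0, 255], [0, 255]]
fused = [[0, 255], [0, 245]]
print(round(psnr(reference, fused), 2))
print(entropy(reference))
```

A higher PSNR means the fused image is closer to the reference, while a higher entropy means a more spread-out intensity histogram, which is commonly read as more information content.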

With the massive social media data available online, the conventional single modality emotion classification has developed into more complex models of multimodal sentiment analysis. Most existing works simply extracted image features at a coarse level, resulting in the absence of partially detailed visual features. Besides, social media data usually contain multiple images, while existing works considered a single image case and used only one image for representing visual features. In fact, it is nontrivial to extend the single image case to the multiple images case, due to the complex relations among multiple images. To solve the above issues, in this paper, we propose a Gated Fusion Semantic Relation (GFSR) network to explore semantic relations for social media sentiment analysis. In addition to inter-relations between visual and textual modalities, we also exploit intra-relations among multiple images, potentially improving the sentiment analysis performance. Specifically, we design a gated fusion network to fuse global image embeddings and the corresponding local Adjective Noun Pair (ANP) embeddings. Then, apart from textual relations and cross-modal relations, we employ the multi-head cross attention mechanism between images and ANPs to capture similar semantic contents. Eventually, the updated textual and visual representations are concatenated for the final sentiment prediction.
Extensive experiments are conducted on the real-world Yelp and Flickr30k datasets, showing that our GFSR improves accuracy by about 0.10% to 3.66% on the Yelp dataset with multiple images, and achieves the best accuracy for two classes and the best macro F1 for three classes on the Flickr30k dataset with a single image.
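
The abstract does not spell out how the gated fusion of global image embeddings and local ANP embeddings is computed. As a rough illustration, the sketch below uses one common form of gating, in which a sigmoid gate computed from the concatenated inputs mixes the two vectors element-wise. This is an assumption made for the example, not the GFSR formulation itself, and `gated_fusion`, `weights`, and `bias` are hypothetical names.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(image_vec, anp_vec, weights, bias):
    """Fuse two same-length embeddings with an element-wise gate.

    gate  = sigmoid(W [image; anp] + b)   (assumed form, not from the paper)
    fused = gate * image + (1 - gate) * anp
    weights has d rows of length 2d; bias has length d.
    """
    if len(image_vec) != len(anp_vec):
        raise ValueError("embeddings must have the same dimension")
    concat = list(image_vec) + list(anp_vec)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    return [g * i + (1.0 - g) * a for g, i, a in zip(gate, image_vec, anp_vec)]


# With zero weights and bias the gate is 0.5 everywhere, so the result
# is the plain average of the two embeddings.
image_vec = [1.0, 3.0]
anp_vec = [3.0, 1.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
print(gated_fusion(image_vec, anp_vec, zero_weights, [0.0, 0.0]))
```

In a trained network the weights and bias would be learned, letting the gate decide per dimension how much to trust the global image feature versus the local ANP feature.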

Articles published on Cross-modal Relationship

Pan-Mamba: Effective pan-sharpening with state space model

Text-and-Image Learning Transformer for Cross-modal Person Re-identification

Incorporating texture and silhouette for video-based person re-identification

How bodily perception parallels distal perception

Opposite perceptual biases in analogous auditory and visual tasks are unique to consonant–vowel strings and are unlikely a consequence of repetition

Cascade transformers with dynamic attention for video question answering

CAF-ODNN: Complementary attention fusion with optimized deep neural network for multimodal fake news detection

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Visuomotor Navigation for Embodied Robots With Spatial Memory and Semantic Reasoning Cognition.

Multi-modal fusion network with intra- and inter-modality attention for prognosis prediction in breast cancer

Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering

Optimized Convolutional Fusion for Multimodal Neuroimaging in Alzheimer's Disease Diagnosis: Enhancing Data Integration and Feature Extraction.

Personalized optimal room temperature and illuminance for maximizing occupant's mental task performance using physiological data

Cross-Modal Sentiment Analysis of Text and Video Based on Bi-GRU Cyclic Network and Correlation Enhancement

Semantic and Relation Modulation for Audio-Visual Event Localization.

Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval

Cross-scale cascade transformer for multimodal human action recognition

Exploring Semantic Relations for Social Media Sentiment Analysis

c-SNE: Deep Cross-modal Retrieval based on Subjective Information using Stochastic Neighbor Embedding

Improving Inconspicuous Attributes Modeling for Person Search by Language.
