Textual Modalities Research Articles

In this paper, we present all-embracing Transformers (AaTs) that apply the attention mechanism to Received Signal Strength (RSS) fingerprints in order to improve localization performance. Because most machine learning models applied to the RSS modality lack any attention mechanism, they capture only superficial representations. Moreover, compared to textual and visual modalities, the RSS modality is notoriously sensitive to environmental dynamics. These difficulties keep such models from reaching the subtle but distinct representations that characterize each location, ultimately causing significant degradation in the testing phase. In contrast, a major appeal of AaTs is their ability to focus on the relevant anchors in RSS sequences, which lets them exploit subtle and distinct representations for specific locations. It also helps them disregard redundant clues produced by noisy ambient conditions, thus enhancing localization accuracy. In addition, explicitly resolving representation collapse (i.e., non-informative or homogeneous features, and vanishing gradients) can further strengthen the self-attention process in transformer blocks, so that subtle but distinct representations of specific locations are captured more readily. For that purpose, we first enhance our proposed model with two sub-constraints, namely covariance and variance losses at the Anchor2Vec. The proposed constraints are automatically balanced against the primary task in a novel multi-task learning manner. We then refine the design further with a few simple tweaks to the transformer encoder blocks. These tweaks aim to promote representation augmentation by stabilizing the flow of gradients into these blocks, thereby tackling the representation collapse problem of regular Transformers.
To evaluate our AaTs, we compare them with state-of-the-art (SoTA) methods on three benchmark indoor localization datasets. The experimental results confirm our hypothesis and show that the proposed models deliver much higher and more stable accuracy.
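
The abstract above names covariance and variance losses as guards against representation collapse but does not give their form. The following is a minimal sketch of one common formulation (a VICReg-style hinge on per-dimension standard deviation plus a penalty on off-diagonal covariance), which is an assumption and not necessarily the authors' definition. The function names, `target_std`, `eps`, and the toy batches are all hypothetical.

```python
import math

def variance_loss(batch, target_std=1.0, eps=1e-4):
    """Hinge penalty on each embedding dimension whose std falls below target_std.

    Assumed VICReg-style form; the paper's exact loss may differ.
    """
    n, d = len(batch), len(batch[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        total += max(0.0, target_std - math.sqrt(var + eps))
    return total / d

def covariance_loss(batch):
    """Sum of squared off-diagonal sample covariances, divided by the dimension d."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(r[i] * r[j] for r in centered) / (n - 1)
                total += cov ** 2
    return total / d

# Toy embeddings (hypothetical): rows are samples, columns are features.
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]    # identical rows
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
spread = [[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]]

for name, batch in [("collapsed", collapsed), ("correlated", correlated), ("spread", spread)]:
    print(name, variance_loss(batch), covariance_loss(batch))
```

In this sketch the collapsed batch is penalized by the variance term, the correlated batch by the covariance term, and the spread batch by neither. How the two terms are weighted against the localization loss is left out, since the abstract only states that the balance is mediated automatically.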

Emotion recognition using multimodal data is a widely adopted approach because it can significantly improve the quality of human interactions and benefits a range of applications. We present the Multimodal Emotion Lines Dataset (MELD) and a novel method for multimodal emotion recognition that combines a bi-lateral gradient graph neural network (Bi-LG-GCN) with feature extraction and pre-processing. The multimodal dataset uses fine-grained emotion labeling for textual, audio, and visual modalities. This work aims to identify affective states concealed in the textual and audio data for emotion recognition and sentiment analysis. We use pre-processing techniques to improve the quality and consistency of the data and so increase the dataset's usefulness. The process includes noise removal, normalization, and linguistic processing to deal with linguistic variances and background noise in the discourse. Kernel Principal Component Analysis (K-PCA) is employed for feature extraction, aiming to derive valuable attributes from each modality and encode labels for array values. We propose a Bi-LG-GCN-based architecture explicitly tailored for multimodal emotion recognition, effectively fusing data from various modalities. The Bi-LG-GCN system takes each modality's feature-extracted and pre-processed representation as input to the generator network, which generates realistic synthetic data samples that capture multimodal relationships. These synthetic samples serve as inputs to the discriminator network, which has been trained to distinguish genuine from synthetic data. With this approach, the model can learn discriminative features for emotion recognition and make accurate predictions regarding subsequent emotional states.
Our method was evaluated on the MELD dataset, yielding an accuracy of 80%, an F1-score of 81%, a precision of 81%, and a recall of 81%. The pre-processing and feature extraction steps improve the quality and discriminability of the input representations. Our Bi-LG-GCN-based approach, featuring multimodal data synthesis, outperforms contemporary techniques, thus demonstrating its practical utility.
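
The abstract above reports accuracy, precision, recall, and F1-score without stating how the per-class values are averaged. As an illustration only, the sketch below computes accuracy together with macro-averaged precision, recall, and F1 for a multi-class emotion classifier. Macro averaging is an assumption, and the function name `macro_metrics` and the toy labels are hypothetical, not taken from the paper.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1.

    Macro averaging (unweighted mean over classes) is an assumption;
    the paper may use weighted or micro averaging instead.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }

# Toy example with three emotion classes (hypothetical predictions).
y_true = ["joy", "joy", "anger", "neutral"]
y_pred = ["joy", "anger", "anger", "neutral"]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

Because the four reported figures depend on the averaging scheme, a comparison against other methods on MELD is only meaningful when the same scheme is used on both sides.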

Articles published on Textual Modalities

Mutual-Modality Adversarial Attack with Semantic Perturbation

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Causality-Inspired Invariant Representation Learning for Text-Based Person Retrieval

Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Dynamic Weighted Combiner for Mixed-Modal Image Retrieval

Data Roaming and Quality Assessment for Composed Image Retrieval

Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Grounding

A Hierarchical Network for Multimodal Document-Level Relation Extraction

Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval

Adaptive Graph Learning for Multimodal Conversational Emotion Detection

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

An inter-modal attention-based deep learning framework using unified modality for multimodal fake news, hate speech and offensive language detection

Seeing the world from its words: All-embracing Transformers for fingerprint-based indoor localization

SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback

Bridging the Cross-Modality Semantic Gap in Visual Question Answering

Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

DI-VTR: Dual inter-modal interaction model for video-text retrieval

Multimodal Emotion Recognition Using Bi-LG-GCN for MELD Dataset

Multimodal Social Network Analysis: Exploring Language, Visual, and Audio Data in Online Communities
