A Hierarchical Attention-Based Fusion Model for Multimodal Sentiment Analysis of Customer-Generated Review Videos

Abstract

Videos have become a common medium for customers to share feedback, driven by widespread internet use and the resulting proliferation of diverse content across social media platforms. These videos contain multimodal data (text, audio, and visual components), making them rich sources for sentiment analysis (SA). Multimodal sentiment analysis (MSA) combines data from all these modalities to improve the understanding of sentiment and the accuracy of prediction. However, existing multimodal fusion methods perform unsatisfactorily: they either simply concatenate raw features or overlook intricate interdependencies between modalities, and so fail to resolve conflicts between them. This work proposes a novel fusion method, Hierarchical Attention-Based Fusion (HABF), which applies self-attention and cross-attention mechanisms hierarchically to prioritize and integrate the most informative features from all modalities. HABF combines unimodal features by assigning contextual weights to each modality, with the aim of representing the sentiment in a video as accurately as possible. A support vector machine then classifies the fused representation as positive, neutral, or negative. The model is tested on two datasets: (1) the Customer-Generated Sentiment Videos (CGSV) dataset, which consists of restaurant reviews, and (2) the standard CMU-MOSI dataset. For feature extraction, the proposed model uses Bidirectional Long Short-Term Memory (BiLSTM) for text, Librosa for audio, and a Convolutional Neural Network (CNN) for visual data. On CMU-MOSI, the model achieves an accuracy of 82.43%, higher than the existing systems it is compared with. On the CGSV dataset, it achieves an accuracy of 78.10%. This study focuses mainly on improving MSA for assessing customer satisfaction, dining experience, and service quality, and thereby on supporting better business strategies for restaurants.
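The abstract does not give HABF's equations, so the following is only a rough sketch of how a hierarchy of self-attention, cross-attention, and modality-level weighting could be arranged. Everything here is an assumption made for illustration: the function names (`attend`, `habf_sketch`), the absence of learned projection matrices, the mean pooling, and the use of vector norms as modality scores in place of a learned scoring layer. The feature extractors and the SVM classifier are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def mean_pool(seq):
    """Average a sequence of vectors into one vector."""
    n = len(seq)
    return [sum(v[j] for v in seq) / n for j in range(len(seq[0]))]


def habf_sketch(text, audio, visual):
    """Hypothetical three-level fusion; not the authors' implementation."""
    named = (("text", text), ("audio", audio), ("visual", visual))
    # Level 1: self-attention inside each modality.
    feats = {name: attend(seq, seq, seq) for name, seq in named}
    # Level 2: each modality queries the other two (cross-attention).
    crossed = {}
    for name, seq in feats.items():
        others = [v for other, s in feats.items() if other != name for v in s]
        crossed[name] = mean_pool(attend(seq, others, others))
    # Level 3: modality-level weights. The vector norm is a stand-in for a
    # learned scoring function, which the abstract does not describe.
    names = list(crossed)
    scores = [math.sqrt(sum(x * x for x in crossed[n])) for n in names]
    weights = softmax(scores)
    dim = len(crossed[names[0]])
    fused = [sum(w * crossed[n][j] for w, n in zip(weights, names))
             for j in range(dim)]
    return fused, dict(zip(names, weights))
```

In a full pipeline, the fused vector would be passed to a classifier such as a support vector machine, and all three modalities would first be projected to a common feature dimension, which this sketch simply assumes.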

Similar Papers
  • Research Article
  • 10.11591/ijeecs.v40.i3.pp1707-1719
Aspect based multimodal sentiment analysis of product reviews using deep learning techniques
  • Dec 1, 2025
  • Indonesian Journal of Electrical Engineering and Computer Science
  • Anitha Padigapati + 1 more

Sentiment analysis plays a crucial role in understanding customer opinions, particularly in product reviews. Traditional approaches primarily focus on textual data; however, with the rise of social media, incorporating multimodal data, including text and emojis, enhances sentiment analysis accuracy. This research introduces a multimodal aspect-based sentiment analysis (MABSA) framework, integrating textual and emoji representations for Samsung M21 product reviews from Flipkart. The methodology involves data preprocessing, aspect extraction, sentiment grouping, and feature extraction using deep learning (DL) techniques. Bidirectional long short-term memory (Bi-LSTM) networks are employed for classification, leveraging Word2Vec, Emoji2Vec, and bidirectional encoder representations from transformers (BERT) embeddings. Experimental results show that BERT with Bi-LSTM outperforms Word2Vec with Bi-LSTM, achieving 95.6% accuracy in aspect prediction and 96.28% accuracy in sentiment classification. Comparative analysis with existing models highlights the superiority of the MASAT model, effectively integrating implicit aspects, emoticons, and emojis. The study demonstrates the importance of multimodal sentiment analysis for a more comprehensive understanding of user opinions, offering valuable insights for businesses to enhance customer satisfaction.

  • Research Article
  • Cited by 1
  • 10.55529/jecnam.44.22.31
Fine-Grained Sentiment Classification Using Generative Pretrained Transformer
  • Jul 19, 2024
  • Journal of Electronics,Computer Networking and Applied Mathematics
  • Gul Nawaz + 1 more

Social media platforms have seen a significant increase in the number of users and content in recent years. Owing to the increased usage of these platforms, incidents of teasing, provocation (both positive and negative), harassment, and community attacks have increased tremendously. There is an urgent need to automatically identify content or tweets that can harm the well-being of an individual or society. Sentiment analysis, which formerly focused on online product evaluations, has in recent years turned to analyzing social media messages from Twitter and Facebook. It is used in a wide range of fields besides product reviews, including harassment, stock markets, elections, disasters, and software engineering. After the tweets have been preprocessed, the extracted features are categorized using classifiers such as decision trees, logistic regression, multinomial naive Bayes, support vector machines, random forests, and Bernoulli naive Bayes, as well as deep learning techniques such as recurrent neural network (RNN), long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM), and convolutional neural network (CNN) models. In this paper, different techniques are compared for classifying tweets into three categories: “positive,” “negative,” and “neutral.” We propose a novel data-balancing technique for text classification. A text classification technique based on the Generative Pretrained Transformer model is proposed for analyzing textual data, owing to the model's contextual understanding and more realistic data generation capability. A comparative analysis of different machine learning and deep learning models is performed with and without data balancing. The experiments show that the accuracy and F1-measure of the Twitter sentiment classifier are improved. The proposed ensemble outperforms the compared models, achieving an accuracy of 90%, a precision of 88%, and an F1 score of 81%.

  • Research Article
  • Cited by 12
  • 10.1371/journal.pone.0273936
AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model
  • Sep 9, 2022
  • PLOS ONE
  • Ji Mingyu + 2 more

Multimodal sentiment analysis is an essential task in natural language processing which refers to the fact that machines can analyze and recognize emotions through logical reasoning and mathematical operations after learning multimodal emotional features. For the problem of how to consider the effective fusion of multimodal data and the relevance of multimodal data in multimodal sentiment analysis, we propose an attention-based mechanism feature relevance fusion multimodal sentiment analysis model (AFR-BERT). In the data pre-processing stage, text features are extracted using the pre-trained language model BERT (Bi-directional Encoder Representation from Transformers), and the BiLSTM (Bi-directional Long Short-Term Memory) is used to obtain the internal information of the audio. In the data fusion phase, the multimodal data fusion network effectively fuses multimodal features through the interaction of text and audio information. During the data analysis phase, the multimodal data association network analyzes the data by exploring the correlation of fused information between text and audio. In the data output phase, the model outputs the results of multimodal sentiment analysis. We conducted extensive comparative experiments on the publicly available sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that AFR-BERT improves on the classical multimodal sentiment analysis model in terms of relevant performance metrics. In addition, ablation experiments and example analysis show that the multimodal data analysis network in AFR-BERT can effectively capture and analyze the sentiment features in text and audio.

  • Research Article
  • 10.17485/ijst/v16i9.1164
New Education Policy 2020: A Sentiment Classification
  • May 3, 2023
  • Indian Journal Of Science And Technology
  • E Sujatha + 1 more

Objectives: To develop a multi-class classification model that performs better on a large dataset, to reduce the complexity of the model, and to analyse the sentiments of Twitter data efficiently. Methods: Sentiment analysis was performed on the New Education Policy 2020. In total, 105,045 tweets were collected from Twitter using the Tweepy library in Python. The sentiment analysis was done on English tweets. Preprocessing and feature extraction were done using PySpark packages. A hybrid of unigram and bigram feature sets was used. The AFINN dictionary was used to obtain the labelled dataset. Random Forest (machine learning) and Convolutional Neural Network and Bidirectional Long Short-Term Memory (deep learning) classifiers were used to determine the positive, negative and neutral sentiments of tweets. Findings: The hybrid of Convolutional Neural Network and Bidirectional Long Short-Term Memory obtained an Accuracy of 97%, Precision of 97%, Recall of 97%, F-Measure of 97% and ROC-AUC of 99%, with a minimum Log Loss of 0.10. Novelty: The complexity of the model was reduced by using a Convolutional Neural Network, which selects the relevant features. The performance of the model was evaluated using various metrics such as accuracy, precision, recall, F-score, log loss and ROC-AUC, whereas existing works used only limited metrics. The efficiency of the proposed model can be proved in any case.
Keywords: Random Forest Classifier (RF); Convolutional Neural Network (CNN); Bidirectional Long Short-Term Memory (BLSTM); Support Vector Machine (SVM); Term Frequency – Inverse Document Frequency (TF-IDF)
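The labelling and feature steps described above can be illustrated with a small sketch. This is not the authors' PySpark pipeline: it is plain Python, the tokenizer is deliberately simplistic, and `TOY_LEXICON` is an invented stand-in whose words and scores are not taken from the real AFINN dictionary.

```python
from collections import Counter

# Invented valence scores for illustration only; NOT the real AFINN entries.
TOY_LEXICON = {"good": 3, "great": 3, "helpful": 2,
               "bad": -3, "confusing": -2, "burden": -2}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def lexicon_label(tokens, lexicon=TOY_LEXICON):
    """Label a tweet by the sign of its summed word scores."""
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def unigram_bigram_features(tokens):
    """Count unigrams and bigrams together as one hybrid feature set."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)
```

For example, `lexicon_label(tokenize("The new policy is great!"))` yields `"positive"` under the toy lexicon, and the same tokens produce both `"policy"` and `"new policy"` as features.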

  • Conference Article
  • 10.1109/apcit65661.2025.11410655
Cross-Attention-Driven Multimodal Sentimental Analysis with Visual-Textual Integration and Gating Mechanism
  • Sep 19, 2025
  • Krishnaveni Vengala + 2 more

Traditional text-based sentiment analysis often falls short of capturing the emotional context of social media material, because platforms like Twitter blend text and images. To get around these limitations, this work employs a multimodal sentiment analysis technique that combines textual and visual data to improve sentiment classification. The first method effectively identifies objectionable content on Twitter by using Long Short-Term Memory (LSTM) networks for text feature extraction and Convolutional Neural Networks (CNNs) for visual feature analysis. Building on this, a cross-attention-based fusion model with a gating mechanism is used to further enhance sentiment classification into positive, neutral, and negative categories. This second model interprets text using ALBERT and BiLSTM networks and evaluates images through DenseNet and the Convolutional Block Attention Module (CBAM). Its cross-attention mechanism supports more complex and accurate sentiment analysis by coordinating the emotional information from each modality, and the gating mechanism decides how much of the cross-attended and single-modality features is used in the final fusion, so that noisy data does not dilute feature quality. These results demonstrate the power of advanced fusion techniques in multimodal sentiment analysis to reach a significantly richer and more contextualized interpretation of sentiment in social media than purely text-based approaches.
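The abstract does not say how the gate is parameterized. A common pattern, shown here purely as an assumed illustration, is a sigmoid gate computed from the concatenated cross-attended and single-modality vectors, which then interpolates between the two. The name `gated_fusion`, the single scalar gate, and the hand-set weights are all hypothetical; in a trained model the gate parameters would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(cross_attended, unimodal, gate_weights, gate_bias):
    """Blend two equal-length vectors with one scalar gate in (0, 1).

    gate_weights has one entry per element of the concatenated input
    [cross_attended; unimodal]. A gate near 1 trusts the cross-attended
    features; a gate near 0 falls back to the single-modality features.
    """
    joint = list(cross_attended) + list(unimodal)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias)
    fused = [gate * c + (1.0 - gate) * u
             for c, u in zip(cross_attended, unimodal)]
    return fused, gate
```

With all-zero weights and zero bias the gate is exactly 0.5 and the output is the average of the two inputs; a strongly negative bias pushes the output toward the single-modality vector, which is the behaviour one would want when the cross-attended signal is noisy.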

  • Research Article
  • Cited by 2
  • 10.30865/mib.v7i3.6120
Bank Central Asia (BBCA) Stock Price Sentiment Analysis On Twitter Data Using Neural Convolutional Network (CNN) And Bidirectional Long Short-Term Memory (BI-LSTM)
  • Jul 23, 2023
  • JURNAL MEDIA INFORMATIKA BUDIDARMA
  • Mansel Lorenzo Nugraha + 1 more

Stock investing has become popular among the public. Although stock investment carries significant risk, the number of investors keeps increasing every year because the returns from stocks are also quite promising. Social media supports stock investing by spreading information widely and very quickly, and so it can affect stock prices. The Efficient Market Hypothesis (EMH) holds that stock prices reflect market information. In this research, sentiment analysis is applied to a dataset crawled from Twitter to turn sentiment into useful information. All tweets related to stock prices are collected and classified by sentiment type, positive or negative. Many believe that sentiment influences stock price movements. The sentiment analysis uses a hybrid of a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) with Word2Vec feature expansion, and the final accuracy of this hybrid method is then evaluated. This research uses 27,930 data points, and the hybrid CNN Bi-LSTM method achieves an accuracy of 95.74%.

  • Research Article
  • Cited by 59
  • 10.1016/j.neucom.2020.10.021
A cognitive brain model for multimodal sentiment analysis based on attention neural networks
  • Oct 14, 2020
  • Neurocomputing
  • Yuanqing Li + 3 more

  • Research Article
  • Cited by 6
  • 10.1108/k-04-2023-0723
Understanding public opinions on Chinese short video platform by multimodal sentiment analysis using deep learning-based techniques
  • Sep 12, 2023
  • Kybernetes
  • Wei Shi + 2 more

Purpose: With the rapid development of short videos in China, the public has become accustomed to using short videos to express their opinions. This paper aims to solve problems such as how to represent the features of different modalities and achieve effective cross-modal feature fusion when analyzing the multi-modal sentiment of Chinese short videos (CSVs). Design/methodology/approach: This paper proposes a sentiment analysis model, MSCNN-CPL-CAFF, which uses a multi-scale convolutional neural network and a cross attention fusion mechanism to analyze CSVs. The audio-visual and textual data of CSVs themed on “COVID-19, catering industry” are first collected from the CSV platform Douyin, and then a comparative analysis is conducted with advanced baseline models. Findings: Samples with weak negative and neutral sentiment are the most numerous, while samples with positive and weak positive sentiment are relatively few, accounting for only about 11% of the total. The MSCNN-CPL-CAFF model achieves Acc-2, Acc-3 and F1 scores of 85.01%, 74.16% and 84.84%, respectively, which outperforms the highest values of the baseline methods in accuracy while achieving competitive computation speed. Practical implications: This research offers some implications regarding the impact of COVID-19 on the catering industry in China by focusing on the multi-modal sentiment of CSVs. The methodology can be used to analyze and categorize the opinions of the general public on social media platforms. Originality/value: This paper presents a novel deep-learning multimodal sentiment analysis model, which provides a new perspective for public opinion research on short video platforms.

  • Research Article
  • 10.55041/ijsrem16617
Comparative Analysis of Deep Learning Approaches for Twitter Text Classification
  • Oct 21, 2022
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Lukesh Kadu

Abstract: Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models such as RoBERTa, more difficult data domains can also be analyzed, e.g., news texts, where authors typically express their opinion or sentiment less explicitly. Sentiment analysis aims to extract opinions automatically from data and classify them as positive or negative. Twitter, a widely used social media tool, is seen as an important source of information on people's attitudes, emotions, views, and feedback. Within this context, Twitter sentiment analysis techniques were developed to decide whether textual tweets express a positive or negative opinion. In contrast to the lower classification performance of traditional algorithms, deep learning models, including the Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM), have achieved significant results in sentiment analysis. Keras is a Deep Learning (DL) framework that provides an embedding layer to produce vector representations of the words in a document. The objective of this work is to analyze the performance of the deep learning models Convolutional Neural Network (CNN), Simple Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), BERT and RoBERTa for classifying Twitter reviews. The experiments show that the RoBERTa model performs better than the CNN and simple RNN for sentiment classification.
Keywords: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Deep Learning, Bidirectional Long Short-Term Memory (BiLSTM), Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pre-training Approach (RoBERTa).

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-16-0708-0_3
Gujarati Task Oriented Dialogue Slot Tagging Using Deep Neural Network Models
  • Jan 1, 2021
  • Rachana Parikh + 1 more

The primary focus of this paper is slot tagging of Gujarati dialogue, which enables communication in the Gujarati language between human and machine, allowing machines to perform a given task and provide the desired output. The accuracy of tagging depends entirely on the bifurcation of slots and on word embedding. Proper slot tagging is also very challenging because dialogue and speech differ from human to human, which makes the slot tagging methodology more complex. Various deep learning models are available for slot tagging; this paper mainly focuses on Long Short-Term Memory (LSTM), Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM), Long Short-Term Memory – Conditional Random Field (LSTM-CRF), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network - Bidirectional Long Short-Term Memory (CNN-BiLSTM) and Bidirectional Long Short-Term Memory – Conditional Random Field (BiLSTM-CRF). Comparing these models with each other, it is observed that BiLSTM models perform better than LSTM models by about 2% in F1-measure, as they contain an additional layer that lets the word string be traversed backward as well as forward. Among the BiLSTM models, BiLSTM-CRF outperforms the other two. Its F1-measure is better than that of CNN-BiLSTM by 1.2% and that of BiLSTM by 2.4%.
Keywords: Spoken Language Understanding (SLU); Long Short-Term Memory (LSTM); Slot tagging; Bidirectional Long Short-Term Memory (BiLSTM); Convolutional Neural Network - Bidirectional Long Short-Term Memory (CNN-BiLSTM); Bidirectional Long Short-Term Memory – Conditional Random Field (BiLSTM-CRF)

  • Research Article
  • 10.1145/3695251
Empowering Digital Civility with an NLP Approach for Detecting 𝕏 (Formerly Known as Twitter) Cyberbullying through Boosted Ensembles
  • Nov 23, 2024
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Senthil Prabakaran + 2 more

As the number of social networking sites grows, so do cyber dangers. Cyberbullying is harmful behavior that uses technology to intimidate, harass, or harm someone, often on social media platforms like 𝕏 (formerly known as Twitter). Machine learning is well suited to cyberbullying detection on 𝕏 because it can process large amounts of data, identify patterns of offensive behavior, and automate detection across a corpus of tweets. To identify cyber threats using a trained model, the boosted ensemble (BE) technique is assessed against various machine learning algorithms: the convolutional neural network (CNN), long short-term memory (LSTM), naive Bayes (NB), decision tree (DT), support vector machine (SVM), bidirectional LSTM (BILSTM), recurrent neural network LSTM (RNN-LSTM), multi-modal cyberbullying detection (MMCD), and random forest (RF). These classifiers are trained on the vectorized data to classify tweets and identify cyberbullying threats. The proposed framework can detect cyberbullying cases in tweets precisely. The significance of the work lies in detecting and mitigating cyber threats in real time; it enhances the safety and well-being of social media users by reducing instances of cyberbullying and other cyber threats. The comparative analysis uses metrics such as accuracy, precision, recall, and F1-score, and the results show that the BE technique outperforms the other compared algorithms in overall performance. Respectively, the accuracy rates of CNN, LSTM, NB, DT, SVM, RF, BILSTM, and BE are 92.5%, 93.5%, 84.6%, 88%, 89.3%, 92%, 93.75%, and 96%; the precision rates of CNN, LSTM, NB, DT, SVM, RF, RNN-LSTM, and BE are 90.2%, 91.3%, 88%, 85%, 86%, 91.6%, 92.1%, and 94%; the recall rates of CNN, LSTM, NB, DT, SVM, RF, BILSTM, and BE are 89.8%, 90.7%, 90%, 82%, 88.67%, 89%, 91.04%, and 93.7%; and the F1-scores of CNN, LSTM, NB, DT, SVM, RF, MMCD, and BE are 90.6%, 91.8%, 85%, 84.56%, 87.2%, 90%, 84.6%, and 94.89%.

  • Research Article
  • Cited by 11
  • 10.3934/mbe.2023822
AB-GRU: An attention-based bidirectional GRU model for multimodal sentiment fusion and analysis.
  • Jan 1, 2023
  • Mathematical Biosciences and Engineering
  • Jun Wu + 4 more

Multimodal sentiment analysis is an important area of artificial intelligence. It integrates multiple modalities such as text, audio, video and image into a compact multimodal representation and obtains sentiment information from them. In this paper, we improve two modules, feature extraction and feature fusion, to enhance multimodal sentiment analysis, and propose an attention-based two-layer bidirectional GRU (AB-GRU, gated recurrent unit) multimodal sentiment analysis method. For the feature extraction module, we use a two-layer bidirectional GRU network and connect two layers of attention mechanisms to enhance the extraction of important information. The feature fusion part uses low-rank multimodal fusion, which can reduce the multimodal data dimensionality and improve the computational rate and accuracy. The experimental results demonstrate that the AB-GRU model achieves 80.9% accuracy on the CMU-MOSI dataset, which exceeds models of the same type by at least 2.5%. The AB-GRU model also possesses a strong generalization capability and solid robustness.
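To make the dimensionality argument concrete, here is a small sketch of the general idea behind low-rank multimodal fusion. It is not taken from the AB-GRU paper and assumes the commonly used formulation in which the fusion weight tensor is a sum of rank-one terms, so the output can be computed from per-modality dot products without ever building the full outer-product tensor. The function names and the single scalar output are simplifications chosen for illustration.

```python
import itertools
import math


def with_bias(z):
    """Append a constant 1 so unimodal and bimodal terms are retained."""
    return list(z) + [1.0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def low_rank_fusion(inputs, factors):
    """Fuse modality vectors without forming the outer-product tensor.

    inputs[m] is the feature vector of modality m.
    factors[m][i] is the rank-i weight vector for modality m, with length
    len(inputs[m]) + 1. Returns one scalar output unit.
    """
    zs = [with_bias(z) for z in inputs]
    rank = len(factors[0])
    return sum(math.prod(dot(f[i], z) for f, z in zip(factors, zs))
               for i in range(rank))


def full_tensor_fusion(inputs, factors):
    """Reference computation that expands the full weight tensor entry by
    entry; its cost grows with the product of the modality dimensions."""
    zs = [with_bias(z) for z in inputs]
    rank = len(factors[0])
    total = 0.0
    for idx in itertools.product(*(range(len(z)) for z in zs)):
        weight = sum(math.prod(f[i][j] for f, j in zip(factors, idx))
                     for i in range(rank))
        total += weight * math.prod(z[j] for z, j in zip(zs, idx))
    return total
```

Both functions return the same value for the same factors; the low-rank version needs work proportional to the rank times the sum of the modality dimensions, whereas the reference version scales with their product.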

  • Research Article
  • 10.14569/ijacsa.2024.01508119
TGMoE: A Text Guided Mixture-of-Experts Model for Multimodal Sentiment Analysis
  • Jan 1, 2024
  • International Journal of Advanced Computer Science and Applications
  • Xueliang Zhao + 3 more

Multimodal sentiment analysis seeks to determine the sentiment polarity of targets by integrating diverse data types, including text, visual, and audio modalities. However, during the process of multimodal data fusion, existing methods often fail to adequately analyze the sentimental relationships between different modalities and overlook the varying contributions of different modalities to sentiment analysis results. To address this issue, we propose a Text Guided Mixture-of-Experts (TGMoE) Model for Multimodal Sentiment Analysis. Based on the varying contributions of different modalities to sentiment analysis, this model introduces a text guided cross-modal attention mechanism that fuses text separately with visual and audio modalities, leveraging attention to capture interactions between these modalities and effectively enrich the text modality with supplementary information from the visual and audio data. Additionally, by employing a sparsely gated mixture of expert layers, the TGMoE model constructs multiple expert networks to simultaneously learn sentiment information, enhancing the nonlinear representation capability of multimodal features. This approach makes multimodal features more distinguishable concerning sentiment, thereby improving the accuracy of sentiment polarity judgments. The experimental results on the publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI show that the TGMoE model outperforms most existing multimodal sentiment analysis models and can effectively improve the performance of sentiment analysis.
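As a rough illustration of what "sparsely gated" means here, the sketch below keeps only the top-k experts by gate score, renormalizes their weights with a softmax, and mixes their outputs. It is a generic toy version, not the TGMoE architecture: the names `top_k_gate` and `mixture_of_experts` are hypothetical, the experts are plain Python callables, and the gate logits are passed in directly rather than produced by a learned gating network.

```python
import math


def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0
            for i in range(len(logits))]


def mixture_of_experts(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    weights = top_k_gate(gate_logits, k)
    # Experts with zero weight are never evaluated, which is where the
    # computational saving of sparse gating comes from.
    active = [(w, experts[i](x)) for i, w in enumerate(weights) if w > 0.0]
    dim = len(active[0][1])
    return [sum(w * out[j] for w, out in active) for j in range(dim)]
```

With `k=1` the layer reduces to routing each input to its single best-scoring expert; larger `k` trades extra computation for a smoother blend of experts.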

  • Conference Article
  • Cited by 91
  • 10.18653/v1/d19-1566
Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis
  • Jan 1, 2019
  • Dushyant Singh Chauhan + 3 more

Dushyant Singh Chauhan, Md Shad Akhtar, Asif Ekbal, Pushpak Bhattacharyya. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
