AraABSAMD: A novel Arabic dataset for aspect-based sentiment analysis in the Moroccan education domain
- Conference Article
4
- 10.1109/iciss53185.2021.9533255
- Aug 2, 2021
Homecoming, traditionally known as Mudik, became a trending topic on several social media platforms as soon as the 11-day ban on the homecoming ritual was announced on 7 April 2021. Opinions both for and against the ban rapidly began to appear. Twitter, a social media platform now considered an extension of oneself and often used to express one's opinions, became flooded with comments on the ban. These tweets were then used as a dataset for sentiment analysis in order to understand how people perceive the ban. The classification algorithm used in this research is the Support Vector Machine. The dataset was classified into three sentiments: positive, negative, and neutral. The Support Vector Machine yielded 62% accuracy on this dataset. The sentiment analysis showed that the keyword "mudik" had a neutral sentiment for the most part. Meanwhile, the engagement analysis shows that the most common forms of engagement were retweets and likes of tweets with a neutral sentiment. When the neutral sentiment was removed, the dominant sentiment on the homecoming ritual ban was negative. This is likely due to the release, on 22 April 2021, of an addendum to the Covid-19 Handling Task Force Circular Number 13 of 2021 that imposes more restrictions and extends the effective dates of the restrictions related to the ban; this was exactly one day before the 5000 tweets were scraped on 23 April 2021. The researcher sampled the tweets with the most engagement (those with the most retweets and likes). Some of these tweets had a negative sentiment but were classified by the model as neutral. This may be caused by inaccuracies in the training dataset, as some of the tweets were in Malay rather than Indonesian.
A challenge that needs to be overcome is the limited number of datasets for NLP training or sentiment analysis in Indonesian compared with English. On the other hand, this is an opportunity for the researcher to develop a more appropriate training model.
- Conference Article
3
- 10.1109/ccai55564.2022.9807755
- May 6, 2022
Korean is the native and official language of Chinese-Korean people, and Weibo is a social media platform with a huge number of users in China. Currently, there are few studies on sentiment analysis of Korean-language Weibo texts posted by Chinese-Korean users. In this paper, we propose a sentiment classification method for Chinese-Korean Weibo based on a pre-trained language model and transfer learning. First, we crawled Chinese-Korean Weibo data from Sina Weibo and labeled it with sentiment to obtain the Chinese-Korean Weibo sentiment analysis (CKWSA) dataset. Second, to address the small number of training samples in the CKWSA dataset, we fine-tune a classifier based on a pre-trained Korean language model on a Korean Twitter sentiment analysis dataset to obtain a Korean Twitter sentiment classification model, and then further fine-tune that model on the CKWSA dataset to obtain the Chinese-Korean Weibo sentiment classification model. The experiments show that the proposed method performs well and improves on other baselines on the CKWSA dataset.
- Research Article
9
- 10.3390/electronics10070800
- Mar 28, 2021
- Electronics
Deep learning-based methods have achieved good performance in various recognition benchmarks, mostly by utilizing single modalities. As different modalities contain complementary information, multi-modal methods have been proposed to utilize them implicitly. In this paper, we propose a simple technique, called correspondence learning (CL), which explicitly learns the relationship among multiple modalities. The multiple modalities in the data samples are randomly mixed among different samples. If the modalities are from the same sample (not mixed), they have positive correspondence; otherwise, they have negative correspondence. CL is an auxiliary task in which the model predicts the correspondence among modalities. The model is expected to extract information from each modality to check correspondence and thereby achieve better representations in multi-modal recognition tasks. In this work, we first validate the proposed method on various multi-modal benchmarks, including the CMU Multimodal Opinion-Level Sentiment Intensity (CMU-MOSI) and CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) sentiment analysis datasets. In addition, we propose a fraud detection method using the learned correspondence among modalities. To validate this additional usage, we collect a multi-modal dataset for fraud detection using real-world samples from reverse vending machines.
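The mixing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_correspondence_batch`, the `mix_prob` parameter, and the choice to swap exactly one modality per mixed sample are all assumptions made for illustration. The sketch only builds the inputs and binary targets for the auxiliary correspondence task.

```python
import random


def make_correspondence_batch(samples, mix_prob=0.5, seed=0):
    """Build (mixed_sample, label) pairs for an auxiliary correspondence task.

    samples  -- list of dicts mapping a modality name to its features
    mix_prob -- hypothetical probability of mixing a sample (an assumption)
    label    -- 1 if all modalities come from the same sample, else 0
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to mix modalities")
    rng = random.Random(seed)
    batch = []
    for i, sample in enumerate(samples):
        mixed = dict(sample)
        label = 1  # positive correspondence: nothing was swapped
        if rng.random() < mix_prob:
            # Swap one modality with the same modality of a different sample.
            modality = rng.choice(sorted(sample))
            j = rng.choice([k for k in range(len(samples)) if k != i])
            mixed[modality] = samples[j][modality]
            label = 0  # negative correspondence: modalities are mixed
        batch.append((mixed, label))
    return batch


# Toy usage with placeholder "features" (strings stand in for tensors).
toy = [{"text": f"t{i}", "audio": f"a{i}", "video": f"v{i}"} for i in range(4)]
for mixed, label in make_correspondence_batch(toy, mix_prob=0.5, seed=1):
    print(mixed, label)
```

In a real model the label would feed a binary classification head trained jointly with the main recognition loss; how the two losses are weighted is not stated in the abstract.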
- Research Article
- 10.59313/jsr-a.1599759
- Jun 30, 2025
- Journal of Scientific Reports-A
This study analyzes the performance of the most downloaded language models on the Hugging Face platform. For this purpose, the five most downloaded language models in Turkish and English were used. The evaluation was carried out in three phases: contextual learning, question answering, and expert evaluation. The ARC, Turkish sentiment analysis, Hellaswag, and MMLU datasets were used for contextual learning. For the question-answering test, the models were trained on a prepared text file and then asked questions about that text. Finally, six experts evaluated the answers given by the models through a mobile application developed for the study. The F1 score was used for the contextual evaluation; the Rouge-1, Rouge-2, and Rouge-L metrics were used for question answering; and the Elo and TrueSkill metrics were used for the expert evaluations. The correlations between these metrics were calculated, and a correlation of 0.74 was found between expert evaluations and question-answering performance. In contrast, contextual learning and question-answering performance were not correlated. Overall, the timpal0l/mdeberta-v3-base-squad2 language model performed best. Turkish and English language models performed best on the sentiment analysis dataset, with an F1 score above 0.85.
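As a rough illustration of how an Elo rating could rank models from pairwise expert judgments, the sketch below applies the standard Elo update. The K-factor of 32, the starting rating of 1500, and the function names are assumptions; the abstract does not state which settings the study used.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# An expert prefers model A's answer over model B's; both start at 1500.
print(update_elo(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Repeating this update over all expert comparisons yields one rating per model, which can then be correlated with automatic metrics such as ROUGE.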
- Conference Article
60
- 10.18653/v1/n18-1171
- Jan 1, 2018
Sentiment analysis is used as a proxy to measure human emotion, where the objective is to categorize text according to some predefined notion of sentiment. Sentiment analysis datasets are typically constructed with gold-standard sentiment labels, assigned based on the results of manual annotations. When working with such annotations, it is common for dataset constructors to discard “noisy” or “controversial” data where there is significant disagreement on the proper label. In datasets constructed for the purpose of Twitter sentiment analysis (TSA), these controversial examples can compose over 30% of the originally annotated data. We argue that the removal of such data is a problematic trend because, when performing real-time sentiment classification of short-text, an automated system cannot know a priori which samples would fall into this category of disputed sentiment. We therefore propose the notion of a “complicated” class of sentiment to categorize such text, and argue that its inclusion in the short-text sentiment analysis framework will improve the quality of automated sentiment analysis systems as they are implemented in real-world settings. We motivate this argument by building and analyzing a new publicly available TSA dataset of over 7,000 tweets annotated with 5x coverage, named MTSA. Our analysis of classifier performance over our dataset offers insights into sentiment analysis dataset and model design, how current techniques would perform in the real world, and how researchers should handle difficult data.
- Research Article
10
- 10.1080/09540091.2023.2189119
- Apr 3, 2023
- Connection Science
Aspect-based sentiment analysis (ABSA) aims to automatically identify the sentiment polarity of specific aspect words in a given sentence or document. Existing studies have recognised the value of interactive learning in ABSA and have developed various methods to precisely model aspect words and their contexts through interactive learning. However, these methods mostly model aspect words and their contexts through shallow interaction, which may lose complex sentiment information. To solve this issue, we propose a Lightweight Multilayer Interactive Attention Network (LMIAN) for ABSA. Specifically, we first employ a pre-trained language model to initialise word embedding vectors. Second, an interactive computational layer is designed to build correlations between aspect words and their contexts. The degree of correlation is calculated by multiple computational layers with neural attention models. Third, we use a parameter-sharing strategy among the computational layers. This allows the model to learn complex sentiment features with lower memory costs. Finally, LMIAN is validated on six publicly available sentiment analysis datasets. Extensive experiments show that LMIAN performs better than other advanced methods with relatively low memory consumption.
- Research Article
2
- 10.52783/jes.1507
- Apr 4, 2024
- Journal of Electrical Systems
Sentiment analysis (SA) is a technique that applies natural language processing (NLP) to analyze and classify the emotion expressed in reviews. SA analyzes people's feelings, opinions, and experiences shared through the Internet and social networks. In this paper, we focus on investigating, evaluating, and improving Arabic sentiment analysis (ASA) models, datasets, and challenges. ASA faces several difficulties, such as the language's morphological features, its many dialects, the absence of clear and uniform corpora, low accuracy, and limited ASA material. To this end, we carry out a full analysis and evaluation of Arabic sentiment analysis models and datasets that target e-marketing services such as telecommunication, health, and books. We compare our dataset, called Sara, with several Arabic sentiment datasets in terms of brief description, dataset size, source of collected data, field type, and abbreviation. We enhanced our previous models by using ensemble averaging techniques. The accuracy of our enhanced model has increased and now reaches around 97%. We also compare our ASA models, developed using deep learning (DL) algorithms, with other ASA models in the field of e-marketing. Our models show significant performance improvements compared with other works: our three models, CNN-Model, LSTM-Mode2, and CNN+LSTM-Model3, achieve accuracies of 96.83%, 94.74%, and 96.91%, respectively.
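The ensemble averaging mentioned in this abstract is commonly realised as soft voting: the class probabilities predicted by each model are averaged and the highest-scoring class is chosen. The sketch below assumes that form, which the abstract does not confirm; the function name `average_ensemble` and the probability values are hypothetical.

```python
def average_ensemble(prob_lists):
    """Average per-class probabilities from several models (soft voting).

    prob_lists -- one list of class probabilities per model, all the same length
    Returns (predicted_class_index, averaged_probabilities).
    """
    if not prob_lists:
        raise ValueError("need predictions from at least one model")
    n_classes = len(prob_lists[0])
    if any(len(p) != n_classes for p in prob_lists):
        raise ValueError("all models must predict the same number of classes")
    averaged = [
        sum(p[c] for p in prob_lists) / len(prob_lists) for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Hypothetical [negative, positive] probabilities from three models for one review.
predicted, averaged = average_ensemble([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
print(predicted, averaged)
```

Averaging lets two models that agree outvote a third that disagrees, while still weighting each model by how confident it is.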
- Research Article
- 10.12732/ijam.v38i11s.1275
- Nov 10, 2025
- International Journal of Applied Mathematics
The exponential growth of user-generated text data on social media highlights the need for effective sentiment analysis systems that can scale to large textual inputs with high contextual accuracy. Many traditional approaches suffer from noisy data, unsound feature selection, and a lack of scalability, which leads to ineffective sentiment classification. This study introduces a new framework for aspect-based sentiment analysis (ABSA) that combines advanced techniques for preprocessing, feature extraction, feature selection, and classification. The methodology involves robust preprocessing, including tokenization, lexical normalization, and punctuation removal, to ensure a clean input for further processing. Feature extraction is performed using pretrained embeddings, namely robustly optimized bidirectional encoder representations from transformers (RoBERTa) and Global Vectors for Word Representation (GloVe), capturing both contextual and word-level relationships. Feature selection employs a hybrid of the Arithmetic Optimization Algorithm (AOA) and Henry Gas Solubility Optimization (HGS), refined by hierarchical attention mechanisms, to retain relevant features while reducing dimensionality. The classification phase uses a model abbreviated ARGCN, which integrates attention mechanisms and capsule networks to provide refined sentiment predictions. The experimental findings on Sentiment Analysis Dataset 1 show 99.85% accuracy, 99.80% precision, 99.88% recall, 99.83% F1 score, and 99.90% specificity. Similar, slightly higher results were obtained on Dataset 2: an accuracy of 99.89%, precision of 99.87%, recall of 99.90%, F1 score of 99.88%, and specificity of 99.92%.
The authors conclude that these results demonstrate the robustness and scalability of the proposed system and that it improves on state-of-the-art methods, setting a new benchmark for sentiment classification tasks.
- Research Article
21
- 10.1016/j.jksuci.2023.101691
- Aug 3, 2023
- Journal of King Saud University - Computer and Information Sciences
Ensemble Stacking Model for Sentiment Analysis of Emirati and Arabic Dialects
- Research Article
11
- 10.1016/j.dib.2023.109452
- Jul 26, 2023
- Data in Brief
Regional languages are being used more frequently on online platforms as a result of the expanding use of digital technology. Understanding user opinions on social media platforms, forums, blogs, and other digital platforms that employ Indian regional languages has become significant due to their role in various applications. Research on sentiment analysis of Indian regional language texts suffers from the lack of available regional language datasets. The curated Malayalam Aspect Based Sentiment Analysis (MABSA) dataset is a labeled dataset for Aspect Based Sentiment Analysis (ABSA) in the Indian regional language Malayalam, covering the movie review domain. Malayalam movie reviews, an excellent source of text data for ABSA, were collected through an online survey using Google Forms and by manually collecting reviews from three platforms: IMDb, Facebook, and YouTube. Nine target aspects were identified, and three annotators annotated the dataset based on the sentiment polarity of each aspect. A total of 4000 reviews were collected, and a total of 7507 aspects were identified in the reviews. Spearman's correlation and the Fleiss Kappa Index are used to analyze agreement among the annotators. The high correlation between the annotators implies that the MABSA dataset is of gold-standard quality.
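For readers unfamiliar with the Fleiss Kappa Index mentioned here, the following is a minimal sketch of the standard formula, applied to a made-up example with three annotators. The function name and the toy labels are illustrative only and are not taken from the MABSA dataset.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings -- list of lists; ratings[i] holds the category each rater gave item i
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / len(ratings)  # mean per-item agreement
    total = len(ratings) * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())  # chance
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1.0 - expected)


# Three annotators label the polarity of four aspect mentions (toy data).
toy_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neu", "neu", "neu"],
]
print(round(fleiss_kappa(toy_ratings), 4))  # 0.7447
```

Values close to 1 indicate agreement well above chance, which is the sense in which high inter-annotator agreement supports the quality of the labels.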
- Research Article
1
- 10.1016/j.dib.2025.112073
- Sep 19, 2025
- Data in Brief
MADTRAS: Dataset for aspect-based sentiment analysis of movie reviews in Tamil
- Research Article
99
- 10.3390/data3020015
- May 4, 2018
- Data
With the extensive growth of user interactions through prominent advances of the Web, sentiment analysis has received more attention from both academic and commercial points of view. Recently, sentiment analysis in the Bangla language has increasingly been considered an important task, for which previous approaches have attempted to detect the overall polarity of a Bangla document. To the best of our knowledge, there is no research on the aspect-based sentiment analysis (ABSA) of Bangla text. This can be attributed to the lack of available datasets for ABSA. In this paper, we provide two publicly available datasets to perform the ABSA task in Bangla. One of the datasets consists of human-annotated user comments on cricket, and the other dataset consists of customer reviews of restaurants. We also describe a baseline approach for the subtask of aspect category extraction to evaluate our datasets.
- Research Article
15
- 10.1016/j.jss.2022.111448
- Jul 21, 2022
- Journal of Systems and Software
On the subjectivity of emotions in software projects: How reliable are pre-labeled data sets for sentiment analysis?
- Conference Article
2
- 10.1145/3700706.3700729
- Aug 14, 2024
Optimization is a crucial process in training neural networks. Its goal is to find a set of parameters that offers the best performance for a specific problem. With the advent of deep learning models, which are primarily based on neural networks, interesting results have been achieved in NLP tasks, specifically in text classification. Optimizing the parameters of these models is crucial to improving performance and reducing training time. Several optimization algorithms have been developed, such as those based on adaptive techniques, which have proven their effectiveness by providing superior results compared to other algorithms. In this work, a multi-layer Convolutional Neural Network (CNN) model was developed. Two Arabic datasets were used in this study: the first comprises ten different categories of text extracted from news platforms, and the second is a sentiment analysis dataset containing Arabic tweets of two polarities, positive and negative. Many optimizers exist today; in this study, we chose seven popular deep learning optimizers, namely AdamW, Adamax, Lion, Nadam, Adam, Adam-Weight-Decay, and RMSProp. The experimental outcomes demonstrate that the Adam and AdamW optimization algorithms delivered superior performance compared with the other optimizers on the task of topic classification. For sentiment analysis, on the other hand, the Nadam optimizer surpassed the other optimizers. This indicates that the choice of optimizer can improve performance and that the optimal choice may vary depending on the task.
- Research Article
229
- 10.3390/app11093986
- Apr 28, 2021
- Applied Sciences
In the last decade, sentiment analysis has been widely applied in many domains, including business, social networks and education. Particularly in the education domain, where processing students' opinions is a complicated task due to the nature of the language used by students and the large volume of information, the application of sentiment analysis is growing yet remains challenging. Several literature reviews reveal the state of the application of sentiment analysis in this domain from different perspectives and contexts. However, the body of literature lacks a review that systematically classifies the research and results of the application of natural language processing (NLP), deep learning (DL), and machine learning (ML) solutions for sentiment analysis in the education domain. In this article, we present the results of a systematic mapping study to structure the published information available. We used a stepwise PRISMA framework to guide the search process and searched the electronic research databases of the scientific literature for studies conducted between 2015 and 2020. We identified 92 relevant studies out of 612 that were initially found on the sentiment analysis of students' feedback in learning platform environments. The mapping results showed that, despite the identified challenges, the field is rapidly growing, especially regarding the application of DL, which is the most recent trend. We identified various aspects that need to be considered in order to contribute to the maturity of research and development in the field. Among these aspects, we highlighted the need for structured datasets, standardized solutions, and increased focus on emotional expression and detection.