Abstract
Multi-task learning makes it possible to train a machine-learning algorithm to learn multiple related tasks instead of training it to solve a single task. In this work, we propose an algorithm for estimating textual similarity scores and then using these scores in multiple tasks such as text ranking, essay grading, and question answering systems. We used several vectorization schemes to represent the Arabic texts in the SemEval2017-task3-subtask-D dataset. These schemes include lexical-based similarity features, frequency-based features, and pre-trained model-based features. We also used contextual embedding models such as Arabic Bidirectional Encoder Representations from Transformers (AraBERT). We used the AraBERT model in two variants. First, as a feature extractor whose features complement those of the text vectorization schemes; we fed the combined features to various regression models to predict a relevancy score between Arabic text units. Second, AraBERT was adopted as a pre-trained model whose parameters were fine-tuned to estimate the relevancy scores between Arabic sentences. To evaluate the results, we conducted several experiments comparing the two variants. In terms of Mean Absolute Percentage Error (MAPE), the results show only minor variance between AraBERT v0.2 as a feature extractor (21.7723) and the fine-tuned AraBERT v2 (21.8211). On the other hand, AraBERT v0.2-Large as a feature extractor outperforms the fine-tuned AraBERT v2 model on the used dataset in terms of the coefficient of determination (R²) values (0.014050 and −0.032861, respectively).
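As a minimal sketch of the first variant (assuming the Hugging Face transformers and scikit-learn libraries; the checkpoint name, [CLS] pooling, embedding concatenation, and toy data are illustrative assumptions rather than the paper's exact configuration), the feature-extractor pipeline could look like:

```python
# Sketch: AraBERT as a frozen feature extractor feeding a classical regressor.
# Checkpoint name, pooling choice, and toy data are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import AdaBoostRegressor

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed AraBERT v0.2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Return the [CLS] token embedding as a fixed-size sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

def pair_features(text1: str, text2: str):
    """Concatenate both sentence embeddings into one feature row."""
    return torch.cat([embed(text1), embed(text2)]).numpy()

# Toy sentence pairs with gold relevancy scores (illustrative only).
train_pairs = [
    ("ما عاصمة فرنسا؟", "باريس هي عاصمة فرنسا."),  # relevant answer
    ("ما عاصمة فرنسا؟", "القطط حيوانات أليفة."),    # irrelevant answer
]
train_scores = [1.0, 0.0]

X = [pair_features(q, a) for q, a in train_pairs]
regressor = AdaBoostRegressor(n_estimators=50, random_state=0)
regressor.fit(X, train_scores)  # regressor now predicts a relevancy score per pair
```

The fine-tuned variant would instead attach a regression head to the same checkpoint and update all parameters on the gold scores, e.g. via AutoModelForSequenceClassification with num_labels=1.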
Highlights
Textual similarity is a critical topic in Natural Language Processing (NLP)
We conclude that Arabic Bidirectional Encoder Representations from Transformers (AraBERT) v0.2-Large as a feature extractor, combined with AdaBoost, achieves the highest R² value, and that the variance in Mean Absolute Percentage Error (MAPE) between it and the other models is minor
AraBERT v0.2-Large as a feature extractor outperforms the fine-tuned AraBERT v2 model on the used dataset in terms of R² (both metrics are defined below)
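For reference, both metrics follow their standard definitions, with $y_i$ the gold relevancy score, $\hat{y}_i$ the predicted score, and $\bar{y}$ the mean gold score over $n$ pairs:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

A negative $R^2$, such as the fine-tuned model's −0.032861, means the predictions fit the data worse than simply predicting the mean score.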
Summary
The frequency-based word-embedding approach is the traditional text-modeling method, based on the Bag-of-Words (BOW) representation. It includes One-Hot Encoding (OHE), Hashing Vectorization, Part-of-Speech (POS) Weighting [5], Word Counts, Term Frequency-Inverse Document Frequency (TF-IDF) [4], and N-grams [6]. Although these vectorization techniques work well, they fail to capture the semantic relations between words or the meaning of a text: they do not consider the context in which a word appears, the relations between multiple words, or the overall meaning of the sentences within the text, as the sketch below illustrates.
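As a minimal illustration of this limitation (assuming scikit-learn; the English toy sentences are illustrative only, and the same point holds for Arabic), TF-IDF similarity tracks word overlap rather than meaning:

```python
# Sketch: TF-IDF cosine similarity follows surface word overlap, not meaning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the movie was great",      # reference sentence
    "the film was excellent",   # same meaning, different content words
    "the movie was terrible",   # opposite meaning, shared content words
]
vectors = TfidfVectorizer().fit_transform(docs)

# The paraphrase scores LOWER than the contradiction, because TF-IDF only
# counts shared word forms and ignores semantics and context.
print(cosine_similarity(vectors[0], vectors[1])[0, 0])  # low: few shared words
print(cosine_similarity(vectors[0], vectors[2])[0, 0])  # higher: word overlap
```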