Abstract
Multi-task learning makes it possible to train a machine-learning algorithm to learn multiple related tasks instead of training it to solve a single task. In this work, we propose an algorithm for estimating textual similarity scores and then using these scores in multiple tasks such as text ranking, essay grading, and question answering systems. We used several vectorization schemes to represent the Arabic texts in the SemEval2017-task3-subtask-D dataset. These schemes include lexical-based similarity features, frequency-based features, and pre-trained model-based features. We also used contextual embedding models such as Arabic Bidirectional Encoder Representations from Transformers (AraBERT). We used the AraBERT model in two variants. First, as a feature extractor whose features complement those of the text vectorization schemes; we fed the combined features to various regression models to predict a relevancy score between Arabic text units. Second, AraBERT was adopted as a pre-trained model whose parameters were fine-tuned to estimate the relevancy scores between Arabic sentences. To evaluate the results, we conducted several experiments comparing the two variants. In terms of Mean Absolute Percentage Error (MAPE), the results show only minor variance between AraBERT v0.2 as a feature extractor (21.7723) and the fine-tuned AraBERT v2 (21.8211). On the other hand, AraBERT v0.2-Large as a feature extractor outperforms the fine-tuned AraBERT v2 model on the used dataset in terms of the coefficient of determination (R²) values (0.014050 and −0.032861, respectively).
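As a minimal sketch of the first variant (assuming the Hugging Face transformers and scikit-learn libraries; the checkpoint name, [CLS] pooling, embedding concatenation, and toy data are illustrative assumptions rather than the paper's exact configuration), the feature-extractor pipeline could look like:

```python
# Sketch: AraBERT as a frozen feature extractor feeding a classical regressor.
# Checkpoint name, pooling choice, and toy data are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import AdaBoostRegressor

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed AraBERT v0.2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Return the [CLS] token embedding as a fixed-size sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

def pair_features(text1: str, text2: str):
    """Concatenate both sentence embeddings into one feature row."""
    return torch.cat([embed(text1), embed(text2)]).numpy()

# Toy sentence pairs with gold relevancy scores (illustrative only).
train_pairs = [
    ("ما عاصمة فرنسا؟", "باريس هي عاصمة فرنسا."),  # relevant answer
    ("ما عاصمة فرنسا؟", "القطط حيوانات أليفة."),    # irrelevant answer
]
train_scores = [1.0, 0.0]

X = [pair_features(q, a) for q, a in train_pairs]
regressor = AdaBoostRegressor(n_estimators=50, random_state=0)
regressor.fit(X, train_scores)  # regressor now predicts a relevancy score per pair
```

The fine-tuned variant would instead attach a regression head to the same checkpoint and update all parameters on the gold scores, e.g. via AutoModelForSequenceClassification with num_labels=1.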
Highlights
Textual similarity is a critical topic in Natural Language Processing (NLP)
We conclude that Arabic Bidirectional Encoder Representations from Transformers (AraBERT) v0.2-Large as a feature extractor, combined with AdaBoost, achieves the highest R² value, and that the variance in Mean Absolute Percentage Error (MAPE) between it and the other models is minor
AraBERT v0.2-Large as a feature extractor outperforms the fine-tuned AraBERT v2 model on the used dataset in terms of R² (both metrics are defined below)
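For reference, both metrics follow their standard definitions, with $y_i$ the gold relevancy score, $\hat{y}_i$ the predicted score, and $\bar{y}$ the mean gold score over $n$ pairs:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

A negative $R^2$, such as the fine-tuned model's −0.032861, means the predictions fit the data worse than simply predicting the mean score.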
Summary
The frequency-based word-embedding approach is the traditional text-modeling method, based on the Bag-of-Words (BOW) representation. It includes One-Hot Encoding (OHE), Hashing Vectorization, Part-of-Speech (POS) Weighting [5], Word Counts, Term Frequency-Inverse Document Frequency (TF-IDF) [4], and N-grams [6]. Although these vectorization techniques work well, they fail to capture the semantic relations between words or the meaning of a text: they do not consider the context in which a word appears, the relations between multiple words, or the overall meaning of the sentences within the text, as the sketch below illustrates.
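As a minimal illustration of this limitation (assuming scikit-learn; the English toy sentences are illustrative only, and the same point holds for Arabic), TF-IDF similarity tracks word overlap rather than meaning:

```python
# Sketch: TF-IDF cosine similarity follows surface word overlap, not meaning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the movie was great",      # reference sentence
    "the film was excellent",   # same meaning, different content words
    "the movie was terrible",   # opposite meaning, shared content words
]
vectors = TfidfVectorizer().fit_transform(docs)

# The paraphrase scores LOWER than the contradiction, because TF-IDF only
# counts shared word forms and ignores semantics and context.
print(cosine_similarity(vectors[0], vectors[1])[0, 0])  # low: few shared words
print(cosine_similarity(vectors[0], vectors[2])[0, 0])  # higher: word overlap
```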