AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP

Similar Papers
  • Research Article
  • 10.1075/ml.24027.deg
NLP and education
  • Dec 31, 2024
  • The Mental Lexicon
  • Túlio Sousa De Gois + 3 more

This study examines the applicability of the Cloze test, a widely used tool for assessing text comprehension proficiency, while highlighting its challenges in large-scale implementation. To address these limitations, an automated correction approach was proposed, utilizing Natural Language Processing (NLP) techniques, particularly word embeddings (WE) models, to assess semantic similarity between expected and provided answers. Using data from Cloze tests administered to students in Brazil, WE models for Brazilian Portuguese (PT-BR) were employed to measure the semantic similarity of the responses. The results were validated through an experimental setup involving twelve judges who classified the students’ answers. A comparative analysis between the WE models’ scores and the judges’ evaluations revealed that GloVe was the most effective model, demonstrating the highest correlation with the judges’ assessments. This study underscores the utility of WE models in evaluating semantic similarity and their potential to enhance large-scale Cloze test assessments. Furthermore, it contributes to educational assessment methodologies by offering a more efficient approach to evaluating reading proficiency.
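The core of the automated correction approach described above, accepting a Cloze answer when its embedding is close enough to the expected word's, can be sketched as follows. The toy 3-dimensional Portuguese vectors and the 0.9 threshold are invented for illustration; the study used full pre-trained PT-BR models such as GloVe.

```python
import math

# Toy 3-dimensional embeddings standing in for a pre-trained PT-BR model;
# all vectors here are hypothetical.
embeddings = {
    "casa":  [0.9, 0.1, 0.2],   # "house"
    "lar":   [0.8, 0.2, 0.3],   # "home" -- near-synonym of "casa"
    "carro": [0.1, 0.9, 0.4],   # "car"  -- unrelated
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score_answer(expected, provided, threshold=0.9):
    """Accept a Cloze answer when its embedding is close enough to the expected word."""
    sim = cosine(embeddings[expected], embeddings[provided])
    return sim, sim >= threshold

sim_syn, ok_syn = score_answer("casa", "lar")    # near-synonym: accepted
sim_bad, ok_bad = score_answer("casa", "carro")  # unrelated word: rejected
```

In practice the vectors would be looked up in a pre-trained model, and the acceptance threshold would be calibrated against human judges, as the study does.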

  • Research Article
  • Cited by 73
  • 10.1016/j.engappai.2019.07.010
A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art
  • Aug 1, 2019
  • Engineering Applications of Artificial Intelligence
  • Juan J Lastra-Díaz + 5 more

  • Research Article
  • Cited by 8
  • 10.1016/j.jss.2024.111961
An empirical assessment of different word embedding and deep learning models for bug assignment
  • Jan 6, 2024
  • Journal of Systems and Software
  • Rongcun Wang + 5 more

  • Research Article
  • Cited by 1
  • 10.5075/epfl-thesis-7148
Word Embeddings for Natural Language Processing
  • Jan 1, 2016
  • Rémi Lebret

Word embedding is a feature learning technique which aims at mapping words from a vocabulary into vectors of real numbers in a low-dimensional space. By leveraging large corpora of unlabeled text, such continuous space representations can be computed to capture both syntactic and semantic information about words. Word embeddings, when used as the underlying input representation, have been shown to be a great asset for a large variety of natural language processing (NLP) tasks. Recent techniques to obtain such word embeddings are mostly based on neural network language models (NNLM). In such systems, the word vectors are randomly initialized and then trained to predict optimally the contexts in which the corresponding words tend to appear. Because words occurring in similar contexts have, in general, similar meanings, their resulting word embeddings are semantically close after training. However, such architectures might be challenging and time-consuming to train. In this thesis, we focus on building simple models which are fast and efficient on large-scale datasets. As a result, we propose a model based on counts for computing word embeddings. A word co-occurrence probability matrix can easily be obtained by directly counting the context words surrounding the vocabulary words in a large corpus of texts. The computation can then be drastically simplified by performing a Hellinger PCA of this matrix. Besides being simple, fast and intuitive, this method has two other advantages over NNLM. First, it provides a framework to infer unseen words or phrases. Secondly, all embedding dimensions can be obtained after a single Hellinger PCA, while a new training is required for each new size with NNLM. We evaluate our word embeddings on classical word tagging tasks and show that we reach performance similar to that of neural-network-based word embeddings. While many techniques exist for computing word embeddings, vector space models for phrases remain a challenge.
Still based on the idea of proposing simple and practical tools for NLP, we introduce a novel model that jointly learns word embeddings and their summation. Sequences of words (i.e. phrases) with different sizes are thus embedded in the same semantic space by simply averaging word embeddings. In contrast to previous methods, which reported some compositionality aspects a posteriori by simple summation, we simultaneously train words to sum, while keeping the maximum information from the original vectors. These word and phrase embeddings are then used in two different NLP tasks: document classification and sentence generation. Using such word embeddings as inputs, we show that good performance is achieved in sentiment classification of short and long text documents with a convolutional neural network. Finding good compact representations of text documents is crucial in classification systems. Based on the summation of word embeddings, we introduce a method to represent documents in a low-dimensional semantic space. This simple operation, along with a clustering method, provides an efficient framework for adding semantic information to documents, which yields better results than classical approaches for classification. Simple models for sentence generation can also be designed by leveraging such phrase embeddings. We propose a phrase-based model for image captioning which achieves results similar to those obtained with more complex models. Not only word and phrase embeddings but also embeddings for non-textual elements can be helpful for sentence generation. We therefore explore embedding table elements to generate better sentences from structured data. We experiment with this approach on a large-scale dataset of biographies, where biographical infoboxes were available. By parameterizing both words and fields as vectors (embeddings), we significantly outperform a classical model.
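The count-based construction the thesis describes, counting co-occurrences, forming a probability matrix, and performing a Hellinger PCA, can be sketched in a few lines of NumPy. The 4-word vocabulary and the co-occurrence counts below are invented for illustration.

```python
import numpy as np

# Hypothetical 4-word vocabulary with raw context co-occurrence counts
# (rows: target words, columns: context words). Words 0-1 and 2-3 were
# chosen to have similar context distributions.
counts = np.array([
    [10.,  2.,  0.,  1.],
    [ 9.,  3.,  1.,  0.],
    [ 0.,  1.,  8.,  7.],
    [ 1.,  0.,  9.,  6.],
])

# Row-normalize to co-occurrence probabilities P(context | word).
probs = counts / counts.sum(axis=1, keepdims=True)

# Hellinger PCA: element-wise square root, then centre and decompose.
# Euclidean distance between rows of `root` is (up to a constant) the
# Hellinger distance between the corresponding probability distributions.
root = np.sqrt(probs)
root -= root.mean(axis=0)            # centre the data for PCA
U, S, Vt = np.linalg.svd(root, full_matrices=False)

# One decomposition yields every embedding size at once: slicing more or
# fewer columns of U gives larger or smaller embeddings, as the abstract notes.
k = 2
embeddings = U[:, :k] * S[:k]        # one k-dimensional vector per word

# A phrase can then be embedded by simply averaging its word vectors,
# the idea behind the joint word/phrase model described above.
phrase = embeddings[:2].mean(axis=0)
```

The assertion-worthy property is that words with similar context distributions (rows 0 and 1) end up closer in the embedding space than unrelated words (rows 0 and 2).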

  • Research Article
  • Cited by 5
  • 10.17485/ijst/2016/v9i45/106478
Word Embedding Models for Finding Semantic Relationship between Words in Tamil Language
  • Dec 7, 2016
  • Indian Journal of Science and Technology
  • S G Ajay + 3 more

Objective: Word embedding models are predominantly used in many NLP tasks such as document classification, author identification, and story understanding. In this paper we compare two word embedding models for semantic similarity in the Tamil language. Each of the two models has its own way of predicting relationships between words in a corpus. Method/Analysis: The term word embedding in Natural Language Processing refers to a representation of words as vectors. Word embedding is used as an unsupervised approach instead of the traditional way of feature extraction. Word embedding models use neural networks to generate numerical representations for the given words. To find the best model for capturing semantic relationships between words, a morphologically rich language like Tamil is a demanding test case. Tamil is one of the oldest Dravidian languages and is known for its morphological richness; in Tamil it is possible to construct 10,000 words from a single root word. Findings: Here we compare content-based and context-based word embedding models. We tried different feature vector sizes for the same word to comment on the accuracy of the models for semantic similarity. Novelty/Improvement: Analysing word embedding models for a morphologically rich language like Tamil helps us classify words better based on their semantics. Keywords: CBOW, Content based Word Embedding, Context based Word Embedding, Morphology, Semantic and Syntactic, Skip Gram

  • Research Article
  • Cited by 135
  • 10.1007/s10462-023-10419-1
Impact of word embedding models on text analytics in deep learning environment: a review.
  • Feb 22, 2023
  • Artificial Intelligence Review
  • Deepak Suresh Asudani + 2 more

The selection of word embedding and deep learning models for better outcomes is vital. Word embeddings are an n-dimensional distributed representation of a text that attempts to capture the meanings of the words. Deep learning models utilize multiple computing layers to learn hierarchical representations of data. The word embedding technique represented by deep learning has received much attention. It is used in various natural language processing (NLP) applications, such as text classification, sentiment analysis, named entity recognition, topic modeling, etc. This paper reviews the representative methods of the most prominent word embedding and deep learning models. It presents an overview of recent research trends in NLP and a detailed understanding of how to use these models to achieve efficient results on text analytics tasks. The review summarizes, contrasts, and compares numerous word embedding and deep learning models and includes a list of prominent datasets, tools, APIs, and popular publications. A reference for selecting a suitable word embedding and deep learning approach is presented based on a comparative analysis of different techniques to perform text analytics tasks. This paper can serve as a quick reference for learning the basics, benefits, and challenges of various word representation approaches and deep learning models, with their application to text analytics and a future outlook on research. It can be concluded from the findings of this study that domain-specific word embedding and the long short-term memory model can be employed to improve overall text analytics task performance.

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-15-0058-9_42
Word Embedding for Small and Domain-specific Malay Corpus
  • Jan 1, 2020
  • Sabrina Tiun + 3 more

In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, the Hansard corpus of the Malaysian Parliament for specific years was trained with the Word2vec model. However, a specific setting of the hyperparameters is required to obtain an accurate WE model, because changing one of the hyperparameters would affect the model's performance. We therefore trained a series of WE models on the corpus, each with one hyperparameter value changed from the others. The models' performance was intrinsically evaluated using three semantic word relations, namely word similarity, dissimilarity, and analogy. The evaluation was performed on the model output and analysed by experts (corpus linguists). The experts' evaluation of the small, domain-specific corpus showed that suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100, and the Skip-gram architecture.

  • Research Article
  • Cited by 5
  • 10.34028/iajit/21/2/13
Word Embedding as a Semantic Feature Extraction Technique in Arabic Natural Language Processing: An Overview
  • Jan 1, 2024
  • The International Arab Journal of Information Technology
  • Ghizlane Bourahouat + 2 more

Feature extraction has transformed the field of Natural Language Processing (NLP) by providing an effective way to represent linguistic features. Various techniques are utilised for feature extraction, such as word embedding. This latter has emerged as a powerful technique for semantic feature extraction in Arabic Natural Language Processing (ANLP). Notably, research on feature extraction in the Arabic language remains relatively limited compared to English. In this paper, we present a review of recent studies focusing on word embedding as a semantic feature extraction technique applied in Arabic NLP. The review primarily includes studies on word embedding techniques applied to the Arabic corpus. We collected and analysed a selection of journal papers published between 2018 and 2023 in this field. Through our analysis, we categorised the different feature extraction techniques, identified the Machine Learning (ML) and/or Deep Learning (DL) algorithms employed, and assessed the performance metrics utilised in these studies. We demonstrate the superiority of word embeddings as a semantic feature representation in ANLP. We compare their performance with other feature extraction techniques, highlighting the ability of word embeddings to capture semantic similarities, detect contextual associations, and facilitate a better understanding of Arabic text. Consequently, this article provides valuable insights into the current state of research in word embedding for Arabic NLP.

  • Research Article
  • Cited by 1
  • 10.1016/j.procs.2024.10.176
Arabic Idioms Detection by Utilizing Deep Learning and Transformer-based Models
  • Jan 1, 2024
  • Procedia Computer Science
  • Hanen Himdi

  • Research Article
  • Cited by 5
  • 10.14419/ijet.v10i2.31538
BnVec: Towards the Development of Word Embedding for Bangla Language Processing
  • May 24, 2021
  • International Journal of Engineering & Technology
  • Md Kowsher + 6 more

Progress in machine learning and statistical inference is facilitating the advancement of domains like computer vision, natural language processing (NLP), automation & robotics, and so on. Among the different persuasive improvements in NLP, word embedding is one of the most used and revolutionary techniques. In this paper, we present an open-source library for Bangla word embedding named BnVec, which aims to furnish the Bangla NLP research community with several powerful word embedding techniques. BnVec is split into two parts: the first is a class tailored to Bangla for embedding words, with access to the six most popular word embedding schemes (CountVectorizer, TF-IDF, Hash Vectorizer, Word2vec, fastText, and GloVe); the second is based on pre-trained distributed word embeddings from Word2vec, fastText, and GloVe. The pre-trained models were built by collecting content from newspapers, social media, and Bangla wiki articles. The total number of tokens used to build the models exceeds 395,289,960. The paper additionally reports the performance of these models under various hyperparameter settings and analyzes the results.
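The first, count-based half of such a toolkit (CountVectorizer, TF-IDF, Hash Vectorizer) can be approximated with scikit-learn's text vectorizers. The two transliterated documents are invented for illustration; BnVec's own Bangla-specific class is not used here.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Toy documents standing in for Bangla text (transliterated here for
# illustration; BnVec itself works on Bangla script).
docs = [
    "bangla bhasha amar bhasha",
    "bangla gaan amar priyo",
]

# The three count-based schemes; the distributed schemes (Word2vec,
# fastText, GloVe) instead train or load separate embedding models.
count = CountVectorizer().fit_transform(docs)          # raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)          # TF-IDF weighted
hashed = HashingVectorizer(n_features=16).fit_transform(docs)  # fixed-width hash
```

Each call returns a sparse document-term matrix with one row per document; the hashing variant trades an interpretable vocabulary for a fixed output width.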

  • Book Chapter
  • Cited by 3
  • 10.1007/978-981-15-8335-3_21
Comparison of Various Word Embeddings for Hate-Speech Detection
  • Jan 1, 2021
  • Minni Jain + 3 more

Word embedding plays a crucial role in natural language processing and other related domains. The vast variety of language modelling and feature learning techniques often leaves practitioners in a quandary. The motivation behind this work was to produce a comparative analysis of these methods and finally use them to flag hate speech on social media. The progress in these word embedding techniques has led to remarkable results across various natural language applications. Understanding the different contexts of polysemous words is one of the capabilities that evolved over time with these word embedding models. A systematic review of varying word embedding methodologies is performed in this paper. Various experimental metrics have been used and detailed analysis has been done on each word embedding model. The analysis covers various aspects of each model, such as handling multi-sense words and rarely occurring words, and a coherent analysis report is presented. The models under analysis are Word2Vec (Skip-Gram, CBOW), GloVe, fastText, and ELMo. These models are then put to a real-life application in the form of hate-speech detection on Twitter data, and their individual capacities and accuracies are compared. Through this paper we show how ELMo uses different word embeddings for polysemous words to capture context, and how hate speech can be better detected by ELMo because such speech requires a better understanding of word context to be segregated from normal speech/text. Keywords: Word2Vec, Skip-gram, CBOW, GloVe, fastText, ELMo, Word embedding, Hate-speech detection

  • Conference Article
  • Cited by 2
  • 10.1145/3395032.3395325
FacetE
  • Jun 19, 2020
  • Michael Günther + 3 more

Today's natural language processing and information retrieval systems heavily depend on word embedding techniques to represent text values. However, given a specific task, deciding on a word embedding dataset is not trivial. Current word embedding evaluation methods mostly provide only a one-dimensional quality measure, which does not express how knowledge from different domains is represented in the word embedding models. To overcome this limitation, we provide a new evaluation data set called FacetE, derived from 125M Web tables, enabling domain-sensitive evaluation. We show that FacetE can effectively be used to evaluate word embedding models. The evaluation of common general-purpose word embedding models suggests that there is currently no best word embedding for every domain.

  • Research Article
  • Cited by 7
  • 10.7717/peerj-cs.384
Comparing general and specialized word embeddings for biomedical named entity recognition
  • Feb 18, 2021
  • PeerJ Computer Science
  • Rigo E Ramos-Vargas + 2 more

Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as unique features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. 
The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards, while the fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding, GloVe Common Crawl, performed better on the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC; for the contextualized word embeddings, the best results came from the specific versions. We conclude, therefore, that when using classic word embeddings as features on the BioNER task, the general ones can be considered a good option, whereas when using contextualized word embeddings, the specific ones are the better choice.
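One of the selection criteria this study examines, word coverage, is simple to compute: the fraction of corpus tokens present in an embedding's vocabulary. The vocabularies and corpus below are tiny hypothetical stand-ins; note that the study found the general GloVe embedding performed well on DrugBank despite having lower coverage.

```python
# Hypothetical vocabularies standing in for a general embedding
# (e.g. trained on Common Crawl) and a specific biomedical one.
general_vocab = {"patient", "treatment", "the", "of", "disease"}
specific_vocab = {"patient", "ibuprofen", "cytokine", "treatment"}

# Hypothetical tokenized corpus in the biomedical domain.
corpus_tokens = ["patient", "received", "ibuprofen", "cytokine", "therapy"]

def coverage(vocab, tokens):
    """Fraction of corpus tokens covered by the embedding vocabulary."""
    covered = sum(1 for t in tokens if t in vocab)
    return covered / len(tokens)

gen_cov = coverage(general_vocab, corpus_tokens)    # 1 of 5 tokens covered
spec_cov = coverage(specific_vocab, corpus_tokens)  # 3 of 5 tokens covered
```

As the results above suggest, coverage alone is not decisive: the general embedding can still win on the downstream BioNER task, so coverage should be one signal among several when choosing an embedding.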
