Tashaphyne0.4: a new Arabic light stemmer based on rhyzome modeling approach
Stemming algorithms are crucial tools for enhancing the information retrieval process in natural language processing. This paper presents a novel Arabic light stemming algorithm called Tashaphyne0.4. The idea behind this algorithm is to extract the most precise 'roots' and 'stems' from the words of an Arabic text. Thus, the proposed algorithm acts as a rooter, a stemmer, and a segmentation tool at the same time. Our approach involves three phases (i.e., Preparation, Stems-Extractor, and Root-Extractor). Tashaphyne0.4 has shown better results than six other stemmers (i.e., Khoja, ISRI, Motaz/Light10, Tashaphyne0.3, FARASA, and Assem stemmers). The comparison is performed using four different comprehensive Arabic benchmark datasets. In conclusion, our proposed stemmer achieved remarkable results and outperformed other competitive stemmers in extracting 'Roots' and 'Stems'.
- Research Article
12
- 10.7717/peerj-cs.530
- May 14, 2021
- PeerJ Computer Science
Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Several Arabic light and heavy stemmers as well as lemmatization algorithms are used in this study, for a total of 10 algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves better Pearson correlation results. The best results were attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, some stemmers make results worse than those for the original text, namely the Khoja heavy stemmer and the AlKhalil light stemmer.
- Conference Article
2
- 10.1109/ictaacs48474.2019.8988118
- Dec 1, 2019
As is known from the literature, light stemmers produce more under-stemming errors, while root stemmers produce more over-stemming errors. In this investigation, we deal with the Arabic light stemming problem and propose an improvement to the ARLSTem algorithm (i.e., ARLSTem v1.1). In particular, we introduce new rules to correct some under-stemming errors produced by ARLSTem. In addition, we compare the new version of ARLSTem with five existing stemming algorithms using the ARASTEM corpus. The corpus itself has been corrected, since we found errors in seven samples. The experimental results showed that ARLSTem v1.1 outperforms the other existing algorithms in terms of under-stemming and over-stemming errors. Moreover, it performs well in the Arabic text categorization task.
- Conference Article
10
- 10.1109/icriis.2013.6716682
- Nov 1, 2013
According to the desired level of word analysis, Arabic stemming algorithms can be classified into stem-based (light stemming) algorithms and root-based algorithms. Light stemming algorithms only remove prefixes and suffixes from words, while root-based algorithms remove prefixes, suffixes and infixes. There are several light stemmers for Arabic (Light1, Light2, Light3, Light8, and Light10). For information retrieval, the Light10 stemmer outperformed the other light stemmers. In this paper, Arabic stemming algorithms are studied and the literature on Arabic stemmers is reviewed. In addition, a new Arabic light stemmer is proposed and implemented. The main step of the light stemmer is removing the prefixes and suffixes of words. Because this step changes the meaning of some words, several additional steps are designed and implemented in the proposed stemmer. The proposed stemmer and the Light10 stemmer were tested on the same Arabic data set, which was developed in this work. The accuracy rate of the Light10 stemmer was 66%, while the accuracy rate of the proposed stemmer was 88.25%. The reasons for the incorrect stemming results of the proposed stemmer are discussed.
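The prefix and suffix removal described above can be sketched in a few lines of Python. This is a minimal illustration, not the stemmer proposed in the paper: the affix lists, the `MIN_STEM_LEN` guard, and the function name `light_stem` are all assumptions chosen for the example, and the input is assumed to be undiacritized text.

```python
# Illustrative affix lists (assumed for this sketch, not taken from the abstract).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 3  # assumed guard so that short words are not stripped to nothing


def light_stem(word: str) -> str:
    """Strip at most one prefix, then strip suffixes repeatedly."""
    # Try longer prefixes first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break

    # Keep removing suffixes while one matches and the remainder is long enough.
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[:-len(suffix)]
                stripped = True
                break
    return word


if __name__ == "__main__":
    for w in ["المكتبة", "والمعلمون", "كتاب"]:
        print(w, "->", light_stem(w))
```

A purely rule-based pass like this cannot tell an affix from a letter that belongs to the word, which is the kind of error the abstract says the additional steps are meant to reduce.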
- Conference Article
15
- 10.1109/snams.2019.8931842
- Oct 1, 2019
Arabic is a derivational language with a deep structure and rich word meaning, and one of its challenges is its dependence on morphology. Arabic Natural Language Processing (ANLP) tools are required to achieve many tasks, such as machine learning. For the text classification task, ANLP tools serve as preprocessing steps. These preprocessing steps include, but are not limited to, stemming, normalization, and stop-word removal. In this work, we collected 2,000 news articles from Arabic online newspapers and classified the data using Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The classification task was conducted to compare three different Arabic light stemmers: P-Stemmer, Khoja Stemmer, and Light10 Stemmer. P-Stemmer outperformed the other two stemmers with both classifiers, reaching an F1-measure of 0.92 with the SVM classifier and 0.90 with the NB classifier.
- Book Chapter
9
- 10.1007/978-3-030-34614-0_4
- Nov 30, 2019
This chapter aims to study the effects of the light stemming technique on feature extraction, where Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) are employed for Arabic document classification. Moreover, feature selection methods such as Chi-square (Chi2), Information Gain (IG), and Singular Value Decomposition (SVD) are used to select the most relevant features. K-Nearest Neighbor (kNN), Logistic Regression (LR), and Support Vector Machine (SVM) classifiers are used to build the classification model. Experiments are conducted using a public data set collected from Arabic websites, namely the BBC Arabic dataset. Experimental results show that SVM outperforms LR and kNN. Furthermore, BoW outperforms TF-IDF when no stemming technique is used. Using a Robust Arabic Light Stemmer (ARLStem) as our main light stemmer shows a positive effect over the baseline when combined with TF-IDF. In the experiment where Chi2 is used as the feature selection technique, SVM resulted in 0.9568% F1-micro using BoW to extract the features from the dataset, with 5000 relevant features selected. In the experiment where IG is used as the feature selection method, SVM achieved 0.9588% F1-micro with BoW and 4000 selected features. Finally, in the experiment where SVD is used as the feature selection technique, SVM reached 0.9569% F1-micro when using BoW with 5000 relevant features selected. The aforementioned experiments report the best results achieved where stemming is not employed.
- Book Chapter
15
- 10.1007/978-1-4020-6046-5_13
- Jan 1, 2007
This chapter presents an adaptation of existing techniques in Arabic morphology by leveraging corpus statistics to make them suitable for Information Retrieval (IR). The adaptation resulted in the development of Sebawai, a shallow Arabic morphological analyzer, and Al-Stem, an Arabic light stemmer. Both were used to produce Arabic index terms for Arabic IR experimentation. Sebawai is concerned with generating possible roots and stems of a given Arabic word along with probability estimates of deriving the word from each of the possible roots. The probability estimates were used as a guide to determine which prefixes and suffixes should be used to build the light stemmer Al-Stem. The use of the Sebawai-generated roots and stems as index terms, along with the stems from Al-Stem, is evaluated in an information retrieval application and the results are compared.
- Research Article
44
- 10.1080/0952813x.2016.1212100
- Jul 27, 2016
- Journal of Experimental & Theoretical Artificial Intelligence
Stemming is the process of transforming a word into its root or stem; hence, it is considered a crucial pre-processing step before tackling any task in natural language processing or information retrieval. However, in the case of the Arabic language, finding an effective stemming algorithm seems to be quite difficult, since Arabic has a specific morphology that is different from many other languages. Although several algorithms in the literature address the Arabic stemming issue, most of them are restricted to a limited number of words, confuse original letters with affixes, and usually employ a dictionary of words or patterns. To address this, we propose the design and implementation of a novel Arabic light stemmer, which is based on some new rules for stripping prefixes, suffixes and infixes in a smart way. To our knowledge, it is the first work dealing with Arabic infixes with regard to their irregular rules. The empirical evaluation was conducted on a new Arabic data set (called ARASTEM), which was conceived and collected from several Arabic discussion forums containing dialectal Arabic and modern pseudo-Arabic languages. We then present a comparative investigation between our new stemmer and other existing stemmers using Paice’s parameters, namely: Under Stemming Index (UI), Over Stemming Index (OI) and Stemming Weight (SW). Results show that the proposed Arabic light stemmer maintains consistently high performance and outperforms several existing light stemmers.
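For readers unfamiliar with Paice's parameters, the sketch below shows one common way to compute UI, OI and SW from a list of concept groups (words that should share a stem) and a stemming function. It follows the usual pair-counting definitions (UI as unachieved merges over desired merges, OI as wrong merges over desired non-merges, SW as OI/UI); the function name `paice_indices`, the toy data, and the choice to return infinity when UI is zero are assumptions of this example, not details from the abstract. Words are assumed to be distinct across groups.

```python
from collections import Counter, defaultdict


def paice_indices(concept_groups, stem):
    """Return (UI, OI, SW) for a stemmer over hand-grouped words.

    concept_groups: list of lists; each inner list holds words that
                    should all be conflated to the same stem.
    stem:           callable mapping a word to its stem.
    """
    total_words = sum(len(g) for g in concept_groups)
    gdmt = gdnt = gumt = gwmt = 0.0
    # stem -> how many words of each concept group received that stem
    stem_groups = defaultdict(Counter)

    for gi, group in enumerate(concept_groups):
        n = len(group)
        gdmt += 0.5 * n * (n - 1)            # pairs that should merge
        gdnt += 0.5 * n * (total_words - n)  # pairs that should stay apart
        counts = Counter(stem(w) for w in group)
        # pairs inside the group that ended up with different stems
        gumt += 0.5 * sum(u * (n - u) for u in counts.values())
        for s, u in counts.items():
            stem_groups[s][gi] += u

    for counts in stem_groups.values():
        size = sum(counts.values())
        # pairs sharing a stem although they come from different groups
        gwmt += 0.5 * sum(v * (size - v) for v in counts.values())

    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    sw = oi / ui if ui else float("inf")  # SW is undefined when UI is 0
    return ui, oi, sw


if __name__ == "__main__":
    # Toy data: two concept groups and a lookup-table "stemmer".
    groups = [["aa1", "aa2", "aa3"], ["bb1", "bb2"]]
    table = {"aa1": "a", "aa2": "a", "aa3": "x", "bb1": "x", "bb2": "b"}
    print(paice_indices(groups, table.get))
```

A low UI means few related words were left unmerged, a low OI means few unrelated words were merged, and SW indicates whether a stemmer leans light (small SW) or heavy (large SW).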
- Research Article
23
- 10.1016/j.jksuci.2016.11.010
- Dec 2, 2016
- Journal of King Saud University - Computer and Information Sciences
Enhancing Arabic stemming process using resources and benchmarking tools
- Research Article
4
- 10.32604/cmc.2021.016155
- Jan 1, 2021
- Computers, Materials & Continua
This paper introduces a new enhanced Arabic stemming algorithm for solving the information retrieval problem, especially in medical documents. Our proposed algorithm is a light stemming algorithm for extracting stems and root...
- Conference Article
8
- 10.1109/intellisys.2017.8324233
- Sep 1, 2017
Document clustering plays a vital role in text mining fields such as information retrieval, sentiment analysis, and text organization. Document clustering aims to automatically divide a collection of documents, based on some aspects of similarity, into groups that are meaningful, useful or both. This paper aims to improve the clustering task for Arabic documents. Recent studies show that partitioning clustering algorithms are more suitable for the clustering process, and k-means is the most commonly used of them because of its simplicity and speed. However, it can only generate an arbitrary solution, because the results depend on the initial centers of the desired clusters (the “seeds”). In this paper, a new modified k-means algorithm called PSO K-means, supported by Particle Swarm Optimization (PSO), is applied to enhance the Arabic document clustering process. Then, an intensive comparative study between the proposed model and the standard k-means algorithm is carried out. The stemming algorithms used in Arabic language processing are also assessed. In the experiments, the new algorithm is evaluated on three different Arabic data sets. The results demonstrate that the proposed model can produce more accurate results than the standard k-means algorithm for Arabic documents. In addition, the Arabic light stemmer is found to be more suitable for the stemming step.
- Research Article
- 10.11591/ijece.v15i2.pp2356-2363
- Apr 1, 2025
- International Journal of Electrical and Computer Engineering (IJECE)
Our study introduces an innovative light stemming tool tailored for Arabic morphology challenges. In conformance with the templatic and concatenative structures, our stemmer utilizes a combination of clitic stripping, lexicon-based, and statistical disambiguation techniques to ensure accurate stemming. To accomplish this, we rely on our clitic rules lexicon to detect all potential combinations of clitics for each input entry. Subsequently, we depend on an extensive lexicon of over 7 million stems to verify the potential stems. Lastly, we employ a statistical model to ascertain the most likely stem based on the sentence's context. Experimental results demonstrate the effectiveness of the proposed stemmer in comparison with existing ones. Using different datasets, our stemmer achieves higher accuracy and F1 scores, highlighting its efficiency in Arabic stemming tasks.
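The three stages named above (clitic stripping, lexicon verification, statistical disambiguation) can be illustrated with a toy sketch. Everything in it is hypothetical: the clitic lists, the four-entry lexicon, and the made-up counts. In particular, the paper describes a statistical model that uses the sentence's context, whereas this sketch substitutes a context-free frequency lookup simply to keep the example short.

```python
from itertools import product

# Hypothetical clitic inventories; "" means no clitic on that side.
PROCLITICS = ["", "و", "ال", "وال", "ب", "بال"]
ENCLITICS = ["", "ه", "ها", "هم"]

# Hypothetical stem lexicon with made-up frequency counts.
STEM_LEXICON = {"كتاب": 120, "بيت": 200, "وجد": 150, "جد": 80}


def candidate_splits(word):
    """Stage 1: enumerate every (proclitic, stem, enclitic) split of the word."""
    splits = []
    for pro, enc in product(PROCLITICS, ENCLITICS):
        if len(pro) + len(enc) >= len(word):
            continue  # nothing would be left for the stem
        if word.startswith(pro) and word.endswith(enc):
            middle = word[len(pro):len(word) - len(enc)]
            splits.append((pro, middle, enc))
    return splits


def stem(word):
    """Stages 2 and 3: keep lexicon-attested stems, then pick the most frequent."""
    valid = [s for s in candidate_splits(word) if s[1] in STEM_LEXICON]
    if not valid:
        return word  # no attested analysis: leave the word unchanged
    # Stand-in for the statistical model: choose the highest-count stem.
    return max(valid, key=lambda s: STEM_LEXICON[s[1]])[1]


if __name__ == "__main__":
    for w in ["وكتابها", "بالبيت", "وجد"]:
        print(w, "->", stem(w))
```

The last example shows why the later stages matter: the word can be read either as a single stem or as a conjunction plus a shorter stem, and both readings pass the lexicon check, so some ranking step has to choose between them.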
- Research Article
18
- 10.1016/j.ipm.2005.07.002
- Sep 2, 2005
- Information Processing and Management
Stemming to improve translation lexicon creation from bitexts
- Book Chapter
12
- 10.1007/978-3-030-84532-2_4
- Jan 1, 2021
The process of stemming is considered one of the most essential steps in natural language processing and information retrieval. Nevertheless, in the Arabic language, the task of stemming remains a major challenge because Arabic has a particular morphology that makes it different from other languages. The majority of existing algorithms are limited to a given number of words, create ambiguity between original letters and affixes, and often make use of dictionaries of patterns or words. We therefore present, for the first time, a design and implementation of an Arabic light stemmer based on the Information Science Research Institute (ISRI) algorithm. The algorithm is evaluated empirically using a newly created Arabic dataset, built from the content of different Arabic websites written in modern Arabic. The experimental results indicated that the proposed method outperforms existing methods when benchmarked against them.
- Research Article
- 10.4018/ijcac.339563
- Feb 26, 2024
- International Journal of Cloud Applications and Computing
The identification of ambiguities in Arabic requirement documents plays a crucial role in requirements engineering. This is because the quality of requirements directly impacts the overall success of software development projects. Traditionally, engineers have used manual methods to evaluate requirement quality, leading to a time-consuming and subjective process that is prone to errors. This study explores the use of machine learning algorithms to automate the assessment of requirements expressed in natural language. The study aims to compare various machine learning algorithms according to their abilities in classifying requirements written in Arabic as decision tree. The findings reveal that random forest outperformed all stemmers, achieving an accuracy of 0.95 without employing a stemmer, 0.99 with the ISRI stemmer, and 0.97 with the Arabic light stemmer. These results highlight the robustness and practicality of the random forest algorithm.
- Research Article
44
- 10.1002/asi.23609
- Dec 23, 2015
- Journal of the Association for Information Science and Technology
Arabic news articles in electronic collections are difficult to study. Browsing by category is rarely supported. Although helpful machine‐learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a Qatar National Research Fund (QNRF)‐funded project to build digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237,000 Arabic news articles, which should be applicable to other Arabic news collections. We designed a simple taxonomy for Arabic news stories that is suitable for the needs of Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic‐speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer called P‐Stemmer) and automatic classification methods (the best being binary Support Vector Machines classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10‐fold cross‐validation and the Wilcoxon signed‐rank test, we showed that our approach to stemming and classification is superior to state‐of‐the‐art techniques.