Statistical Bayesian Learning for Automatic Arabic Text Categorization

Al-Salemi Al-Salemi

doi:10.3844/jcssp.2011.39.45

Open Access

https://doi.org/10.3844/jcssp.2011.39.45

Abstract

Problem statement: The rapid growth of online Arabic documents has made it necessary to apply Text Categorization techniques, commonly used for English, to categorize them automatically. The complex morphology of Arabic and its large vocabulary make applying these techniques directly difficult and costly in time and effort. Approach: We investigated Bayesian learning models in order to enhance Arabic Automatic Text Categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Naive Bayes (NB), Multi-variate Bernoulli Naive Bayes (MBNB) and Multinomial Naive Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, Bag-Of-Words (BOW) and character-level n-grams of length 3, 4 and 5 were used. To reduce the dimensionality of the feature space, the following feature selection methods were used: Mutual Information, Chi-Square statistic, Odds Ratio and GSS-coefficient. Conclusion: The MBNB classifier outperformed both the NB and MNB classifiers. The BOW representation led to the best classification performance; nevertheless, character-level n-grams led to satisfactory results for Arabic ATC based on Bayesian learning. Moreover, the use of feature selection methods dramatically increased categorization performance.
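The character-level n-gram representation and the Chi-Square feature score named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`char_ngrams`, `ngram_features`, `chi_square`) are hypothetical, and the Chi-Square score uses the common 2x2 contingency-table form, which is assumed here rather than taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, lengths=(3, 4, 5)):
    """Count character n-grams for each requested length (3, 4 and 5 by default)."""
    counts = Counter()
    for n in lengths:
        counts.update(char_ngrams(text, n))
    return counts


def chi_square(a, b, c, d):
    """Chi-Square score of a term for a category (assumed textbook form).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or category is degenerate; score it 0.
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom
```

Python strings are sequences of Unicode code points, so the same functions apply unchanged to Arabic text. Terms would then be ranked by their score per category and only the top-scoring ones kept as features.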

Highlights

Automatic Text Categorization (ATC) is the task of automatically assigning a given document to its predefined category.
Instead of using the classical text classification models, which consist of a set of manually defined logical rules, the Machine Learning (ML) approach had been [...]
The most common supervised ML algorithms are Statistical Learning algorithms, which provide the probability that a given document is assigned to particular classes based on a probabilistic model.
[...] and Nigam, 1998; Schneider, 2003; Mendez et al., 2008; Yang and Pedersen, 1997) are probabilistic models that all apply Bayes' theorem, while the way of computing the probability differs.
Machine Learning (ML) approach [...] category and process them by several Information Retrieval (IR) techniques to extract a set of features used as characteristics for each category.
[...] test and evaluate the performance of the classifier by classifying the documents under each category as unseen documents and comparing the estimated categories to the pre-defined ones to measure the classification [...]

Summary

Introduction

Automatic Text Categorization (ATC) is the task of automatically assigning a given document to its predefined category. Instead of using the classical text classification models, which consist of a set of manually defined logical rules, the Machine Learning (ML) approach had been [...] The most common supervised ML algorithms are Statistical Learning algorithms, which provide the probability that a given document is assigned to particular classes based on a probabilistic model.

[...] and Nigam, 1998; Schneider, 2003; Mendez et al., 2008; Yang and Pedersen, 1997) are probabilistic models that all apply Bayes' theorem, while the way of computing the probability differs. ML approach [...] category and process them by several Information Retrieval (IR) techniques to extract a set of features used as characteristics for each category. [...] test and evaluate the performance of the classifier by classifying the documents under each category as unseen documents and comparing the estimated categories to the pre-defined ones to measure the classification [...]
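To make concrete how probabilistic models can share Bayes' theorem yet compute the probability differently, the following Python sketch contrasts multinomial and multi-variate Bernoulli scoring. It is a textbook-style illustration under stated assumptions, not the paper's code: the function names, the dictionary layout of `model`, and the use of Laplace (add-one) smoothing are all hypothetical choices, and the paper's "Simple Naive Bayes" variant is not shown.

```python
import math
from collections import Counter


def train(docs, labels, vocab):
    """Estimate class priors and Laplace-smoothed term parameters (assumed smoothing)."""
    model = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # Term frequencies feed the multinomial model.
        tf = Counter(t for d in class_docs for t in d if t in vocab)
        # Document frequencies feed the Bernoulli model.
        df = Counter(t for d in class_docs for t in set(d) if t in vocab)
        total = sum(tf.values())
        model[c] = {
            "prior": len(class_docs) / len(docs),
            "mn": {t: (tf[t] + 1) / (total + len(vocab)) for t in vocab},
            "bn": {t: (df[t] + 1) / (len(class_docs) + 2) for t in vocab},
        }
    return model


def score_multinomial(model, c, doc):
    """log P(c) + sum of log P(t|c) over every term occurrence in the document."""
    m = model[c]
    return math.log(m["prior"]) + sum(
        math.log(m["mn"][t]) for t in doc if t in m["mn"]
    )


def score_bernoulli(model, c, doc):
    """log P(c) + log-likelihood of presence AND absence of every vocabulary term."""
    m = model[c]
    present = set(doc)
    score = math.log(m["prior"])
    for t, p in m["bn"].items():
        score += math.log(p if t in present else 1.0 - p)
    return score


def classify(model, doc, score):
    """Return the class with the highest posterior score under the given model."""
    return max(model, key=lambda c: score(model, c, doc))
```

The key difference is that the multinomial score depends only on the terms that occur in the document (counted with repetition), whereas the Bernoulli score also penalizes vocabulary terms that are absent. Evaluation as described above would classify held-out documents with `classify` and compare the predicted labels to the pre-defined ones.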
