Part of Speech Tagging for Arabic Long Sentence

Ahmed H Aliwy, Duaa A Al_Raza

doi:10.14419/ijet.v7i3.27.17671

Abstract

Part-of-speech (POS) tagging of Arabic words is a difficult and non-trivial task. It has been studied in detail over the last twenty years, and its performance affects many applications and tasks in natural language processing (NLP). Sentences in Arabic are very long compared with English sentences. This affects the tagging process for any approach that deals with the complete sentence at once, as the Hidden Markov Model (HMM) tagger does. In this paper, a new approach is suggested for using HMM and n-gram taggers to tag Arabic words in long sentences. The suggested approach is very simple and easy to implement. It is implemented on a data set of 1000 documents containing 526321 manually annotated tokens (including punctuation). The results show that the suggested approach has higher accuracy than the HMM and n-gram taggers. The F-measures were 0.888, 0.925 and 0.957 for the n-gram tagger, the HMM tagger and the suggested approach, respectively.

Highlights

In this work, we try a new method for part-of-speech (POS) tagging that has the properties of both the Hidden Markov Model (HMM) and n-grams, so that it can be used on long Arabic sentences with high precision.
Part-of-speech tagging, also called word-category disambiguation, is the task of choosing the right tag for each word in a sentence from a finite set of tags [1][2].
Rule-based tagging, HMM, relaxation labeling, transformation-based tagging (Brill), genetic algorithms, n-gram models, decision trees, memory-based learning, cyclic dependency networks, neural network taggers, support vector machines, fuzzy set theory, boosting, best match, maximum entropy taggers and combinations of different taggers are techniques used in tagging systems. Almost all of them have been applied to Arabic, but they need a training corpus of suitable size.

Summary

Introduction

We try a new method for POS tagging that has the properties of both HMM and n-grams, so that it can be used on long Arabic sentences with high precision. In most cases, labeled data (a corpus) is used for training the supervised methods of POS tagging [1]. Tagging can be treated as a classification problem in which the tags of the tagset are the classes and the words are the input. Rule-based tagging, HMM, relaxation labeling, transformation-based tagging (Brill), genetic algorithms, n-gram models, decision trees, memory-based learning, cyclic dependency networks, neural network taggers, support vector machines, fuzzy set theory, boosting, best match, maximum entropy taggers and combinations of different taggers are techniques used in tagging systems. Almost all of them have been applied to Arabic, but they need a training corpus of suitable size. POS taggers are evaluated using precision, recall and F-measure. These measures depend on the nature of the language, the size of the tagset, the size of the corpus used and other factors. This makes POS tagging for Arabic more complicated than for other languages.
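As an illustration of the evaluation measures named above, the following minimal sketch computes per-tag precision, recall and F-measure from a gold tag sequence and a predicted tag sequence. The function name `precision_recall_f1` and the tag labels `NOUN`, `VERB` and `PREP` are hypothetical and chosen only for the example; they are not the paper's tagset, and the paper does not state how its per-tag scores were averaged, so no averaging is shown.

```python
from collections import Counter


def precision_recall_f1(gold_tags, predicted_tags):
    """Return {tag: (precision, recall, f1)} for two aligned tag sequences."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")

    gold_count = Counter(gold_tags)        # how often each tag occurs in the gold data
    pred_count = Counter(predicted_tags)   # how often the tagger emitted each tag
    true_pos = Counter(g for g, p in zip(gold_tags, predicted_tags) if g == p)

    scores = {}
    for tag in set(gold_count) | set(pred_count):
        tp = true_pos[tag]
        precision = tp / pred_count[tag] if pred_count[tag] else 0.0
        recall = tp / gold_count[tag] if gold_count[tag] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Toy example with hypothetical tags (not taken from the paper's corpus).
gold = ["NOUN", "VERB", "NOUN", "PREP"]
predicted = ["NOUN", "NOUN", "NOUN", "PREP"]
scores = precision_recall_f1(gold, predicted)
for tag in sorted(scores):
    p, r, f = scores[tag]
    print(f"{tag}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy example the tagger labels one verb as a noun, so `NOUN` keeps full recall but loses precision, while `VERB` scores zero on all three measures.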

Related Work

Theory Background

The Suggested Approach

Results