Extracting an Arabic Lexicon from Arabic Newspaper Text

Saleem Abuleil, Martha Evens

doi:10.1023/a:1014368121689

Abstract

We describe how to build a large, comprehensive, integrated Arabic lexicon by automatically parsing newspaper text. We have built a parser system that reads Arabic newspaper articles, isolates their tokens, and finds the part of speech and the features of each token. To achieve this goal, we designed a set of algorithms, generated several sets of rules, and developed a set of techniques along with the components that carry them out. As each sentence is processed, new words and features are added to the lexicon, so that it grows continuously as the system runs. To test the system, we used 100 articles (80,444 words) from the Al-Raya newspaper. The system consists of several modules: the tokenizer module, which isolates the tokens; the type finder system, which finds the part of speech of each token; the proper noun phrase parser module, which marks the proper nouns and discovers some information about them; and the feature finder module, which finds the features of the words.
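The four-module flow described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the system's actual design: the function names, the `Entry` record, the transliterated sample tokens, and the toy rules (a title keyword that triggers a proper noun, and an "al" prefix treated as a definite noun) are all hypothetical stand-ins for the real algorithms and rule sets.

```python
import re
from dataclasses import dataclass, field

# Hypothetical trigger words (transliterated); a token that follows one
# of them is treated as a proper noun in this toy example.
TITLE_KEYWORDS = {"alsayyid"}


@dataclass
class Entry:
    """One lexicon record: a part of speech plus a feature dictionary."""
    pos: str
    features: dict = field(default_factory=dict)


def tokenize(sentence):
    # Tokenizer stage: isolate word tokens, dropping punctuation.
    return re.findall(r"\w+", sentence)


def find_type(token):
    # Type finder stage (toy rule): an "al" prefix suggests a noun.
    return "noun" if token.startswith("al") else "unknown"


def mark_proper_nouns(tokens, tags):
    # Proper noun stage (toy rule): the token after a title keyword
    # is marked as a proper noun.
    for i in range(1, len(tokens)):
        if tokens[i - 1] in TITLE_KEYWORDS:
            tags[i] = "proper_noun"
    return tags


def find_features(token, pos):
    # Feature finder stage (toy rule): nouns with "al" are definite.
    if pos == "noun" and token.startswith("al"):
        return {"definite": True}
    return {}


def process_sentence(sentence, lexicon):
    # Run the four stages and add only unseen words, so the lexicon
    # grows as more sentences are processed.
    tokens = tokenize(sentence)
    tags = mark_proper_nouns(tokens, [find_type(t) for t in tokens])
    for token, pos in zip(tokens, tags):
        if token not in lexicon:
            lexicon[token] = Entry(pos, find_features(token, pos))
    return lexicon
```

Calling `process_sentence` repeatedly on the same dictionary mimics the incremental growth of the lexicon: words already present keep their entries, and only new words are added.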

Full Text