Evaluation of Topic Identification Methods for Arabic Texts and their Combination by using a Corpus Extracted from the Omani Newspaper Alwatan

Mourad Abbas, Kamel Smaili, Daoud Berkani

doi:10.51758/agjsr-3/4-2011-0017

Abstract

Topic identification is used in several applications, such as adapting language models for speech recognition and machine translation, or focusing search engines on a specific use. Topic identification consists of assigning one or several topic labels to a flow of textual data. The labels are chosen from a set of topics fixed a priori. In this paper, we present a study on identifying the topics of Arabic texts. This requires a considerable amount of data, so we started by collecting texts from the website of the Omani newspaper “Alwatan”. The result is an Arabic corpus of more than 9000 articles, corresponding to nearly 10 million words. The topics considered in our experiments are: Culture, Religion, Economy, Local news, International news and Sports. Some of the methods presented in this study are well known in the text categorization community, such as the TFIDF classifier and kNN (k Nearest Neighbors). The objective of using these methods is to compare them with the TR-classifier (TRiggers-based classifier), a new method that we have proposed, which is based on computing triggers, that is, the Average Mutual Information of each pair of words. In order to enhance performance, we combined the results of the three methods using three approaches: Majority Vote, Enhanced Majority Vote and Linear Combination.
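
To make the combination step more concrete, the following is a minimal Python sketch of two of the named approaches, Majority Vote and Linear Combination, applied to the outputs of three classifiers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tie-breaking rule (fall back to one designated classifier), the weights, and the assumption that each classifier returns comparable per-topic scores are all hypothetical. Enhanced Majority Vote is not sketched because the abstract does not define it.

```python
from collections import Counter


def majority_vote(labels, fallback_index=0):
    """Return the label chosen by most classifiers.

    Assumption: when no single label has the most votes, fall back to
    the prediction of the classifier at `fallback_index`.
    """
    counts = Counter(labels)
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    if len(winners) > 1:
        return labels[fallback_index]
    return winners[0]


def linear_combination(score_dicts, weights):
    """Return the topic with the highest weighted sum of classifier scores.

    Assumption: every classifier provides a score per topic on a
    comparable scale (for example, normalized to sum to 1).
    """
    combined = {}
    for scores, weight in zip(score_dicts, weights):
        for topic, score in scores.items():
            combined[topic] = combined.get(topic, 0.0) + weight * score
    return max(combined, key=combined.get)


# Hypothetical outputs of the three classifiers for one article.
votes = ["Economy", "Economy", "Sports"]
scores = [
    {"Economy": 0.6, "Sports": 0.4},  # e.g. TFIDF classifier
    {"Economy": 0.3, "Sports": 0.7},  # e.g. kNN
    {"Economy": 0.2, "Sports": 0.8},  # e.g. TR-classifier
]
weights = [0.5, 0.25, 0.25]  # illustrative weights, not from the paper

voted_topic = majority_vote(votes)
combined_topic = linear_combination(scores, weights)
```

With these made-up inputs the two schemes disagree: the vote follows the two classifiers that agree on a label, while the weighted sum follows the larger score margins of the other two. This is one reason such combination schemes are usually compared empirically.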

Journal: Arab Gulf Journal of Scientific Research
Publication Date: Dec 1, 2011
Citations: 1
