Abstract
Topic identification is used in several applications, as adapting language models for speech recognition and machine translation, focusing on a specific use for search engines, etc. Topic identification consists to assign one or several topic labels to a flow of textual data. Labels are chosen from a set of topics fixed a priori. In this paper, we present a study about identifying topics of Arabic texts. For this, a considerable amount of data is needed. Thus, we started by collecting texts from the website of the Omani newspaper “Alwatan”. The result is an Arabic corpus composed of more than 9000 articles corresponding to nearly 10 millions words. The considered topics in our experiments are: Culture, Religion, Economy, Local news, International news and sports. Some of the methods presented in this study, are well known in the text categorization community, as TFIDF classifier and kNN “k Nearest Neighbors”. The objective to use these methods is to compare them to TR-classifier “TRiggers-based classifier”, a new method that we have proposed, which is based on computing triggers or the Average Mutual Information of each couple of words. In order to enhance performances, we have combined results of the three methods by using three approaches: Majority Vote, Enhanced Majority Vote and Linear Combination.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.