Some Investigations on Machine Learning Techniques for Automated Text Categorization

Bhagirath Prajapati,N C Chauhan,Sanjay Garg

doi:10.5120/12340-8617

Abstract

The automated categorization (classification) of texts into predefined categories is one of the most widely explored fields of research in text mining. Nowadays the volume of available digital data is very high, and managing it in predefined categories has become a challenging task. Machine learning is an approach by which an automated classifier can be trained to classify documents with minimal human assistance. This paper discusses the Naive Bayes, Rocchio, k-Nearest Neighbor and Support Vector Machine methods within the machine learning paradigm for the automated categorization of given documents into predefined categories.

Keywords: Machine learning, Text categorization.

1. INTRODUCTION

In this cyber age, the availability of digital documents has increased drastically. Accessing documents in a convenient way has become a difficult task as their number and size grow day by day. One relevant task is Text Categorization (TC), which labels natural language text with predefined categories. Earlier, Knowledge Engineering (KE) techniques were used for TC. KE is used in expert systems, which consist of manually defined logical rules in Disjunctive Normal Form (DNF) of the type: if (DNF formula) then (category). A document is classified under a particular category only if it satisfies the rule. The drawback of this approach is the knowledge acquisition bottleneck: an expert has to form a DNF rule for each new category.
In the last decade, the Machine Learning (ML) approach has gained popularity. In this approach, a general inductive process (the learner) automatically builds a classifier, which then automatically classifies documents into predefined categories. Automated text categorization of documents into predefined categories is becoming popular in this digital age.
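To make the KE approach concrete, the rule form "if (DNF formula) then (category)" described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule `REAL_ESTATE_RULE`, its terms, and the helper names `tokenize`, `satisfies_dnf` and `classify` are hypothetical and do not come from any particular expert system.

```python
import re
from typing import List, Optional, Set, Tuple

# A term test is (term, must_be_present); a clause is an AND of term tests;
# a DNF rule is an OR of clauses.
TermTest = Tuple[str, bool]
Clause = List[TermTest]

def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphabetic terms."""
    return set(re.findall(r"[a-z]+", text.lower()))

def satisfies_dnf(terms: Set[str], rule: List[Clause]) -> bool:
    """True if at least one clause has all of its term tests satisfied."""
    return any(
        all((term in terms) == must_be_present for term, must_be_present in clause)
        for clause in rule
    )

# Hypothetical hand-written rule (an assumption for illustration only):
# (house AND sale) OR (apartment AND rent AND NOT car)
REAL_ESTATE_RULE: List[Clause] = [
    [("house", True), ("sale", True)],
    [("apartment", True), ("rent", True), ("car", False)],
]

def classify(text: str, rule: List[Clause], category: str) -> Optional[str]:
    """Assign the category only if the document satisfies the rule."""
    return category if satisfies_dnf(tokenize(text), rule) else None
```

Every new category would need another hand-written rule of this kind, which is the knowledge acquisition bottleneck mentioned above.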
Because the availability of digital documents is increasing dramatically, it has become necessary to investigate and develop novel techniques for automated text categorization. Automated text categorization is applicable to document organization, where each document has to be assigned to an appropriate category. For example, at a newspaper agency each incoming advertisement has to be classified into one of several categories, such as real estate, car for sale, or office for rent. Done manually, this task would take a great deal of time. Automated systems are therefore required that accept an advertisement as input and assign it to one of the predefined categories. A second example involves classifying a dynamic collection of text: in e-mail filtering, a computerized system is trained on “spam” mails so that it can filter them out from non-spam mails [2].
Machine learning is an area of artificial intelligence that deals with the study of methods for making computers learn like humans. Automated techniques from AI and machine learning have been developed to handle many problems of pattern recognition and categorization; text categorization of documents is one such task. Traditionally, text categorization has been carried out with KE techniques, in which human assistance is needed to form the decision rules for each individual category. The idea, therefore, is to explore the application of machine learning techniques to automated text categorization [2, 3], which can be free of human intervention. Automated text categorization with machine learning has gained a prominent status in the information systems field. In this technique, a learner is implemented that automatically learns from previously classified documents.
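As a rough illustration of a learner that learns from previously classified documents, the sketch below trains a small multinomial Naive Bayes classifier (one of the methods named in the abstract) on labeled e-mails. It is a minimal sketch, not the implementation used in this paper: the class name `NaiveBayesText`, the add-one (Laplace) smoothing, and the toy training messages are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesText:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.term_counts = defaultdict(Counter)    # term frequencies per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(tokenize(doc))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.n_docs = len(labels)
        return self

    def predict(self, doc):
        # Terms never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_label / self.n_docs)
            for t in tokens:
                score += math.log(
                    (self.term_counts[label][t] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy labeled data, invented for this example.
train_docs = [
    "win cash prize now",
    "cheap loans win money",
    "meeting agenda for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "non-spam", "non-spam"]

model = NaiveBayesText().fit(train_docs, train_labels)
```

Calling `model.predict("win a cash prize")` scores the message against both classes and returns the more probable label. Unlike the KE approach, no rule is written by hand; adding a category only requires labeled examples of it.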
As the applications discussed above show, the importance of automated text categorization encourages the implementation of computerized systems that can classify incoming documents into the appropriate predefined category. In implementing such a system, a machine learning algorithm is to be explored that can be trained on some labeled documents and can then classify incoming unlabeled documents into their appropriate categories with high accuracy. Many machine learning algorithms are available for building a learner for a text categorization system. It is therefore interesting to implement a few popular text classification techniques and to perform a comparative analysis of them in terms of accuracy. This paper has the following objectives. The first is to explore the basic preprocessing steps for text categorization. The paper also presents a study of some machine learning techniques for text classification and investigates performance issues in text categorization.

Full Text