FEATURE SELECTION USING SINGULAR VALUE DECOMPOSITION AND ORTHOGONAL CENTROID FEATURE SELECTION FOR TEXT CLASSIFICATION

Hoan Dau Manh

doi:10.15623/ijret.2016.0505001

Abstract

Text mining is a subfield of data mining that focuses on discovering new information from collections of text documents, mainly by using techniques from data mining, machine learning, natural language processing and information retrieval. Text classification is the process of analyzing the content of a text and then deciding whether it belongs to one group, to several groups, or to none of the predefined text groups. Around the world, there has been much effective research on this problem, especially for texts in English. However, there has been little research on Vietnamese texts. Moreover, the existing research results and applications are still limited, partly because of the particular characteristics of the Vietnamese language at the word and sentence level, and because many words have different meanings in different contexts. Text classification is a problem with many features, so improving its effectiveness is the aim of many researchers. In this research, the author implements two feature selection methods for text classification, singular value decomposition and optimal orthogonal centroid feature selection, whose computational efficiency has been demonstrated on English text documents, and now evaluates them on Vietnamese text documents. There are many classification techniques, but we use the support vector machine learning algorithm, which has been proven to be effective for text classification problems. With feature selection by singular value decomposition and optimal orthogonal centroid feature selection, the results obtained are better than those of the traditional method.

Full Text