Abstract

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, which has caused increasing interest in text mining. Text classification is one of the most important subfields of text mining. Text documents are typically represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is therefore essential for text classification due to the high dimensionality and sparsity of the DTM: an efficient feature selection method is capable of both reducing the dimensionality of the DTM and selecting discriminative features. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To address this, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features in text documents so as to select discriminative ones. Experimental results indicate that DVS is able to select discriminative features and reduce the sparsity of the DTM. Thus, it is much more efficient than LS.
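The Laplacian Score baseline mentioned in the abstract is a standard unsupervised criterion (He et al., 2005) that favors features preserving local manifold structure. A minimal NumPy sketch is given below; it assumes a k-nearest-neighbour graph with heat-kernel weights, and the parameters `k` and `t` are illustrative choices, not values taken from this paper.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score (He et al., 2005): lower scores mark features
    that best preserve the local neighbourhood structure of the data.
    X is an (n_samples, n_features) matrix, e.g. a DTM."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between documents
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-nearest-neighbour graph with heat-kernel similarities
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip self at index 0
        S[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    S = np.maximum(S, S.T)                  # symmetrise the graph
    D = np.diag(S.sum(axis=1))              # degree matrix
    L = D - S                               # graph Laplacian
    ones = np.ones(n)
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # Remove the weighted mean of the feature
        f_t = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        denom = f_t @ D @ f_t
        scores[r] = (f_t @ L @ f_t) / denom if denom > 1e-12 else np.inf
    return scores
```

Features are then ranked by ascending score and the top-ranked ones retained. On a sparse DTM, many term-frequency columns are dominated by zeros, which is exactly the setting where the paper argues LS breaks down.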

Highlights

  • Text classification is one of the most important subfields of text mining and has recently gained increasing attention with the rapid development of web applications such as social networks

  • Distance Variance Score (DVS) is efficient in selecting discriminative features from the document term matrix (DTM) because feature distance contribution concentrates on nonzero values and avoids the negative effects caused by zero values

  • Experimental results in this paper show that DVS is much better than Laplacian Score (LS) for feature selection in text classification, especially when N is reduced to a small number


Introduction

Text classification is one of the most important subfields of text mining, and it has recently gained increasing attention with the rapid development of web applications such as social networks. The problem of text classification can be described as follows. N text documents are represented as a document term matrix DTM = [D1, D2, …, DN]^T, where Di (i = 1, 2, …, N) denotes the ith text document, each column fj (j = 1, 2, …, M) corresponds to a term occurring in the N documents, and fij is the term frequency of fj in Di. Consider a simple example of two short text documents: the first is “You are beautiful!” and the second is “Good morning, you guys!”. With the six terms sorted alphabetically as {are, beautiful, good, guys, morning, you}, the DTM of the two documents is D1 = [1, 1, 0, 0, 0, 1] and D2 = [0, 0, 1, 1, 1, 1]. Here D1[1] = f11 = 1 indicates that the first feature, “are”, occurs once in the first text document (“You are beautiful!”). Each Di is labeled with a class c.
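The two-document DTM above can be reproduced with a short script. The following is a minimal sketch in plain Python; tokenising on lowercased alphabetic runs is an assumption, chosen to match the alphabetical feature order used in the example.

```python
import re

docs = ["You are beautiful!", "Good morning, you guys!"]

# Lowercase and split each document into alphabetic tokens
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Vocabulary: all distinct terms, sorted alphabetically
vocab = sorted({w for toks in tokenized for w in toks})

# DTM entry [i][j] is the frequency of term vocab[j] in document i
dtm = [[toks.count(t) for t in vocab] for toks in tokenized]

print(vocab)  # ['are', 'beautiful', 'good', 'guys', 'morning', 'you']
print(dtm)    # [[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]
```

Even in this toy case most entries are zero; real corpora with thousands of terms produce far sparser matrices, which is the motivation for feature selection given above.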

