Abstract

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, which has caused increasing interest in text mining. Text classification is one of the most important subfields of text mining. Text documents are typically represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is therefore essential for text classification due to the high dimensionality and sparsity of the DTM: an efficient feature selection method is capable of both reducing the dimensionality of the DTM and selecting discriminative features. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To address this, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features in text documents so as to select discriminative ones. Experimental results indicate that DVS is able to select discriminative features and reduce the sparsity of the DTM. Thus, it is much more efficient than LS.
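The Laplacian Score baseline mentioned in the abstract is a standard unsupervised criterion (He et al., 2005) that favors features preserving local manifold structure. A minimal NumPy sketch is given below; it assumes a k-nearest-neighbour graph with heat-kernel weights, and the parameters `k` and `t` are illustrative choices, not values taken from this paper.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score (He et al., 2005): lower scores mark features
    that best preserve the local neighbourhood structure of the data.
    X is an (n_samples, n_features) matrix, e.g. a DTM."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between documents
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-nearest-neighbour graph with heat-kernel similarities
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip self at index 0
        S[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    S = np.maximum(S, S.T)                  # symmetrise the graph
    D = np.diag(S.sum(axis=1))              # degree matrix
    L = D - S                               # graph Laplacian
    ones = np.ones(n)
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # Remove the weighted mean of the feature
        f_t = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        denom = f_t @ D @ f_t
        scores[r] = (f_t @ L @ f_t) / denom if denom > 1e-12 else np.inf
    return scores
```

Features are then ranked by ascending score and the top-ranked ones retained. On a sparse DTM, many term-frequency columns are dominated by zeros, which is exactly the setting where the paper argues LS breaks down.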

Highlights

  • Text classification is one of the most important subfields of text mining and has recently gained increasing attention with the rapid development of web applications such as social networks

  • Distance Variance Score (DVS) is efficient in selecting discriminative features from the document term matrix (DTM) because feature distance contribution concentrates on nonzero values and avoids the negative effects caused by zero values

  • Experimental results in this paper show that DVS is much better than Laplacian Score (LS) for feature selection in text classification, especially when N is reduced to a small number


Introduction

Text classification is one of the most important subfields of text mining, and it has recently gained increasing attention with the rapid development of web applications such as social networks. The problem of text classification can be described as follows. N text documents are represented as a document term matrix DTM = [D1, D2, …, DN]^T, where Di (i = 1, 2, …, N) denotes the ith text document, each column fj (j = 1, 2, …, M) corresponds to a term occurring in the N documents, and fij is the term frequency of fj in Di. Consider a simple example of two short text documents: the first is “You are beautiful!” and the second is “Good morning, you guys!”. With the six terms sorted alphabetically as {are, beautiful, good, guys, morning, you}, the DTM of the two documents is D1 = [1, 1, 0, 0, 0, 1] and D2 = [0, 0, 1, 1, 1, 1]. Here D1[1] = f11 = 1 indicates that the first feature, “are”, occurs once in the first text document (“You are beautiful!”). Each Di is labeled with a class c.
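The two-document DTM above can be reproduced with a short script. The following is a minimal sketch in plain Python; tokenising on lowercased alphabetic runs is an assumption, chosen to match the alphabetical feature order used in the example.

```python
import re

docs = ["You are beautiful!", "Good morning, you guys!"]

# Lowercase and split each document into alphabetic tokens
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Vocabulary: all distinct terms, sorted alphabetically
vocab = sorted({w for toks in tokenized for w in toks})

# DTM entry [i][j] is the frequency of term vocab[j] in document i
dtm = [[toks.count(t) for t in vocab] for toks in tokenized]

print(vocab)  # ['are', 'beautiful', 'good', 'guys', 'morning', 'you']
print(dtm)    # [[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]
```

Even in this toy case most entries are zero; real corpora with thousands of terms produce far sparser matrices, which is the motivation for feature selection given above.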

