Discriminative features for text document classification

K Torkkola

doi:10.1007/s10044-003-0196-8

Abstract

The bag-of-words approach to text document representation typically yields document vectors with on the order of 5000–20,000 components. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination in two popular methods for doing so: Latent Semantic Indexing (LSI), and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating with matrices that are both large and dense, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.
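
The two-stage transform described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the synthetic Poisson "term counts", the Gaussian random matrix, the intermediate size of 50, the regularization constant, and the helper names `random_projection` and `lda_transform` are all assumptions chosen for illustration. The sketch first applies a random transform to shrink the bag-of-words vectors, then computes LDA directions in the reduced space, where the scatter matrices are small enough to handle densely.

```python
import numpy as np

def random_projection(X, k, rng):
    """Map rows of X (n_docs x n_terms) to k dims with a Gaussian random matrix."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

def lda_transform(X, y, reg=1e-3):
    """Return a d x (C-1) matrix of LDA directions for data X with labels y."""
    classes = np.unique(y)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Small ridge term (an assumption) keeps Sw well conditioned.
    Sw += reg * np.trace(Sw) / d * np.eye(d)
    # Whiten with the Cholesky factor Sw = L L^T, then solve a symmetric
    # eigenproblem for M = L^{-1} Sb L^{-T}.
    L = np.linalg.cholesky(Sw)
    A = np.linalg.solve(L, Sb)
    M = np.linalg.solve(L, A.T)
    M = (M + M.T) / 2
    evals, evecs = np.linalg.eigh(M)
    top = np.argsort(evals)[::-1][: len(classes) - 1]
    # Map eigenvectors back to the unwhitened space.
    return np.linalg.solve(L.T, evecs[:, top])

# Synthetic stand-in for a document-term matrix (hypothetical data).
rng = np.random.default_rng(0)
n_classes, docs_per_class, n_terms = 3, 100, 500
rates = np.full((n_classes, n_terms), 0.2)
for c in range(n_classes):
    rates[c, c * 20:(c + 1) * 20] = 3.0  # class-specific frequent terms
y = np.repeat(np.arange(n_classes), docs_per_class)
X = rng.poisson(rates[y]).astype(float)

X_low = random_projection(X, 50, rng)  # intermediate reduction: 500 -> 50
W = lda_transform(X_low, y)            # discriminative step: 50 -> C - 1
Z = X_low @ W

# Nearest-centroid accuracy on the training data, as a rough sanity check.
centroids = np.stack([Z[y == c].mean(axis=0) for c in range(n_classes)])
dists = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
acc = float((dists.argmin(axis=1) == y).mean())
```

LDA yields at most one fewer direction than the number of classes, which is why the final representation can be so small. Replacing the random matrix with a truncated SVD of the document-term matrix would give the LSI variant of the intermediate step.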

Journal: Formal Pattern Analysis & Applications
Publication Date: Feb 1, 2004
Citations: 73

Similar Papers

Texture analysis of muscle MRI: machine learning-based classifications in idiopathic inflammatory myopathies
Keita Nagawa ... Kaiji Inoue
Scientific Reports | VOL. 11
10 May 2021

Comparison of Dimensionality Reduction Methods for Road Surface Identification System
Gonzalo Safont ... Alberto Rodríguez
01 Jan 2020

Evaluating Outcome Prediction via Baseline, End-of-Treatment, and Delta Radiomics on PET-CT Images of Primary Mediastinal Large B-Cell Lymphoma
Fereshteh Yousefirizi ... Ivan S Klyuzhin
Cancers | VOL. 16
08 Mar 2024

Exploring the performance of feature selection method using breast cancer dataset
Tsehay Admassu Assegie ... Vadivel Elanangai
Indonesian Journal of Electrical Engineering and Computer Science | VOL. 25
01 Jan 2021
