Abstract

In text classification, feature selection is necessary to alleviate the curse of dimensionality caused by high-dimensional text data. In this paper, we use class term frequency (CTF) and class document frequency (CDF) to characterize the relevance between terms and categories at the level of term frequency (TF) and document frequency (DF), respectively. Building on this relevance measurement, we introduce absolute deviation factors (ADFs) and propose three feature selection methods to identify relevant and discriminative terms: ADF based on CTF (ADF-CTF), ADF based on CDF (ADF-CDF), and ADF based on both CTF and CDF (ADF-CTDF). Absolute deviation, a statistical concept, is adopted here for the first time to measure the divergence in relevance characterized by CTF and CDF. In addition, ADF-CTF and ADF-CDF can be combined with existing DF-based and TF-based methods, respectively, yielding new ADF-based methods. Experimental results on six high-dimensional text datasets using three classifiers indicate that the ADF-based methods outperform the original DF-based and TF-based ones in 89% of cases in terms of Micro-F1 and Macro-F1, which demonstrates that integrating ADF into existing methods boosts classification performance. The findings also show that ADF-CTDF ranks first on average across the datasets and significantly outperforms the other methods in 99% of cases.
