An Improved TF-IDF Algorithm Based on Class Discriminative Strength for Text Categorization on Desensitized Data

Ting Zhang, Shuzhi Sam Ge

doi:10.1145/3319921.3319924

Abstract

The TF-IDF algorithm is one of the most common methods of text representation, and basic natural language processing tasks such as text classification generally use it to represent text. However, the traditional TF-IDF algorithm has many shortcomings and often performs poorly. Many improved methods have achieved good results, but they are ineffective on desensitized or encrypted data. In this paper, we propose a novel notion, class discriminative strength, and use it to improve TF-IDF. We name the new algorithm TF-IDF-ρ and use it to represent desensitized data for text classification. Experimental results on the validation set, including the recall rate, the precision rate, and the F1 measure, show that it is effective. Finally, experiments on the desensitized test set indicate that, relative to the traditional TF-IDF, TF-IDF-ρ increases the F1 measure by up to 4.07%.
