Abstract

As a vital step in text classification (TC), the assignment of term weights has a great influence on classification performance. Many term weighting methods are currently available, such as term frequency-inverse document frequency (TF-IDF) and term frequency-relevance frequency (TF-RF). Both consist of a local part (TF) and a global part (e.g., IDF or RF). Most of these methods apply logarithmic processing to their global parts, so it is natural to ask whether logarithmic processing suits all of them. In practice, because logarithmic processing changes the ratio of local weight to global weight, a given term weighting method often yields different text classification results on different text sets, which indicates poor robustness. To explore the influence of logarithmic processing on the TC performance of term weighting methods, TF-RF is selected as the representative because it achieves relatively stable performance among the methods that adopt logarithmic processing. Then, to balance the local and global parts of TF-RF, an improved term weighting method based on TF-RF is proposed, named term frequency-exponential relevance frequency (TF-ERF). Two groups of experiments are conducted on TF-ERF and other existing term weighting methods using two general standard corpora. The results show that the improved term weighting method TF-ERF achieves better text classification performance and robustness.
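
To make the local/global split concrete, the following is a minimal sketch of how a TF-IDF and a TF-RF weight are commonly computed. The formulas are the definitions usually cited in the literature (IDF as log2(N / df), RF as log2(2 + a / max(1, c))) and are an assumption here, since the abstract does not state them. The function and parameter names are hypothetical. The exact form of TF-ERF is not given in the abstract, so it is not shown.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Global part of TF-IDF (assumed form): log2(N / df)."""
    return math.log2(n_docs / doc_freq)


def rf(pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Global part of TF-RF (assumed form): log2(2 + a / max(1, c)).

    a = documents of the positive category that contain the term,
    c = documents of the negative category that contain the term.
    """
    return math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))


def tf_idf(tf: float, n_docs: int, doc_freq: int) -> float:
    """Local part (raw term frequency) times global part (IDF)."""
    return tf * idf(n_docs, doc_freq)


def tf_rf(tf: float, pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Local part (raw term frequency) times global part (RF)."""
    return tf * rf(pos_docs_with_term, neg_docs_with_term)


if __name__ == "__main__":
    # A term that occurs twice in a document, appears in 6 positive-category
    # documents and 1 negative-category document.
    print(tf_rf(2, 6, 1))
    # The same term in a collection of 8 documents, 2 of which contain it.
    print(tf_idf(2, 8, 2))
```

In this sketch the logarithm compresses the global factor: the raw ratio term 2 + 6/1 = 8 becomes 3 after log2, so the term frequency carries relatively more of the final weight than it would without the logarithm. This is the balance between local and global parts that the abstract refers to.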
