Abstract
The current TFIDF (Term Frequency-Inverse Document Frequency) algorithm cannot effectively reflect the relationship between the importance of a word and its distribution. This paper proposes a Class Variance-Term Frequency and Inverted Document Frequency algorithm, which improves TFIDF using three distribution factors: category, inter-class and variance. To measure the effect of this optimization, three algorithms were compared: the original TFIDF algorithm, the improved algorithm, and a TFIDF algorithm based on a dual parallel calculation model. Experiments show that the improved algorithm achieves significantly higher recall, accuracy, and F-measure values than the original algorithm, and also improves on the TFIDF algorithm based on the dual parallel calculation model. The improved algorithm is therefore well suited to feature word extraction and delivers better text classification performance.
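For reference, the baseline that the abstract sets out to improve can be sketched as follows. This is a minimal illustration of the common textbook TFIDF weighting (relative term frequency multiplied by the logarithm of the inverse document frequency), not the improved algorithm proposed in the paper, whose formula is not given in the abstract. The function name `tfidf` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed textbook form: tf = count / document length,
    idf = ln(N / df), where df is the number of documents
    that contain the term.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["stock", "market", "price"],
    ["football", "match", "score"],
    ["stock", "price", "price"],
]
weights = tfidf(docs)
```

In this form the weight depends only on how often a term occurs in a document and on how many documents contain it. Class labels play no role, which is the kind of distribution information the abstract says the baseline fails to capture.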