Research and Improvement of TF-IDF Algorithm Based on Information Theory

Long Cheng, Yang Yang, Zhipeng Gao, Kang Zhao

doi:10.1007/978-3-030-14680-1_67

Abstract

With the development of information technology and the growing richness of network information, people can search for and obtain the information they need from the network more and more easily. However, quickly finding the required information within such a massive volume of network information remains an important problem. Information retrieval technology emerged to address it, and keyword extraction is one of its important supporting technologies. Currently, the most widely used keyword extraction technique is the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. Its basic principle is to compute, for each word, how often it occurs in a document (term frequency) and how rare it is across the document collection (inverse document frequency), then rank the words by the resulting score and select the top few as keywords. The TF-IDF algorithm is simple and highly reliable, but it also has deficiencies. This paper analyzes the shortcomings of an improved TF-IDF algorithm and optimizes it from the perspective of information theory. It introduces the information entropy and relative entropy of information theory as calculation factors in that improved TF-IDF algorithm to optimize its performance, and verifies the result through simulation experiments.
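
As a point of reference for the baseline described above, the following is a minimal sketch of standard TF-IDF keyword ranking, together with the textbook definitions of information entropy and relative entropy. It is not the improved algorithm proposed in the paper, which the abstract does not specify. The function names, the whitespace tokenizer, and the unsmoothed logarithmic IDF are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by a standard TF-IDF score.

    Assumption: whitespace tokenization and idf = log(N / df), no smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    target = tokenized[doc_index]
    counts = Counter(target)
    scores = {}
    for word, count in counts.items():
        tf = count / len(target)            # term frequency in this document
        idf = math.log(n_docs / df[word])   # inverse document frequency
        scores[word] = tf * idf

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def entropy(probs):
    """Shannon information entropy H(P) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) D(P || Q) in bits.

    Assumption: q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]
    print(tf_idf_keywords(docs, 0))
    print(entropy([0.5, 0.5]))
    print(relative_entropy([0.5, 0.5], [0.5, 0.5]))
```

In this toy corpus a word that appears in every document, such as "the", receives a score of zero, while a word unique to one document ranks highest. How entropy and relative entropy enter the paper's weighting is left to the full text.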

Full Text