Significance of Low-Frequent Words in Concept Describing Document

Yuki Okumura, Sachio Hirokawa, Kazuhiro Takeuchi

doi:10.1109/iiai-aai.2019.00214

Abstract

In applications of information retrieval, text mining, and natural language processing, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how significant a word is to a document in a collection. The value of tf-idf increases in proportion to the number of times a word occurs in the document and is offset by the number of documents in the corpus that contain the word, which compensates for the fact that some words appear more frequently in general. By design, therefore, a word receives a higher tf-idf value in a document when it occurs frequently in that document. As a consequence, document classification using tf-idf largely overlooks the role of infrequent words. In this paper, we focus on words that appear infrequently in a document. Specifically, we examine the features that characterize document sets describing specific knowledge, using an SVM (Support Vector Machine) based feature extraction method. As a result, we confirmed that words appearing only once in some of the documents that describe specific knowledge contribute to distinguishing those documents from documents that describe general knowledge.
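The weighting described in the abstract can be illustrated with a minimal sketch. It assumes one common variant of tf-idf (raw count as term frequency, log(N / df) as inverse document frequency); the paper's exact formulation is not stated here, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = raw count in the document,
    idf = log(N / df), where df is the number of documents
    that contain the word. Other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: count each word once per document.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return weights


# Toy corpus (hypothetical, for illustration only).
docs = [
    "the kernel maps the data into the feature space".split(),
    "the data is stored in the table".split(),
    "the table lists the results".split(),
]
weights = tf_idf(docs)
```

Under this variant, a word that occurs in every document ("the") gets weight zero, and a word that occurs once in a single document ("kernel") gets exactly its idf value. Because the weight scales with the count, a word that occurs only once can never exceed its idf, which is the sense in which tf-idf favors frequent words over infrequent ones.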
