Abstract

In information retrieval, text mining, and natural language processing, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how significant a word is to a document in a collection. The tf-idf value increases in proportion to the number of times a word occurs in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. tf-idf is therefore designed to assign a higher weight to a word the more frequently it occurs in a given document. In other words, document classification based on tf-idf pays little attention to the role of infrequent words. In this paper, we focus on words that appear infrequently in a document. Specifically, we use an SVM (Support Vector Machine) based feature extraction method to examine the features that characterize sets of documents describing specific knowledge. As a result, we confirmed that words that appear only once in a document belong to documents describing specific knowledge and contribute to distinguishing them from documents that describe general knowledge.
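To make the weighting described above concrete, the following is a minimal sketch of one common tf-idf variant. The abstract does not state which variant the authors use, so the choices here are assumptions: tf is the raw count of a word in a document and idf is log(N / df). The function names `tf_idf` and `words_occurring_once` and the toy documents are hypothetical and serve only as illustration. The sketch does not reproduce the SVM-based feature extraction used in the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = raw count in the document, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return weights


def words_occurring_once(doc):
    """Return the words that occur exactly once within a single document."""
    counts = Counter(doc)
    return sorted(w for w, c in counts.items() if c == 1)


# Toy corpus (illustrative only, not data from the paper).
docs = [
    "the kernel kernel kernel maps the margin".split(),
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
weights = tf_idf(docs)
once = words_occurring_once(docs[0])
```

In this toy corpus, "kernel", "maps", and "margin" each occur in only one document and so share the same idf, but "kernel" receives three times the weight because it occurs three times. Words that occur once, the ones this paper focuses on, are down-weighted by construction under this variant.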
