Abstract

Term weighting and normalization are among the most important factors affecting the quality of text representation. This paper presents two novel term-weighting schemes for representing text documents: (i) Term Frequency - Ranking of Term Frequency (TF-RTF) and (ii) Term Frequency - Ranking of fuzzy logic with semantic relationship of terms (TF-RFST). In TF-RTF, the rank of each term within a document gives the term's priority in that document, and these priorities are used for document representation. In TF-RFST, each term is represented by its own frequency together with the frequencies of its semantically related terms; the rank of each term is therefore based on the combined frequencies of the term and its semantically related terms under a specific weighting scheme. With the TF-RTF and TF-RFST weighting schemes, the proposed methods provide better clustering performance in terms of accuracy, entropy, recall and F-Measure than previously suggested methods such as word count, Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Corpus Frequency (TF-ICF), Multi Aspect TF (MATF), BM25 and BM25F. Experiments carried out on the Reuters-8, Reuters-52 and WebKB data sets with the K-means and K-means++ clustering algorithms demonstrate the effectiveness of the proposed term-weighting schemes.
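
The abstract describes the two schemes only in words and does not state their formulas, so the following Python sketch is purely illustrative. It assumes a dense ranking in which the most frequent term gets rank 1, and it assumes the weight is the term frequency divided by that rank; both choices, along with the function names `rank_terms`, `tf_rtf` and `tf_rfst` and the `related` lookup table, are hypothetical and not taken from the paper. The fuzzy-logic component of TF-RFST is not modelled here; related-term frequencies are simply summed.

```python
from collections import Counter


def rank_terms(freqs):
    """Dense-rank terms by descending frequency (highest frequency -> rank 1)."""
    distinct = sorted(set(freqs.values()), reverse=True)
    rank_of = {f: i + 1 for i, f in enumerate(distinct)}
    return {term: rank_of[f] for term, f in freqs.items()}


def tf_rtf(tokens):
    """Illustrative TF-RTF: term frequency combined with the rank of that frequency.

    ASSUMPTION: weight = tf / rank. The paper's actual combination may differ.
    """
    tf = Counter(tokens)
    ranks = rank_terms(tf)
    return {term: tf[term] / ranks[term] for term in tf}


def tf_rfst(tokens, related):
    """Illustrative TF-RFST: rank by the term's frequency plus the frequencies
    of its semantically related terms.

    ASSUMPTION: `related` maps a term to an iterable of related terms, and
    related frequencies are added without fuzzy membership weights.
    """
    tf = Counter(tokens)
    combined = {
        term: tf[term] + sum(tf.get(r, 0) for r in related.get(term, ()))
        for term in tf
    }
    ranks = rank_terms(combined)
    return {term: tf[term] / ranks[term] for term in tf}


if __name__ == "__main__":
    doc = ["oil", "oil", "oil", "price", "price", "market"]
    print(tf_rtf(doc))
    print(tf_rfst(doc, {"market": ["price", "oil"]}))
```

In this toy document, "market" occurs only once, but treating "price" and "oil" as its related terms raises its combined frequency and therefore its rank, which is the effect the abstract attributes to TF-RFST.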
