Abstract

In text mining, popular methods use the bag-of-words model, which represents a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to certain domains. This paper proposes a new similarity measure based on the suffix tree model of text documents. It analyzes word sequence information and then computes the similarity between the documents of a corpus by applying a suffix tree similarity combined with the TF-IDF weighting method. Experimental results on the standard document benchmark corpora Reuters and BBC indicate that the new text similarity measure is effective. Compared with two other methods based on frequent word sequences, the proposed method improves the average F-measure score by about 15%.
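To make the idea concrete, here is a minimal sketch of a phrase-based similarity with TF-IDF weighting. It is not the paper's implementation. As a simplifying assumption, it does not build a suffix tree. Instead it enumerates word sequences of up to `max_len` words, which correspond to the nodes of a depth-limited word-level suffix trie. The function names (`phrases`, `tfidf_vectors`, `cosine`), the `max_len` parameter, and the particular TF-IDF variant (log-scaled term frequency times log inverse document frequency) are all illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def phrases(text, max_len=3):
    """Count word sequences of length 1..max_len in a document.

    Assumption: each such sequence stands in for a node of a
    depth-limited word-level suffix trie of the document.
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length <= len(words):
                counts[tuple(words[start:start + length])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {phrase: weight} vector."""
    counts = [phrases(doc, max_len) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each phrase.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    vectors = []
    for c in counts:
        vectors.append({
            p: (1.0 + math.log(tf)) * math.log(n_docs / df[p])
            for p, tf in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse phrase vectors."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "the cat ate cheese",
        "the cat ate cheese today",
        "cheese ate the cat",
        "stock markets fell sharply",
    ]
    vecs = tfidf_vectors(docs)
    for i in range(1, len(docs)):
        print(i, round(cosine(vecs[0], vecs[i]), 3))
```

In this toy corpus the first and third documents contain exactly the same words, so a pure bag-of-words vector cannot tell them apart. The phrase-based vectors score the first document closer to the second, which preserves the word order, than to the third, which scrambles it.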
