Abstract
Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. There are two approaches to document clustering: clustering on document metadata and clustering on document content. Most previous content-based approaches focus on document summaries (of a single file or of multiple files) and on word-vector analysis of documents, identifying a small number of important keywords with which to cluster the documents. In this study, we categorize hot commodities on the web and then name the categories, based on the web text (abstracts) of these commodities and their access counts. First, we parse the Chinese web text of the documents for hot commodities and apply a hierarchical agglomerative clustering algorithm (Ward's method) to group words into themes by their properties and to decide the number of themes. Second, we adopt the Cross Collection Mixture Model, which is used in Temporal Text Mining, together with the access counts (the degree to which users identify with the words) to collect dynamic themes, and then gather stable words by probability distribution to serve as the vectors for document clustering. Third, we estimate the model parameters with the Expectation Maximization (EM) algorithm. Finally, we apply K-means, using the extracted dynamic themes as the features for document clustering. This study proposes a novel approach to document clustering, and a series of experiments shows that the algorithm is effective and can improve the accuracy of clustering results.
Keywords: Documents Clustering, Temporal Text Mining, Extracting Theme
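The final step of the pipeline, K-means over theme-based document vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the document vectors below are made-up theme weights, the function names `farthest_point_init` and `kmeans` are hypothetical, and the farthest-point initialization is a simplification chosen only to keep the toy example deterministic. The earlier steps (Ward clustering of words, the Cross Collection Mixture Model, and EM estimation) are assumed to have already produced one weight per theme for each document.

```python
import math

def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from all centroids chosen so far (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iters=100):
    """Plain Lloyd-style K-means on lists of floats."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical theme-weight vectors: one row per document, one column per theme.
doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.9, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
]

labels, centroids = kmeans(doc_vectors, k=3)
for doc_id, label in enumerate(labels):
    print(f"document {doc_id} -> cluster {label}")
```

In this toy setting each document ends up grouped with the other document that shares its dominant theme. The cluster numbers themselves are arbitrary, so only the grouping is meaningful.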