Abstract

With the rapid development of Internet technology, the influence of online consensus continues to expand. Quickly and effectively discovering sensitive topics and keeping track of them has become an important research problem. Text clustering aggregates news texts with the same or similar content and can therefore discover topics automatically. The main research direction is to improve clustering algorithms for different media types. Although the existing typical clustering algorithms have certain advantages, they all face constraints on data size and data characteristics in specific applications, and no existing algorithm fully adapts to these characteristics. The Single-pass algorithm, which is widely applied in the topic detection and tracking (TDT) field, can discover and track topics, but it suffers from poor accuracy and slow speed on massive data. Based on the dynamic evolution characteristics of online consensus, this paper proposes an incremental text clustering algorithm based on Single-pass that improves the clustering accuracy and efficiency for massive news. Using real online news texts from an online consensus analysis system, we conduct an experiment to verify the feasibility and effectiveness of the proposed algorithm. The results show that the new algorithm is much more efficient than the original Single-pass clustering algorithm. In real application, the new incremental text clustering algorithm basically meets the real-time demand of online topic detection and has practical value.

Highlights

  • Because text clustering for topic detection must balance clustering quality and efficiency under actual requirements, this paper proposes an incremental text clustering algorithm based on Simhash

  • Based on the experimental results and the effect in actual application, we set the value of K to 3: before each batch of texts is incrementally clustered, the text clusters that were active within the last three days are selected to initialize the text cluster set

  • The experimental results indicate that the news texts of most topics are clearly released within a concentrated period, which shows that introducing a time window parameter based on news timeliness is a reasonable and feasible improvement
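
The time-window idea in the highlights can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Cluster`, `active_clusters`, and `single_pass_batch`, the Jaccard similarity on token sets, the maximum-link comparison, and the threshold of 0.5 are all assumptions made for illustration. Only the window of K = 3 days comes from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Cluster:
    # Each member is a (doc_id, token_set) pair.
    members: list = field(default_factory=list)
    last_update: datetime = datetime.min


def jaccard(a, b):
    """Similarity of two token sets in [0, 1] (an assumed stand-in measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def active_clusters(clusters, now, k_days=3):
    """Keep only clusters updated within the last k_days (the time window)."""
    cutoff = now - timedelta(days=k_days)
    return [c for c in clusters if c.last_update >= cutoff]


def single_pass_batch(batch, clusters, now, k_days=3, threshold=0.5):
    """Assign each (doc_id, tokens) to the most similar active cluster,
    or open a new cluster when no similarity reaches the threshold."""
    active = active_clusters(clusters, now, k_days)
    for doc_id, tokens in batch:
        best, best_sim = None, 0.0
        for c in active:
            # Maximum-link strategy (assumed): compare with the closest member.
            sim = max((jaccard(tokens, m) for _, m in c.members), default=0.0)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            target = best
        else:
            target = Cluster()
            clusters.append(target)
            active.append(target)
        target.members.append((doc_id, tokens))
        target.last_update = now
    return clusters
```

Because only recently active clusters are compared, the cost of each batch depends on the number of clusters inside the window rather than on the full history. The trade-off is that a text belonging to a topic that fell out of the window starts a new cluster instead of joining the old one.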

Summary

Related work

Topic detection mainly relies on incremental clustering algorithms. Because an incremental text clustering algorithm can reuse the previous clustering result and avoid re-clustering the whole text collection, it greatly improves clustering efficiency and makes it possible to satisfy the real-time requirement of the topic detection process. The preprocessing reduces the scale and complexity of similarity calculation and is easy to apply to large-scale text clustering analysis in actual application scenarios. Timeliness ensures the practical value of topic detection, which requires the algorithm to be simple and efficient. The news data in topic detection also increases all the time. These problems and difficulties arise because the topic detection process faces ever-growing massive news data together with the objective requirement of real-time processing. The research priority of this paper is therefore to improve the efficiency of the text clustering process as far as possible.
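
To make the point about reducing the cost of similarity calculation concrete, the following is a minimal sketch of a generic Simhash fingerprint compared by Hamming distance. It is an illustration under assumptions, not the paper's feature weighting: the function names `simhash` and `hamming`, the use of raw term frequency as the weight, the MD5-derived token hash, and the 64-bit width are all chosen here for the example.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Fold a weighted bag of tokens into a single `bits`-wide fingerprint."""
    weights = Counter(tokens)  # assumed weight: raw term frequency
    totals = [0] * bits
    for token, w in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +w for a set bit and -w for a clear bit.
            totals[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Comparing two texts then costs one XOR and a bit count on fixed-width integers, instead of a dot product over high-dimensional term vectors. Texts that share most of their weighted terms tend to produce fingerprints with a small Hamming distance.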

The Whole Scheme Design
Text Vectorization Based on News Text Feature
Dimensionality Reduction Based on Simhash
Improved Single-Pass Algorithm Based on News Timeliness
Statistic
Time Complexity Analysis
Assessment of Clustering Algorithms
The Impact of Different Link Strategy on Clustering Result
Realistic Application
Conclusion
