Abstract

While microblogging is developing rapidly in China, microblog messages are flooded with a large amount of repetitive information. Among existing similarity-computation algorithms, simhash offers comparatively good precision and efficiency. In this paper, a deep optimization of traditional simhash is proposed for the actual microblog setting, consisting of a segmentation optimization algorithm (Combined-Analyzer) and a weight optimization algorithm (FFBOT-FID). To a certain extent, Combined-Analyzer addresses the large number of internet words that appear in short microblog texts, and FFBOT-FID addresses the difficulty of calculating term weights caused by short text and timeliness. Experiments on microblog de-duplication show that the optimization achieves higher precision and recall than the traditional segmentation algorithm.
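
For readers unfamiliar with the baseline, the following is a minimal sketch of traditional simhash, not the optimized method proposed in the paper. It assumes the text has already been segmented into tokens, uses raw term frequency as the default weight, and derives per-token hashes from MD5; these choices, along with the names `simhash` and `hamming_distance`, are illustrative assumptions. Combined-Analyzer and FFBOT-FID are not reproduced here, since the abstract does not specify them; in this sketch they would correspond to replacing the tokenizer and the `weights` argument, respectively.

```python
import hashlib
from collections import Counter


def simhash(tokens, weights=None, bits=64):
    """Return a `bits`-wide simhash fingerprint for a list of tokens.

    `bits` must be a multiple of 8 and at most 128 (the MD5 digest size).
    """
    if weights is None:
        # Assumption: plain term frequency stands in for the weighting scheme.
        weights = Counter(tokens)
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +weight on bits set in its hash, -weight otherwise.
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # A fingerprint bit is 1 where the weighted vote is strictly positive.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two messages would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold is a tuning parameter that the abstract does not state.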
