Abstract

Clustering short texts is one of the most important text analysis methods for extracting knowledge from online social media platforms such as Twitter, Facebook, and Weibo. However, the instant features of short texts (such as abbreviations and informal expressions) and their limited length make the clustering task challenging. Fortunately, short texts about the same topic often share common terms (or term stems) that can effectively represent the topic (i.e., a topic supported by a cluster of short texts); we call these topic representative terms. With topic representative terms, clustering becomes much easier: each short text is assigned to the most similar group of topic representative terms. This paper proposes a novel topic representative term discovery (TRTD) method for short text clustering. TRTD discovers groups of closely bound topic representative terms by exploiting the closeness and significance of terms. Closeness is measured by the terms' interdependent co-occurrence, and significance is measured by their global occurrences across the whole short text corpus. Experimental results on real-world datasets demonstrate that TRTD achieves better accuracy and efficiency in short text clustering than state-of-the-art methods.
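
To make the idea concrete, the following is a minimal Python sketch of the general approach described above; it is not the TRTD algorithm itself. Everything specific in it is an assumption made for illustration: significance is taken to be a term's document frequency, closeness is taken to be the smaller of the two conditional co-occurrence ratios, groups are formed as connected components of sufficiently close terms, and the names `term_groups`, `assign`, `min_df`, and `min_closeness` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_groups(docs, min_df=2, min_closeness=0.5):
    """Group significant terms that tend to co-occur (illustrative only)."""
    doc_terms = [set(d.lower().split()) for d in docs]

    # Assumed significance: number of texts in which a term occurs.
    df = Counter(t for terms in doc_terms for t in terms)

    # Co-occurrence counts for pairs of significant terms.
    co = Counter()
    for terms in doc_terms:
        kept = sorted(t for t in terms if df[t] >= min_df)
        co.update(combinations(kept, 2))

    # Union-find over significant terms.
    parent = {t: t for t in df if df[t] >= min_df}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for (a, b), n in co.items():
        # Assumed closeness: both terms must frequently imply each other.
        closeness = min(n / df[a], n / df[b])
        if closeness >= min_closeness:
            parent[find(a)] = find(b)

    groups = {}
    for t in parent:
        groups.setdefault(find(t), set()).add(t)
    # Keep only groups with at least two terms.
    return [g for g in groups.values() if len(g) > 1]


def assign(docs, groups):
    """Label each text with the index of the best-overlapping group, or -1."""
    labels = []
    for d in docs:
        terms = set(d.lower().split())
        overlaps = [len(terms & g) for g in groups]
        best = max(range(len(groups)), key=overlaps.__getitem__, default=-1)
        labels.append(best if best >= 0 and overlaps[best] > 0 else -1)
    return labels


docs = [
    "apple iphone launch event",
    "new apple iphone review",
    "apple iphone price drop",
    "nba finals game tonight",
    "nba finals score update",
    "watch nba finals live",
]
groups = term_groups(docs)
labels = assign(docs, groups)
print(groups, labels)
```

On this toy corpus the sketch finds two term groups, one containing "apple" and "iphone" and one containing "nba" and "finals", and assigns the first three texts to the former and the last three to the latter. A real implementation would also need stemming, stop-word removal, and the closeness and significance measures defined in the paper.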

Highlights

  • Short text documents are increasingly available with the advancement of online social media platforms such as Twitter, Facebook, and Weibo

  • Inspired by previous studies [5], [10], [18] that use word relation networks to address the difficulties of short text clustering, we propose a novel topic representative term discovery (TRTD) method, which finds significant terms that are closely bound to each other and treats them as a group of topic representative terms for short text clustering

  • Short text clustering accuracy analysis: we study the clustering accuracy of topic representative term discovery (TRTD) and the counterpart methods

Summary

Introduction

Short text documents are increasingly available with the advancement of online social media platforms such as Twitter, Facebook, and Weibo. Clustering short text documents is one of the most significant text analysis methods for extracting knowledge from the abundant text data on the internet, such as news titles and tweets. According to many researchers [4]–[6], short text clustering is more challenging than regular text clustering. The difficulty stems from the instant features (e.g., abbreviations and informal expressions) and the shortness of the texts, which bring sparsity, noise, and high dimensionality to text analytics. Short texts contain a lot of noise and provide limited contextual clues for traditional data mining techniques. Many adapted approaches have been proposed for short text clustering in recent years.
