Contrastive author-aware text clustering

Xudong Tang,Chao Dong,Wei Zhang

doi:10.1016/j.patcog.2022.108787

Abstract

In the era of User Generated Content (UGC), authors (IDs) of texts widely exist and play a key role in determining the topic categories of texts. Existing text clustering efforts are mainly attributed to utilizing textual information, but the effect of authors on text clustering remains largely underexplored. To mitigate this issue, we propose a novel Contrastive Author-aware Text clustering approach, dubbed as CAT. CAT injects author information not only in characterizing texts through representations but also in pushing or pulling text representations of different authors through contrastive learning, which is rarely adopted by text clustering. Specifically, the developed contrastive learning method conducts both cluster-instance contrast by the text representation augmentation and instance-instance contrast by the multi-view representations. We perform comprehensive experiments on three public datasets, demonstrating that CAT largely outperforms strong competitive text clustering baselines and validating the effectiveness of the CAT’s main components.

Full Text