Deep Feature-Based Text Clustering and its Explanation

Renchu Guan, Yanchun Liang, Xiaoyue Feng, Fausto Giunchiglia, Lan Huang, Hao Zhang

doi:10.1109/tkde.2020.3028943

Open Access

https://doi.org/10.1109/tkde.2020.3028943

Abstract

Text clustering is a critical step in text data analysis and has been extensively studied by the text mining community. Most existing text clustering algorithms are based on the bag-of-words model, which suffers from high dimensionality and sparsity and ignores the structural and sequential information in texts. Deep learning-based models such as convolutional neural networks and recurrent neural networks treat texts as sequences but lack supervised signals and explainable results. In this paper, we propose a deep feature-based text clustering (DFTC) framework that incorporates pretrained text encoders into text clustering tasks. This model, which is based on sequence representations, breaks the dependency on supervision. The experimental results show that our model outperforms classic text clustering algorithms and the state-of-the-art pretrained language model, i.e., BERT, on almost all the considered datasets. In addition, the explanation of the clustering results is significant for understanding the principles of the deep learning approach. Our proposed clustering framework includes an explanation module that can help users understand the meaning and quality of the clustering results.
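The core idea in the abstract, feeding features from a pretrained text encoder into a clustering step, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which clustering algorithm or normalization DFTC uses, so plain k-means and L2 normalization are assumptions made here for illustration. The names `cluster_texts`, `kmeans`, and `toy_encode` are hypothetical, and `toy_encode` is a keyword counter that only stands in for a real pretrained encoder.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 50, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster index per point.

    Assumption: k-means is used only as a simple example of a clustering
    step. The source abstract does not name the algorithm.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(
    texts: Sequence[str],
    encode: Callable[[str], Sequence[float]],
    k: int,
    seed: int = 0,
) -> List[int]:
    """Encode each text into a feature vector, then cluster the vectors.

    `encode` is a placeholder for a pretrained text encoder; any callable
    that maps a string to a fixed-length vector can be plugged in.
    """
    features = [l2_normalize(encode(t)) for t in texts]
    return kmeans(features, k, seed=seed)


def toy_encode(text: str) -> Vector:
    """Hypothetical stand-in for an encoder. NOT a real deep model."""
    words = text.lower().split()
    return [
        float(sum(w in {"cat", "dog"} for w in words)),
        float(sum(w in {"stock", "market"} for w in words)),
    ]


if __name__ == "__main__":
    docs = [
        "the cat chased the dog",
        "a dog and a cat",
        "stock market falls",
        "the market and stock prices",
    ]
    print(cluster_texts(docs, toy_encode, k=2))
```

Passing the encoder as a callable keeps the clustering step independent of the model that produces the features, so a real pretrained encoder could replace `toy_encode` without changing the rest of the sketch.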

Highlights

Clustering models attempt to classify objects based on their similarity in a valid representation
We show that our deep feature-based text clustering (DFTC) framework outperforms classic text clustering algorithms and state-of-the-art pretrained language models on the considered datasets
We propose a deep feature-based text clustering (DFTC) framework that integrates sequence information and natural language inference semantics

Summary

Introduction

Clustering models attempt to classify objects based on their similarity in a valid representation. Guan et al. [4] proposed a similarity metric for text clustering that captures the structural information of texts, and Song et al. [5] applied a concept knowledge base to extend text features and enhance the semantics of the representation. These approaches still rely on feature space models and cannot solve the problem of poor semantic understanding.

Objectives

Methods

Results

Conclusion