Abstract

Text document clustering is the task of grouping a large collection of textual documents so that contextually similar documents fall into the same cluster. This paper surveys existing methodologies for clustering large text documents. Word-embedding techniques began with bag-of-words models and have since progressed to transformer-based architectures. This work finds that transformer-based models can generate contextual text embeddings that incorporate both the position and the context of a word in a sentence. Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art approach for producing such contextual word embeddings, which have been used for clustering. This research introduces a novel text clustering method based on BERT's contextual word embeddings. The paper also presents a novel cluster-to-domain (class label) mapping algorithm to evaluate the performance of the clustering algorithms against the documents' actual domains (class labels). Several clustering algorithms, including K-Means, K-Medoids, and DBSCAN, are applied over the BERT embeddings to cluster large text documents by contextual meaning. Two cluster-quality metrics measure the algorithms' performance: the Dunn index and the Silhouette score. K-Means delivers better results on these two quality metrics, whereas DBSCAN maps documents to the appropriate domain more accurately. The paper demonstrates the viability of text clustering using transformer-based embeddings.
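The following is a minimal sketch of the pipeline the abstract describes: embed documents with a BERT-style encoder, cluster the embeddings, score cluster quality, and map clusters back to domains. It assumes the `sentence-transformers` and `scikit-learn` libraries; the model name, the sample documents, the K-Means and DBSCAN parameter values, and the majority-vote mapping step are all illustrative assumptions, not the paper's exact configuration (in particular, the majority-vote mapping is a common baseline standing in for the paper's own novel mapping algorithm).

```python
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Toy corpus with two underlying domains (illustrative data only).
documents = [
    "The central bank raised interest rates again this quarter.",
    "Investors reacted to the inflation report with caution.",
    "The team clinched the championship in overtime.",
    "A last-minute goal decided the final match.",
]
true_domains = ["finance", "finance", "sports", "sports"]

# 1. Produce contextual embeddings with a transformer encoder
#    (assumed model choice; any BERT-based encoder works similarly).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)

# 2. Cluster the embeddings. K-Means shown; DBSCAN below is analogous.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)

# 3. Evaluate cluster quality with the Silhouette score.
print("Silhouette score:", silhouette_score(embeddings, labels))

# 4. Map each cluster to a domain by majority vote over its members
#    (a baseline stand-in for the paper's cluster-to-domain algorithm).
mapping = {}
for c in set(labels):
    members = [d for d, l in zip(true_domains, labels) if l == c]
    mapping[c] = Counter(members).most_common(1)[0][0]
print("Cluster-to-domain mapping:", mapping)

# DBSCAN variant: eps and min_samples are guesses that need tuning;
# points labeled -1 are treated as noise rather than assigned a cluster.
db_labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(embeddings)
```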
