Abstract

Over the years, various document-clustering techniques have been developed to group textual data. The performance of document-clustering systems relies heavily on an effective text representation. The vector space model is the technique most widely used by existing clustering algorithms to present text in a structured form; however, such representations suffer from a lack of semantic associations, high dimensionality, and sparsity. To enrich the document representation while retaining semantic and morphological associations, this paper introduces a word cluster-based modified term frequency-inverse document frequency (WC_MTI) model, in which semantically associated word embeddings from word2vec are supplemented with morphological information using kernel principal component analysis. In addition, to address the high-dimensionality and sparsity issues and improve clustering, we use a self-training technique that learns discriminative features with the WC_MTI model and an autoencoder (AE), then updates the encoder network weights using the assignments from a clustering algorithm as supervision. The proposed model organizes documents into topically compatible clusters by maintaining the semantic and morphological similarity between terms using skip-gram with negative sampling (SGNS) and low-dimensional vector representations. We evaluate the proposed approach against existing text representation methods. Experimental findings suggest that the proposed approach improved the average accuracy by 89.62%.
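As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a TF-IDF representation of a toy corpus, projects it to a low-dimensional space with kernel PCA, and clusters the result with k-means. This is only a minimal stand-in: the paper's word2vec (SGNS) embeddings, WC_MTI weighting, and autoencoder self-training stage are omitted, and all names and parameters here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: TF-IDF -> kernel PCA -> k-means clustering.
# NOT the paper's WC_MTI/AE pipeline; a simplified illustrative analogue.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

# Tiny hypothetical corpus with two topics (sports vs. cooking).
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the football players",
    "simmer the sauce with garlic and basil",
    "bake the bread until golden brown",
    "season the soup with salt and pepper",
]

# High-dimensional, sparse vector-space representation.
tfidf = TfidfVectorizer().fit_transform(docs)

# Nonlinear dimensionality reduction (stand-in for the paper's
# kernel-PCA-based enrichment step); KernelPCA needs a dense array.
low_dim = KernelPCA(n_components=2, kernel="rbf").fit_transform(tfidf.toarray())

# Cluster the low-dimensional document vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(low_dim)
print(labels)
```

In the paper's full method, the cluster assignments produced at this step would additionally serve as pseudo-labels to fine-tune the autoencoder's encoder in a self-training loop.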