Abstract
This study addresses the clustering of high-dimensional text data. K-means handles high-dimensional data poorly, requires the number of clusters to be specified in advance, and selects its initial centers at random. We propose a Stacked-Random Projection (SRP) dimensionality reduction framework and an enhanced K-means algorithm, DPC-K-means, based on an improved density peaks algorithm. The improved density peaks algorithm determines both the number of clusters and the initial cluster centers for K-means. The proposed algorithm is validated on seven text datasets. Experimental results indicate that, by correcting these defects of K-means, the algorithm is well suited to clustering text data.
Highlights
Clustering is the main technique used for unsupervised information extraction
This study proposes a Stacked-Random Projection (SRP) dimensionality reduction framework based on deep networks and an improved K-means text clustering algorithm based on density peaks (DPC-K-means)
SRP, the improved DPC, and DPC-K-means were validated on different datasets
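The text above does not describe the SRP architecture, so the following is only a minimal sketch of the general idea of chaining random projections, not the authors' framework. It assumes Gaussian projection matrices and purely linear layers; the function names and the `layer_dims` parameter are hypothetical, and any nonlinearity or training step that the deep-network-based SRP may use is omitted.

```python
import math
import random


def random_projection_matrix(d_in, d_out, rng):
    """Gaussian random matrix (d_in x d_out), scaled by 1/sqrt(d_out)
    so that squared norms are preserved in expectation."""
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_out)]
            for _ in range(d_in)]


def project(X, R):
    """Multiply each row vector of X by the projection matrix R."""
    d_out = len(R[0])
    return [[sum(x[i] * R[i][j] for i in range(len(x))) for j in range(d_out)]
            for x in X]


def stacked_random_projection(X, layer_dims, seed=0):
    """Apply one random projection per entry of layer_dims, in order.

    Assumption: layers are linear and untrained (hypothetical simplification).
    """
    rng = random.Random(seed)
    Z = X
    for d_out in layer_dims:
        R = random_projection_matrix(len(Z[0]), d_out, rng)
        Z = project(Z, R)
    return Z
```

For example, `stacked_random_projection(X, [64, 16])` would reduce each row of `X` first to 64 and then to 16 dimensions. Because every layer in this sketch is linear, the stack is mathematically equivalent to a single linear map; it only illustrates the stepwise reduction.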
Summary
Clustering is the main technique used for unsupervised information extraction. The aim of clustering is to divide an unlabelled dataset into multiple non-overlapping clusters so that data points within a cluster are as similar as possible and data points in different clusters are as different as possible. Several studies [4,5,6,7,8,9] have proposed improvements to text clustering algorithms, and others [10, 11] have proposed improvements to the K-means algorithm. Improper selection of the initial centers can trap K-means in a local optimum and lead to an inaccurate clustering result. To address these problems, we propose an enhanced K-means text clustering algorithm based on the clustering by fast search and find of density peaks (DPC) algorithm [12]. The Conclusions section concludes the paper and highlights future work related to the study.
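The idea of letting density peaks supply both the number of clusters and the initial centers can be sketched as follows. This is a minimal illustration based on the standard DPC quantities (local density rho, distance delta to the nearest denser point, and their product gamma), not the improved DPC of this study. The Gaussian density kernel, the cutoff distance `d_c`, and the `gamma_ratio` threshold for picking peaks are assumptions made for the example, as are all function names.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_peak_centers(X, d_c, gamma_ratio=0.5):
    """Return points whose gamma = rho * delta is at least
    gamma_ratio * max(gamma). The threshold rule is a hypothetical choice."""
    n = len(X)
    D = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Local density: Gaussian-kernel sum over all other points.
    rho = [sum(math.exp(-(D[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    d_max = max(max(row) for row in D)
    delta = []
    for i in range(n):
        # Distance to the nearest point of strictly higher density;
        # the densest point gets the largest pairwise distance instead.
        higher = [D[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else d_max)
    gamma = [r * d for r, d in zip(rho, delta)]
    threshold = gamma_ratio * max(gamma)
    return [X[i] for i in range(n) if gamma[i] >= threshold]


def kmeans(X, centers, max_iter=100):
    """Lloyd's K-means iterations started from the given centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda k: dist(x, centers[k])) for x in X]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

A call such as `labels, centers = kmeans(X, density_peak_centers(X, d_c=0.5))` then needs neither a preset number of clusters nor a random initialization. The pairwise distance matrix makes this sketch quadratic in the number of points, so it is suitable only for small inputs.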