Abstract
Clustering documents into coherent categories is an important step in document processing and understanding. Introducing fuzzy set theory into clustering provides a natural mechanism for capturing overlap among document clusters. A document dataset is commonly represented as a collection of high-dimensional vectors, which may not fit entirely into memory when the dataset is large and of very high dimensionality. However, most existing fuzzy clustering approaches deal with small, static datasets. Some of them scale well, but they are effective only for low-dimensional data. This paper presents new work on fuzzy clustering of large-scale, high-dimensional data that is especially suitable for document categorization. To account for both large scale and high dimensionality in the problem formulation, our key idea is to incorporate document-tailored fuzzy clustering into a scheme that is effective for large-scale problems. We first identify three representative schemes in fuzzy clustering for handling large-scale data, namely sampling extension, single pass, and divide ensemble. The limitations of fuzzy C-means (FCM)-based approaches for large-scale document clustering are then investigated. Based on this study, we propose new approaches that combine each of hyperspherical FCM and fuzzy coclustering with each of the three scale-up schemes. This enables the new approaches to remain effective for high-dimensional data while scaling to larger datasets. Extensive experiments on real-world large document datasets have been conducted, and the results demonstrate that the proposed approaches consistently perform better than existing ones in document categorization.
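As a rough illustration of the kind of document-tailored fuzzy clustering named above, the following is a minimal sketch of a hyperspherical FCM loop using cosine dissimilarity on unit-length document vectors. It is not the paper's implementation: the function names, the fuzzifier default `m=2.0`, the farthest-first initialization, and the fixed iteration count are all assumptions made for this example, and none of the three scale-up schemes is included.

```python
import numpy as np

def init_centroids(X, n_clusters):
    # Assumed deterministic farthest-first start: take document 0, then
    # repeatedly add the document least similar to the centroids chosen so far.
    chosen = [0]
    for _ in range(1, n_clusters):
        sim = (X @ X[chosen].T).max(axis=1)  # highest cosine to any chosen centroid
        chosen.append(int(sim.argmin()))
    return X[chosen].copy()

def memberships(X, V, m, eps):
    # Cosine dissimilarity between unit-length documents and centroids, shape (n, c).
    D = np.maximum(1.0 - X @ V.T, eps)
    # Dividing by the row minimum leaves the memberships unchanged
    # but keeps the power below from overflowing when m is close to 1.
    W = (D / D.min(axis=1, keepdims=True)) ** (-1.0 / (m - 1.0))
    return W / W.sum(axis=1, keepdims=True)

def hyperspherical_fcm(X, n_clusters, m=2.0, n_iter=50, eps=1e-12):
    X = np.asarray(X, dtype=float)
    # Project every document vector onto the unit hypersphere.
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)
    V = init_centroids(X, n_clusters)
    for _ in range(n_iter):
        U = memberships(X, V, m, eps)
        # Membership-weighted sum of documents, renormalized to unit length.
        V = (U ** m).T @ X
        V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), eps)
    return memberships(X, V, m, eps), V
```

Calling `hyperspherical_fcm(X, n_clusters=k)` on a dense term-weight matrix `X` (one row per document) returns a membership matrix whose rows sum to one and a matrix of unit-length centroids. A document that belongs to several topics shows up as a row with more than one sizeable membership value, which is the overlap the abstract refers to.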