Abstract
Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with results that are competitive with conventional algorithms, these techniques generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster-data point relations, with many entries left unknown. The paper presents an analysis suggesting that this problem degrades the quality of the clustering result. It also presents BSA (Bootstrap Aggregation), a machine learning ensemble meta-algorithm designed to improve stability and accuracy, along with a new link-based approach that improves the conventional matrix by discovering unknown entries through the similarity between clusters in an ensemble. In particular, an efficient BSA and link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
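To make the ensemble-information matrix concrete, the following is a minimal sketch of how a binary cluster-association matrix might be built from a set of base clusterings. The function name `binary_association_matrix` and the toy labels are assumptions made for illustration and do not come from the paper. Each row is a data point and each column is one cluster from one base clustering. An entry is 1 only when the point was assigned to that cluster, and every other entry stays 0. Those zero entries correspond to the unknown relations that the link-based approach is described as refining.

```python
def binary_association_matrix(partitions):
    """Build a binary point-by-cluster matrix from several base clusterings.

    partitions: list of label lists, one per base clustering, all of equal length.
    Returns (columns, matrix), where columns[j] = (clustering index, cluster label)
    and matrix[i][j] is 1 if point i belongs to that cluster, else 0.
    """
    n_points = len(partitions[0])
    columns = []
    for m, labels in enumerate(partitions):
        if len(labels) != n_points:
            raise ValueError("all base clusterings must label the same points")
        # One column per distinct cluster of this base clustering.
        for cluster in sorted(set(labels)):
            columns.append((m, cluster))
    matrix = [
        [1 if partitions[m][i] == cluster else 0 for (m, cluster) in columns]
        for i in range(n_points)
    ]
    return columns, matrix


# Toy ensemble (hypothetical): two base clusterings of four data points.
partitions = [
    [0, 0, 1, 1],
    [0, 1, 1, 2],
]
columns, matrix = binary_association_matrix(partitions)
for row in matrix:
    print(row)
```

Every row contains exactly one 1 per base clustering, so most entries are 0. This sparsity is the incompleteness that the abstract refers to.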
Highlights
Bootstrap aggregating is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression
The introduced method creates an ensemble by applying a conventional clustering algorithm (e.g., k-modes [8] and COOLCAT [17]) to different data partitions, each of which consists of a unique subset of data attributes
The experiments set out to investigate the performance of the link-based cluster ensemble (LCE) approach compared with a number of clustering algorithms, both those developed for categorical data analysis and state-of-the-art cluster ensemble techniques found in the literature
Summary
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. In bagging, m models are fitted on m bootstrap samples of the training data and combined by averaging their outputs (for regression) or by voting (for classification). Many well-established clustering algorithms, such as k-means [1] and PAM [2], have been designed for numerical data, whose inherent properties can be naturally employed to measure a distance (e.g., Euclidean) between feature vectors [3], [4]. These algorithms cannot be directly applied to the clustering of categorical data, where domain values are discrete and have no defined ordering. The initial method was developed in [6] by making use of Gower’s similarity coefficient [7].
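The bagging procedure described in this summary can be sketched as follows. This is a minimal, generic illustration and not the paper's BSA implementation. The helper names (`bootstrap_sample`, `bagging_fit`, `fit_mean`, `fit_majority`) and the toy base learners are assumptions made for the example. Each of the m models is fitted on a sample drawn with replacement from the training data, and the predictions are then averaged for regression or combined by majority vote for classification.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, fit, m, seed=0):
    """Fit m models, each on its own bootstrap sample of data."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(m)]


def bagging_predict_regression(models, x):
    """Combine the models by averaging their numeric outputs."""
    return mean(model(x) for model in models)


def bagging_predict_classification(models, x):
    """Combine the models by majority vote over their predicted labels."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy base learners (assumed for illustration): each ignores x and
# predicts the mean target or the majority label of its sample.
def fit_mean(sample):
    avg = mean(y for _, y in sample)
    return lambda x: avg


def fit_majority(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label


regression_data = [(0, 1.0), (1, 2.0), (2, 3.0)]
regressors = bagging_fit(regression_data, fit_mean, m=5)
print(bagging_predict_regression(regressors, x=1))

classification_data = [(0, "a"), (1, "a"), (2, "a")]
classifiers = bagging_fit(classification_data, fit_majority, m=5)
print(bagging_predict_classification(classifiers, x=1))
```

Any fitting function that returns a callable model could replace the toy learners. The combination step depends only on whether the outputs are numeric or categorical.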
More From: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY