Abstract

In data mining and knowledge discovery, clustering is one of the most important techniques for discovering salient structures in data. This paper explores a statistical consensus method for combining the results of multiple clusterings, or partitions. We explored this idea while working with customs data from a Revenue Authority. The partitions are generated by running the k-means algorithm several times on the same data, each time with different parameter initializations or subspaces, which produces diverse clustering results. To combine them into a final clustering, our algorithm first selects a reference partition, the one with the best clustering results among the generated partitions. It then selects consistent partitions, using the mutual information between partitions as the selection criterion. Partitions whose mutual information falls below a set threshold are discarded from the ensemble. Finally, the selected partitions that form the ensemble are combined by a consensus function to produce the final clustering. Our consensus function uses the original features of the dataset together with the partition results to obtain the final clustering. Experiments show that our algorithm achieves better clustering accuracy than the classical k-means algorithm on both synthetic and real datasets.
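
The selection step can be illustrated with a minimal sketch. It assumes that each candidate partition is compared against the reference partition and that plain (unnormalized) mutual information in nats is used; the abstract does not state either detail. The function names `mutual_information` and `select_consistent` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two partitions.

    Each partition is a sequence of cluster labels, one per data point.
    """
    if len(a) != len(b):
        raise ValueError("partitions must label the same data points")
    n = len(a)
    count_a = Counter(a)          # cluster sizes in partition a
    count_b = Counter(b)          # cluster sizes in partition b
    joint = Counter(zip(a, b))    # co-occurrence counts of label pairs
    mi = 0.0
    for (label_a, label_b), n_ab in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[label_a] * count_b[label_b]))
    return mi


def select_consistent(reference, partitions, threshold):
    """Keep partitions whose mutual information with the reference is at
    least the threshold (assumption: comparison is against the reference)."""
    return [p for p in partitions if mutual_information(reference, p) >= threshold]


reference = [0, 0, 1, 1]
candidates = [
    [1, 1, 0, 0],  # same grouping as the reference, labels swapped
    [0, 1, 0, 1],  # grouping unrelated to the reference
]
kept = select_consistent(reference, candidates, threshold=0.5)
print(kept)
```

Because mutual information depends only on how points are grouped, not on the label values, the relabeled candidate is kept while the unrelated one is discarded. The threshold of 0.5 is an arbitrary value for this toy example, not one taken from the paper.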
