Abstract

In this paper, we propose an incremental clustering algorithm called INCLUS for high-dimensional sparse transactional data, using a newly defined similarity measure and a notion of cluster representatives based on the locally frequent items of each cluster. INCLUS seeks structures in transactional data with respect to the support and similarity constraints specified by the users. The effectiveness and the order-independence property of INCLUS are empirically studied and compared with two state-of-the-art algorithms. Though it is a one-pass algorithm without any iterative refinement, INCLUS is not only effective and scalable, but also insensitive to the order of transactions, which is a crucial property for an incremental algorithm.

[...] globally frequent items are neglected during clustering, clusters embedded in such transactions cannot be discovered. In this paper, we define a new notion of cluster representative, based on the frequent items within each cluster, that is easy for users to comprehend, and introduce a new similarity measure especially suitable for sparse transactional data. Using these, an incremental clustering algorithm, INCLUS, is proposed to cope with the high cardinality of transactional datasets. INCLUS is a structure-seeking algorithm, as it produces clusters based on the support and similarity constraints specified by the users. The domains of the input parameters (support and similarity thresholds) are well defined, and the parameters can be easily chosen by users.

INCLUS is significantly different from LargeItem, SUMMARY and TrK-Means, which also employ frequent items for clustering. LargeItem uses global optimization to minimize both the overlap of frequent items among clusters and the number of infrequent items for a clustering. In contrast, INCLUS uses frequent items for local optimization such that similarities for a clustering are maximized with respect to the similarity constraints. In SUMMARY, the frequent items used are globally frequent in the whole dataset, while in INCLUS they are only locally frequent within a cluster. Unlike INCLUS, TrK-Means does not use similarity constraints, and it produces k or more clusters based on the user input of k. Squeezer (7) is an incremental clustering algorithm for categorical data. Unlike INCLUS, it uses a similarity function defined as the number of matched attribute values in a pair of tuples.
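The general shape of a one-pass, threshold-driven clustering scheme of this kind can be illustrated with a minimal Python sketch. It is not the INCLUS algorithm itself: the similarity measure proposed in the paper is not given in this excerpt, so plain Jaccard similarity is used as a stand-in, and the names `representative`, `jaccard`, `incremental_cluster`, `min_support` and `min_similarity` are hypothetical. The sketch only shows how a cluster representative can be formed from locally frequent items and how each transaction can be assigned in a single pass; it makes no claim about order-independence.

```python
from collections import Counter


def representative(counts, size, min_support):
    """Items whose support *within this cluster* is at least min_support."""
    return {item for item, c in counts.items() if c / size >= min_support}


def jaccard(a, b):
    """Stand-in similarity (an assumption); not the measure defined in the paper."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def incremental_cluster(transactions, min_support, min_similarity):
    """Assign each transaction once, with no iterative refinement."""
    clusters = []  # each cluster: item counts, size, and member indices
    for idx, transaction in enumerate(transactions):
        items = set(transaction)
        best, best_sim = None, 0.0
        for cluster in clusters:
            rep = representative(cluster["counts"], cluster["size"], min_support)
            sim = jaccard(items, rep)
            if sim > best_sim:
                best, best_sim = cluster, sim
        # Open a new cluster when no representative is similar enough.
        if best is None or best_sim < min_similarity:
            best = {"counts": Counter(), "size": 0, "members": []}
            clusters.append(best)
        best["counts"].update(items)
        best["size"] += 1
        best["members"].append(idx)
    return clusters


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
    result = incremental_cluster(data, min_support=0.5, min_similarity=0.5)
    print([c["members"] for c in result])  # [[0, 1], [2, 3]]
```

Both thresholds in this sketch lie in [0, 1], which mirrors the statement above that the domains of the input parameters are well defined. Raising `min_support` shrinks each representative to the items shared by more of the cluster's transactions.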
