Clustering high dimensional sparse transactional data with constraints

Yanrong Li, R.P. Gopalan

doi:10.1109/grc.2006.1635896

Abstract

In this paper, we propose an incremental clustering algorithm called INCLUS for high dimensional sparse transactional data, using a newly defined similarity measure and a notion of cluster representatives based on the locally frequent items of each cluster. INCLUS seeks structures in transactional data with respect to the support and similarity constraints specified by the users. The effectiveness and the order-independence property of INCLUS are empirically studied and compared with two state-of-the-art algorithms. Though it is a one-pass algorithm without any iterative refinement, INCLUS is not only effective and scalable, but also insensitive to the order of transactions, which is a crucial property for an incremental algorithm.

... globally frequent items are neglected during clustering, clusters embedded in such transactions cannot be discovered. In this paper, we define a new notion of cluster representative based on frequent items within each cluster that is easy for users to comprehend, and introduce a new similarity measure especially suitable for sparse transactional data. Using these, an incremental clustering algorithm INCLUS is proposed to cope with the high cardinality of transactional datasets. INCLUS is a structure seeking algorithm, as it produces clusters based on the support and similarity constraints specified by the users. The domains of the input parameters (support and similarity thresholds) are well defined and can be easily chosen by users.

INCLUS is significantly different from LargeItem, SUMMARY and TrK-Means, which also employ frequent items for clustering. LargeItem uses global optimization to minimize both the overlap of frequent items among clusters and the number of infrequent items for a clustering, whereas INCLUS uses frequent items for local optimization such that similarities for a clustering are maximized with respect to the similarity constraints. In SUMMARY, the frequent items used are globally frequent in the whole data set, while in INCLUS they are only locally frequent within a cluster. Unlike INCLUS, TrK-Means does not use similarity constraints, and it produces k or more clusters based on the user input of k. Squeezer (7) is an incremental clustering algorithm for categorical data. Its similarity function, defined as the number of matched attribute values in a pair of tuples, differs from the one used by INCLUS.
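
The one-pass scheme with locally frequent items as cluster representatives can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the INCLUS similarity measure, so `similarity` below is a placeholder (the fraction of a transaction's items that appear in a cluster's representative), and the names `Cluster`, `incremental_cluster`, `min_support` and `min_similarity` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Running item counts for one cluster (hypothetical helper)."""
    size: int = 0
    counts: Counter = field(default_factory=Counter)

    def add(self, transaction):
        self.size += 1
        self.counts.update(transaction)

    def representative(self, min_support):
        # Items that are frequent *within this cluster* only.
        return {item for item, c in self.counts.items()
                if c / self.size >= min_support}


def similarity(transaction, representative):
    # Placeholder measure (an assumption, not the paper's definition):
    # share of the transaction's items covered by the representative.
    if not transaction:
        return 0.0
    return len(transaction & representative) / len(transaction)


def incremental_cluster(transactions, min_support, min_similarity):
    """Single pass: assign each transaction to its most similar cluster,
    or open a new cluster if no cluster meets the similarity threshold."""
    clusters, labels = [], []
    for t in transactions:
        t = frozenset(t)
        best, best_sim = None, 0.0
        for idx, cluster in enumerate(clusters):
            s = similarity(t, cluster.representative(min_support))
            if s > best_sim:
                best, best_sim = idx, s
        if best is None or best_sim < min_similarity:
            clusters.append(Cluster())
            best = len(clusters) - 1
        clusters[best].add(t)
        labels.append(best)
    return clusters, labels


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}, {"a", "b"}]
    clusters, labels = incremental_cluster(data, min_support=0.5, min_similarity=0.5)
    print(labels)
    print([sorted(c.representative(0.5)) for c in clusters])
```

Both thresholds lie in the interval from 0 to 1 in this sketch, which mirrors the point that the parameter domains are well defined. The sketch makes no claim about order-independence; that property is reported for INCLUS itself and depends on its actual similarity measure.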
