A Gradient-Based Clustering for Multi-Database Mining

Salim Miloudi, Wenjia Ding, Yulin Wang

doi:10.1109/access.2021.3050404

Abstract

Multinational corporations have multiple databases distributed throughout their branches, which store millions of transactions per day. For business applications, identifying disjoint clusters of similar and relevant databases helps to learn the common buying patterns among customers and to increase profits by targeting potential clients in the future. This process is called clustering, an important unsupervised technique for big data mining. In this article, we present an effective approach to search for the optimal clustering of multiple transaction databases in a weighted undirected similarity graph. To assess the clustering quality, we use dual gradient descent to minimize a constrained quasi-convex loss function whose parameters determine the edges needed to form the optimal database clusters in the graph. The global minimum is therefore guaranteed to be found in a finite and short time, in contrast to the existing non-convex objectives, for which all possible candidate clusterings are generated to find the ideal clustering. Moreover, our algorithm does not require specifying the number of clusters a priori and uses a disjoint-set forest data structure to keep track of the clusters as they are updated. Through a series of experiments on public data samples and precomputed similarity matrices, we show that our algorithm is more accurate and faster in practice than the existing clustering algorithms for multi-database mining.
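As a rough illustration of how a disjoint-set forest can track database clusters in a similarity graph, the sketch below merges two databases whenever the edge between them is kept. It is not the authors' algorithm: the fixed `threshold` is a hypothetical stand-in for the edge selection that the paper obtains from its loss function, and the names `DisjointSet` and `cluster_by_threshold` are illustrative only.

```python
class DisjointSet:
    """Disjoint-set forest with union by rank and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))  # each database starts in its own cluster
        self.rank = [0] * n

    def find(self, x):
        # Walk to the root, pointing each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same cluster
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def cluster_by_threshold(sim, threshold):
    """Group databases connected by edges with similarity >= threshold.

    `sim` is a symmetric similarity matrix given as a list of lists.
    The fixed threshold is an assumption made for this sketch only.
    """
    n = len(sim)
    ds = DisjointSet(n)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                ds.union(i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(ds.find(i), []).append(i)
    return sorted(groups.values())


# Toy similarity matrix for four databases (made-up values).
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
clusters = cluster_by_threshold(sim, 0.5)
```

Note that the number of clusters is not passed in anywhere; it falls out of which edges are kept, which is the property the abstract emphasizes.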

Highlights

The emergence of large multi-branch companies has led to developing new strategies for mining the transaction databases located at their different branches.
Clustering in the artificial neural networks (ANNs) literature is usually based on a competitive learning (CL) paradigm [16]–[18] where codebook weight vectors compete in order to elect the best matching unit (BMU), i.e., a neuron unit whose weight vector has the minimum distance to an input vector.
We carried out the experiments on real-world datasets, including Mushroom, Zoo and Iris, available for download from the UCI Machine Learning Repository [68], and we used a synthetic dataset, T10I4D100K, available on the Frequent Itemset Mining Dataset Repository [64].

Summary

Introduction

The emergence of large multi-branch companies has led to developing new strategies for mining the transaction databases located at their different branches. To make decisions at a global level, the traditional process consists of integrating all the branch databases into a central repository called a data warehouse; traditional mining algorithms [1]–[3] are then applied to this huge accumulated dataset to discover the global patterns supported by all the branches.

The CL algorithm continues to select and update the BMU until reaching a certain number of iterations. Despite their simplicity, CL algorithms have some major limitations, including sensitivity to initialization and the difficulty of choosing an appropriate number of clusters beforehand.
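The competitive learning loop described above can be sketched in a few lines. This is a minimal, generic winner-take-all version, not code from the paper: the function names, the learning rate `lr`, the Euclidean distance, and the initialization from randomly sampled inputs are all assumptions made for illustration.

```python
import math
import random


def bmu_index(weights, x):
    """Return the index of the best matching unit (BMU): the codebook
    vector with the minimum Euclidean distance to the input vector x."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))


def competitive_learning(data, n_units, n_iters=200, lr=0.1, seed=0):
    """Winner-take-all competitive learning (illustrative sketch).

    n_units must be chosen beforehand and the codebook is initialized from
    randomly sampled inputs; both choices affect the result.
    """
    rng = random.Random(seed)
    # Initialize the codebook with n_units distinct input vectors.
    weights = [list(v) for v in rng.sample(data, n_units)]
    for _ in range(n_iters):  # stop after a fixed number of iterations
        x = rng.choice(data)
        k = bmu_index(weights, x)
        # Move only the winning unit a fraction lr of the way toward x.
        weights[k] = [w + lr * (xi - w) for w, xi in zip(weights[k], x)]
    return weights


# Two well-separated toy groups of 2-D points (made-up values).
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook = competitive_learning(data, n_units=2)
labels = [bmu_index(codebook, x) for x in data]
```

Running the sketch with different `seed` values can place both units in the same group, which is one way to see the sensitivity to initialization noted above; `n_units` likewise has to be guessed in advance.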

Methods

Results

Conclusion