An algorithmic approach to mining unknown clusters in training data

Robert S Lynch, Jr,Peter K Willett

doi:10.1117/12.664731

Abstract

In this paper, unsupervised learning is utilized to develop a method for mining unknown clusters in training data. The approach is based on the Bayesian Data Reduction Algorithm (BDRA), which has recently been developed into a patented system called the Data Extraction and Mining Software Tool (DEMIST). In the BDRA, the modeling assumption is that the discrete symbol probabilities of each class are a priori uniformly Dirichlet distributed, and it employs a "greedy" approach to selecting and discretizing the relevant features of each class for best performance. The primary metric for selecting and discretizing all relevant features contained in each class is an analytic formula for the probability of error conditioned on the training data. Thus, the primary contribution of this work is to demonstrate an algorithmic approach to finding multiple unknown clusters in training data, which represents an extension to the original data clustering algorithm. To illustrate performance, results are demonstrated using simulated data that contains multiple clusters. In general, the results of this work will demonstrate an effective method for finding multiple clusters in data mining applications.

Full Text