Abstract
During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and freely available for researchers.
Highlights
One of the most important research fields in data mining is mining interesting patterns in large data sets
Of frequent itemset mining, biclustering, another important data mining concept, was proposed to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters purely based on their similarities
The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner
Summary
One of the most important research fields in data mining is mining interesting patterns (such as sequences, episodes, association rules, correlations, or clusters) in large data sets. Of frequent itemset mining, biclustering, another important data mining concept, was proposed to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters purely based on their similarities This property makes biclustering a powerful approach especially when it is applied to data with a large number of objects. One of the most important properties of biclustering when applied to binary (0, 1) data is that it provides the same results as frequent closed itemsets mining (Figure 1) Such biclusters, called inclusion-maximal biclusters (or IMBs), were introduced in [7] together with a mining algorithm, BiMAX, to discover all biclusters in a binary matrix that are not entirely contained by any other cluster. The proposed algorithm has been implemented in the commonly used MATLAB environment, rigorously tested on both synthetic and real data sets, and freely available for researchers (http://pr.mk.unipannon.hu/Research/bit-table-biclustering/)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have