Abstract

During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and freely available for researchers.

Highlights

  • One of the most important research fields in data mining is mining interesting patterns in large data sets

  • Of frequent itemset mining, biclustering, another important data mining concept, was proposed to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters purely based on their similarities

  • The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner

Read more

Summary

Introduction

One of the most important research fields in data mining is mining interesting patterns (such as sequences, episodes, association rules, correlations, or clusters) in large data sets. Of frequent itemset mining, biclustering, another important data mining concept, was proposed to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters purely based on their similarities This property makes biclustering a powerful approach especially when it is applied to data with a large number of objects. One of the most important properties of biclustering when applied to binary (0, 1) data is that it provides the same results as frequent closed itemsets mining (Figure 1) Such biclusters, called inclusion-maximal biclusters (or IMBs), were introduced in [7] together with a mining algorithm, BiMAX, to discover all biclusters in a binary matrix that are not entirely contained by any other cluster. The proposed algorithm has been implemented in the commonly used MATLAB environment, rigorously tested on both synthetic and real data sets, and freely available for researchers (http://pr.mk.unipannon.hu/Research/bit-table-biclustering/)

Problem Formulation
Mining Frequent Closed Itemsets Using Bit-Table Operations
Experimental Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call