Abstract

Bioinformatics has been advancing rapidly, producing datasets with more features and larger volumes. These rapid changes have, however, posed challenges to data mining applications, in particular to efficient association rule mining. Many data mining algorithms for high-dimensional datasets have been proposed, but the sheer number of algorithms, with their varying features and application scenarios, makes it difficult to choose a suitable one. We therefore present a general survey of association rule mining algorithms applicable to high-dimensional datasets. We explain the main characteristics and relative merits of these algorithms and, drawing on previous studies, point out areas for improvement and optimization strategies that might be better adapted to high-dimensional datasets. Generally speaking, association rule mining algorithms that combine diverse optimization methods with advanced computing techniques can better balance scalability and interpretability.

Highlights

  • Association rules mining (ARM), an important branch of data mining, has been extensively used in many areas since Agrawal first introduced it in 1993 [1]

  • ARM can be seen as a method aimed at discovering groups of items that co-occur with high frequency

  • A typical application of ARM to high-throughput datasets is gene association analysis (GAA) [2, 3], in which the goal is to exploit the relationships among different genes based on their expression levels

Summary

Introduction

Association rules mining (ARM), an important branch of data mining, has been extensively used in many areas since Agrawal first introduced it in 1993 [1]. In contrast to other data mining methods that rely on statistical models, ARM can extract possible relationships between variables. Data from high-throughput techniques often share the feature of high dimensionality: the number of genes in a given study can be in the thousands, while the number of specimens is generally in the dozens or hundreds. Such high dimensionality also holds for other kinds of biomedical datasets, e.g., Operational Taxonomic Unit (OTU) abundance datasets that have different levels of extra environmental factors in metagenomics analysis [4], as well as multiple datasets, including mRNA/miRNA expression data and Copy Number Variations (CNV) data from The Cancer Genome Atlas (TCGA) project [5]. To address the performance challenges posed by high-dimensional datasets, multiple specialized algorithms have been proposed in the last decade.
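
To make the idea of co-occurring items and the cost of high dimensionality concrete, the following is a minimal sketch, not any of the surveyed algorithms. It assumes expression levels have already been discretized so that each specimen is a set of "over-expressed" genes; the gene names, the toy data, and the helper functions `support`, `frequent_itemsets`, `confidence`, and `candidate_count` are all hypothetical and introduced only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each specimen is the set of genes called
# "over-expressed" after discretization (gene names are placeholders).
specimens = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneC"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneC"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def candidate_count(n_items):
    """Number of non-empty itemsets that exist over `n_items` items."""
    return 2 ** n_items - 1


if __name__ == "__main__":
    for itemset, s in frequent_itemsets(specimens, min_support=0.5).items():
        print(sorted(itemset), round(s, 2))
    print("conf(geneA -> geneB):",
          round(confidence({"geneA"}, {"geneB"}, specimens), 2))
    print("candidates for 20 genes:", candidate_count(20))
```

The brute-force enumeration visits every one of the 2^n - 1 possible itemsets, which is workable for three placeholder genes but not for the thousands of genes mentioned above. That growth is one way to see why specialized algorithms are needed for high-dimensional datasets.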

Basic association rule mining algorithms
Maximal frequent itemset mining and frequent closed itemset mining
Algorithms applicable to high‐dimensional datasets
Discussion
Methods
Conclusion