Parallel extraction of association rules from genomics data

Giuseppe Agapito, Pietro Hiram Guzzi, Mario Cannataro

doi:10.1016/j.amc.2017.09.026

Abstract

High-throughput experimental platforms such as microarrays produce massive amounts of omics data for each analyzed sample. For example, the Affymetrix DMET (Drug Metabolizing Enzymes and Transporters) microarray platform can detect Single Nucleotide Polymorphisms (SNPs) in 225 human genes involved in the absorption, distribution, metabolism, and excretion (ADME) of drugs, enabling large pharmacogenomics studies. Moreover, applying such platforms to large populations of subjects further increases the size of the experimental datasets produced in clinical studies. The production of big omics datasets is therefore a primary reason to use parallel computing in bioinformatics. Such omics datasets are usually analyzed with classical statistical methods and, more recently, with data mining methods that can extract knowledge hidden in the data, e.g. by highlighting multiple associations among features of the data. However, applying standard off-the-shelf data mining algorithms to large omics datasets, especially for association rule mining, poses two main issues: (i) very large main-memory requirements, which may prevent the execution of data mining software on personal/desktop computers; and (ii) very long response times, which may increase the time required to complete extensive pharmacogenomics studies. To overcome the limits of standard association rule mining algorithms when applied to omics datasets, we propose PARES (Parallel Association Rules Extractor from SNPs), a novel parallel algorithm for the efficient extraction of association rules from omics datasets. PARES is implemented as a multithreaded implementation of an optimized version of the Frequent Pattern Growth (FP-Growth) algorithm. Moreover, it includes a customized preprocessing strategy for SNP datasets, based on a Fisher’s Test Filter, that discards trivial transactions from the input dataset, reducing the search space from which many independent FP-Trees are built. The experimental results show that PARES achieves good speedup and high memory management efficiency compared with several association rule mining algorithms implemented in the main off-the-shelf data mining platforms.
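
The abstract does not specify how the Fisher’s Test Filter is applied, so the following Python sketch only illustrates the general idea under stated assumptions: each SNP is tested for association with a binary case/control label using a two-sided Fisher's exact test on a 2x2 contingency table, and SNPs with no genotype below a significance threshold are dropped before any pattern mining. The function names (`fisher_exact_two_sided`, `filter_snps`), the input layout, and the threshold `alpha=0.05` are hypothetical and are not taken from PARES.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def filter_snps(genotypes, labels, alpha=0.05):
    """Keep SNPs with at least one genotype associated with the class label.

    genotypes: dict mapping a SNP id to one genotype call per subject (assumed layout).
    labels:    list of booleans, True for case and False for control (assumed).
    """
    kept = []
    for snp, calls in genotypes.items():
        for genotype in sorted(set(calls)):
            # Build the 2x2 table: genotype present/absent versus case/control.
            a = sum(1 for g, y in zip(calls, labels) if g == genotype and y)
            b = sum(1 for g, y in zip(calls, labels) if g == genotype and not y)
            c = sum(1 for g, y in zip(calls, labels) if g != genotype and y)
            d = sum(1 for g, y in zip(calls, labels) if g != genotype and not y)
            if fisher_exact_two_sided(a, b, c, d) <= alpha:
                kept.append(snp)
                break
    return kept


# Toy data: "snp_a" separates cases from controls, "snp_b" does not.
toy_genotypes = {
    "snp_a": ["A/A"] * 5 + ["A/G"] * 5,
    "snp_b": ["C/C", "C/T"] * 5,
}
toy_labels = [True] * 5 + [False] * 5
selected = filter_snps(toy_genotypes, toy_labels)
```

In this toy input only `snp_a` survives the filter, so a subsequent FP-Growth pass would build its trees from a smaller item set. How PARES maps the filter onto transactions, and which threshold it uses, would need to be checked against the full text.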

Full Text