Abstract

Advances in health are contingent on the continuous development of new methods and approaches that foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, and the organization, management, and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial, and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study lies in a new and expanded set of CBDA features, including (1) efficient handling of extremely large datasets (>100,000 cases and >1,000 features); (2) generalized internal and external validation steps; (3) an expanded set of base-learners for joint ensemble prediction; (4) automated selection of CBDA specifications; and (5) mechanisms to assess CBDA convergence, evaluate prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, CBDA 2.0 is validated on synthetic datasets as well as a population-wide census-like study. Specifically, the empirical validation of the CBDA technique is grounded in translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These challenges stem from the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruence, and missingness of the data, as well as violations of classical parametric assumptions. Our results demonstrate the scalability, efficiency, and usability of CBDA for translating complex data into structured information, derived knowledge, and translational action. Applying CBDA 2.0 to the UK Biobank case study enables the prediction of various outcomes of interest, e.g., mood disorders and irritability, and suggests new avenues of evidence-based research for identifying, tracking, and treating mental health conditions and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source code, and results, facilitating independent validation, result reproducibility, and team-based collaborative discovery.
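To illustrate the core mechanism described above, the following is a minimal sketch of the iterative subsampling and feature-mining idea: base-learners are fit on many small random subsamples of cases and features, and features are ranked by how often they participate in models that validate well. This is an illustrative approximation only, not the authors' implementation (CBDA is distributed as an R package built on SuperLearner ensembles); the function name, thresholds, and the single scikit-learn base-learner here are all assumptions made for the sketch.

```python
# Illustrative sketch of CBDA-style subsampling and feature mining.
# Not the authors' implementation; a single logistic-regression base-learner
# stands in for the SuperLearner ensemble used by the actual R package.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def cbda_feature_mining(X, y, n_iter=500, case_frac=0.1, feat_frac=0.01, top_k=20):
    """Rank features by how often they appear in well-performing models
    fit on random (case, feature) subsamples of the data."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    hits = np.zeros(p)  # count of appearances in "good" subsample models
    for _ in range(n_iter):
        # Draw a small random subsample of cases and of features
        rows = rng.choice(n, size=max(10, int(case_frac * n)), replace=False)
        cols = rng.choice(p, size=max(2, int(feat_frac * p)), replace=False)
        Xs, ys = X[np.ix_(rows, cols)], y[rows]
        if len(np.unique(ys)) < 2:
            continue  # skip subsamples with a single outcome class
        Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.3, random_state=0)
        if len(np.unique(ytr)) < 2:
            continue
        model = LogisticRegression(max_iter=500).fit(Xtr, ytr)
        # Internal validation: credit features only when the model predicts well
        # (the 0.6 accuracy cutoff is an arbitrary illustrative threshold)
        if accuracy_score(yte, model.predict(Xte)) > 0.6:
            hits[cols] += 1
    return np.argsort(hits)[::-1][:top_k]  # indices of the top-ranked features
```

In the full CBDA protocol, each subsample is fit with an ensemble of base-learners rather than a single model, and the convergence, accuracy, and consistency diagnostics mentioned above determine when enough subsamples have been drawn.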

Highlights

  • Data Science is an emerging transdisciplinary field connecting the theoretical, computational, experimental, biomedical, social, environmental, and economic areas

  • Our results demonstrate the scalability, efficiency, and usability of Compressive Big Data Analytics (CBDA) for translating complex data into structured information, derived knowledge, and translational action

  • This validation workflow runs on the LONI pipeline environment [18], a free platform for high-performance computing that allows the simultaneous submission of hundreds of independent components of the CBDA protocol


Introduction

Data Science is an emerging transdisciplinary field connecting the theoretical, computational, experimental, biomedical, social, environmental, and economic areas. It deals with enormous amounts of complex, incongruent, and dynamic data (Big Data) from multiple sources and aims to develop algorithms, methods, tools, and services capable of ingesting such datasets and generating semi-automated decision support systems. Other significant hurdles and gaps pertain to the nature of Big Data itself and to the tools and methods available to handle it. Examples of the former include Big Data heterogeneity [1], noise concentration [2], and spurious correlations [3]. Regarding the latter, advanced tools and ensemble methods for handling large, time-varying, and heterogeneous datasets rely on robust predictive models; the specification and implementation of optimal, feasible, scalable, and convergent algorithms; advanced computational workflow protocols; access to appropriate computational resources; and scalable infrastructure.
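To make the "spurious correlations" hurdle concrete, the short simulation below (our illustration, not taken from the paper) shows how, when features vastly outnumber cases, pure noise variables can appear strongly correlated with an entirely unrelated outcome; the sample sizes and seed are arbitrary choices for the demonstration.

```python
# Illustration (not from the paper): spurious correlations in high dimensions.
# With far more noise features than cases, some feature will correlate
# strongly with an unrelated outcome purely by chance.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 10_000                 # 100 cases, 10,000 pure-noise features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)         # outcome independent of every feature

# Pearson correlation of each feature with the outcome
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corrs = Xc.T @ yc / n

print(f"max |correlation| among noise features: {np.abs(corrs).max():.2f}")
# Typically prints about 0.4, large enough to masquerade as a real signal.
```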
