Abstract

Background: Targeted diagnosis and treatment options depend on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough view of the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes.

Solution: To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types.

Performance: We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on “flattening” complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
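As a concrete illustration of the kind of transformation step the abstract describes, the minimal sketch below combines a nested mutation dataset with clinical records to build per-patient feature vectors. It uses plain Spark in Scala rather than the TraNCE API, and all dataset, case-class, and column names (Mutation, Patient, Clinical, burden, and so on) are hypothetical examples, not part of the framework.

```scala
// Minimal sketch (plain Spark, not TraNCE): one integration step of the kind the
// abstract describes -- flattening a nested mutation collection and joining it with
// clinical attributes to assemble a per-patient feature table for learning analysis.
// All names here are hypothetical illustrations.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FeatureVectorSketch {
  // Hypothetical schemas: each patient carries a nested collection of mutations.
  case class Mutation(gene: String, impact: Double)
  case class Patient(patientId: String, mutations: Seq[Mutation])
  case class Clinical(patientId: String, age: Int, stage: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("feature-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val patients = Seq(
      Patient("p1", Seq(Mutation("TP53", 0.9), Mutation("BRCA1", 0.4))),
      Patient("p2", Seq(Mutation("TP53", 0.2)))
    ).toDS()
    val clinical = Seq(Clinical("p1", 61, "II"), Clinical("p2", 54, "III")).toDS()

    // "Flattening" the nested collection before aggregating is the common
    // workaround for nested data; a framework like TraNCE aims to generate
    // such distributed plans automatically.
    val geneBurden = patients
      .select($"patientId", explode($"mutations").as("m"))
      .groupBy($"patientId", $"m.gene")
      .agg(sum($"m.impact").as("burden"))

    // Join with clinical attributes to assemble the feature table.
    val features = geneBurden.join(clinical, Seq("patientId"))
    features.show(truncate = false)

    spark.stop()
  }
}
```

Hand-writing and tuning pipelines of this shape for every analysis is the burden that the abstract argues a scalable, automated solution should remove.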

Highlights

  • Introduction: Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions

  • Solution: To address these issues and achieve scalable processing of multi-modal biomedical data, we present Transforming Nested Collections Efficiently (TraNCE), a framework that automates the difficulties of designing distributed analyses with complex biomedical data types

  • The standard compilation acts as a baseline for the “shredded compilation”; this compilation route is reflective of current procedures for handling nested data, such as that provided by Spark SQL [32] (RRID: SCR_016557); see the sketch after this list
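The sketch below illustrates the flattening-style baseline that the last highlight refers to: Spark SQL handles a nested collection by exploding it into flat rows (here via LATERAL VIEW) before aggregating. The view and column names are hypothetical, and the shredded compilation strategy itself is not shown.

```scala
// Minimal sketch of the "flattening" baseline: nested data handled in Spark SQL
// by exploding the inner collection, then aggregating over the flat rows.
// Table and column names are hypothetical; this is not the TraNCE shredded route.
import org.apache.spark.sql.SparkSession

object FlatteningBaselineSketch {
  case class Variant(gene: String, vaf: Double)
  case class Sample(sampleId: String, variants: Seq[Variant])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flattening-baseline").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(
      Sample("s1", Seq(Variant("KRAS", 0.35), Variant("EGFR", 0.12))),
      Sample("s2", Seq(Variant("KRAS", 0.08)))
    ).toDS().createOrReplaceTempView("samples")

    // The nested 'variants' array is flattened first, then grouped and aggregated.
    spark.sql(
      """SELECT sampleId, v.gene, COUNT(*) AS n, AVG(v.vaf) AS mean_vaf
        |FROM samples
        |LATERAL VIEW explode(variants) t AS v
        |GROUP BY sampleId, v.gene""".stripMargin
    ).show(truncate = false)

    spark.stop()
  }
}
```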


Summary

Introduction

Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough view of the impact of a disease on the underlying system. These are consolidated data sources from hundreds of thousands of patients and counting, such as the 1000 Genomes Project [4], the International Cancer Genome Consortium (ICGC) [5], The Cancer Genome Atlas (TCGA) [6], and UK Biobank [7]. This scenario has introduced a demand for data processing solutions that can handle such large-scale datasets; scalable data integration and aggregation solutions capable of supporting joint inference play a key role in advancing biomedical analysis.

