Abstract

Rapid advances in high-throughput sequencing technology have generated a large number of multi-omics biological datasets. Integrating data from different omics provides an unprecedented opportunity to gain insight into disease mechanisms from different perspectives. However, integrative analysis and predictive modeling of multi-omics data face three major challenges: i) heavy noise; ii) high dimensionality relative to the small number of samples; iii) data heterogeneity. Current multi-omics data integration approaches have limitations and are susceptible to heavy noise. In this paper, we present MSPL, a robust supervised multi-omics data integration method that identifies significant multi-omics signatures during the integration process and simultaneously predicts cancer subtypes. The proposed method not only inherits the generalization performance of self-paced learning but also exploits the correlated information contained in multi-omics data to interactively recommend high-confidence samples for model training. We demonstrate the capabilities of MSPL using simulated data and five multi-omics biological datasets, integrating up to three omics to identify potential biological signatures, and evaluate its performance against state-of-the-art methods in binary and multi-class classification problems. Our proposed model makes multi-omics data integration more systematic and expands its range of applications.

Highlights

  • We demonstrate the capability of Multimodal Self-paced Learning (MSPL) and compare its prediction and feature selection performance with other state-of-the-art methods using simulated data and five publicly available multi-omics datasets, including four benchmark cancer datasets and one breast cancer multi-omics dataset

Introduction

Driven by the development of new high-throughput sequencing techniques, various types of biological data with different formats, sizes, and structures have been increasing at an unprecedented rate. MiRNA expression, proteins, DNA methylation and metabolites are some examples of biological data produced by high-throughput techniques such as microarrays [1] and mass spectrometry [2]. Each of these distinct biological data types provides different, partially independent and complementary information about the entire genome [3]. Deciphering complex human genomes and gene functions may require more complete and complementary information than a single type of data can provide. The integration of multi-omics data (e.g. genomics, transcriptomics, proteomics and metabolomics) provides an unprecedented opportunity to gain insight into complex disease mechanisms from different views and levels, predict the subtype of the target disease, and discover potential multi-omics biological signatures [4]–[6].
