Abstract

Dimension reduction (DR) methods play an essential role in analyzing and visualizing high-dimensional multi-source data. In recent decades, many variants of these methods have been developed across disciplines and domains. Because of the diversity and ever-increasing number of available techniques, choosing the right method for a given problem is difficult. In this study we benchmark 87 methods for integrative dimension reduction of mRNA expression and DNA methylation data, a common problem in biology and medicine. Our ranking is based on four main factors, each evaluated on multiple datasets generated by InterSIM (a semi-realistic multi-source data simulator in the genomics domain): quality of dimension reduction (local, global, and local-global neighborhood preservation), clustering quality, speed, and sensitivity to input parameters. The results are then validated on a real breast cancer dataset through visual evaluation metrics such as co-ranking matrices, inspection of true cancer sub-types in two-dimensional projections, and LCMC curves. We also demonstrate the relationships between the methods via network inference. The findings of this study can be useful for algorithm selection and for planning experimental designs in multi-source data analysis.

Highlights

  • Analysis of data from multiple sources is a rapidly emerging area with an ever-increasing role in biology and medicine, and data integration has become an important research area

  • We compiled a list of methods that can be used for integrative data analysis, selected from different families: Dimension Reduction (DR) methods, Non-negative Matrix Factorization (NMF), Joint Matrix Factorization (JMF), Joint Non-negative Matrix Factorization (JNMF), Multi-Block data methods (MB), Bayesian Multi-Block models (BMB), and Joint/Separated Matrix Factorization (JSMF)

  • The Local Continuity Meta-Criterion (LCMC) [6] is a parameter-free and widely accepted quality measure for dimension reduction of single-view datasets. It can be defined as the average overlap between each point's k-nearest neighbors in the high-dimensional space and its k-nearest neighbors in the low-dimensional projection
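
The LCMC definition in the last highlight can be illustrated with a minimal sketch. It assumes Euclidean distances, a single-view dataset, and the commonly used normalized form of the criterion (mean neighborhood overlap divided by k, minus the chance-level term k/(N-1)); the function names `knn_indices` and `lcmc` are hypothetical and are not taken from the paper.

```python
import numpy as np

def knn_indices(X, k):
    """Return the indices of the k nearest neighbors of every row of X
    (Euclidean distance, the point itself excluded)."""
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def lcmc(X_high, X_low, k):
    """Assumed normalized LCMC: average k-NN overlap between the original
    data and its projection, divided by k, minus the overlap expected
    by chance, k / (N - 1)."""
    n = X_high.shape[0]
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    # Number of shared neighbors for each point
    overlap = np.array(
        [len(np.intersect1d(a, b)) for a, b in zip(nn_high, nn_low)]
    )
    return overlap.mean() / k - k / (n - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))   # synthetic "high-dimensional" data
    Y = X[:, :2]                    # crude projection: keep two coordinates
    print("LCMC of the crude projection:", lcmc(X, Y, k=5))
    print("LCMC of a perfect embedding:", lcmc(X, X, k=5))
```

Evaluating `lcmc` over a range of k values gives an LCMC curve of the kind mentioned in the abstract; higher values indicate better preservation of neighborhoods of size k.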

Summary

Introduction

Analysis of data from multiple sources is a rapidly emerging area with an ever-increasing role in biology and medicine, and data integration has become an important research area. An important problem in omics data analysis is the high dimensionality of the data. The main goal in DR is to keep the distances between points in a low-dimensional projection as close as possible to the distances between the same points in the high-dimensional space. A good DR method should produce a lower-dimensional projection that is faithful to the original high-dimensional space. This faithfulness is relative: depending on the goal of the analysis, one might concentrate on preserving either the local or the global neighborhoods. For data coming from multiple sources (or views) this becomes even more complicated, because
