Abstract

BackgroundHigh-throughput sequencing data are widely collected and analyzed in the study of complex diseases in quest of improving human health. Well-studied algorithms mostly deal with single data source, and cannot fully utilize the potential of these multi-omics data sources. In order to provide a holistic understanding of human health and diseases, it is necessary to integrate multiple data sources. Several algorithms have been proposed so far, however, a comprehensive comparison of data integration algorithms for classification of binary traits is currently lacking.ResultsIn this paper, we focus on two common classes of integration algorithms, graph-based that depict relationships with subjects denoted by nodes and relationships denoted by edges, and kernel-based that can generate a classifier in feature space. Our paper provides a comprehensive comparison of their performance in terms of various measurements of classification accuracy and computation time. Seven different integration algorithms, including graph-based semi-supervised learning, graph sharpening integration, composite association network, Bayesian network, semi-definite programming-support vector machine (SDP-SVM), relevance vector machine (RVM) and Ada-boost relevance vector machine are compared and evaluated with hypertension and two cancer data sets in our study.In general, kernel-based algorithms create more complex models and require longer computation time, but they tend to perform better than graph-based algorithms. The performance of graph-based algorithms has the advantage of being faster computationally.ConclusionsThe empirical results demonstrate that composite association network, relevance vector machine, and Ada-boost RVM are the better performers. We provide recommendations on how to choose an appropriate algorithm for integrating data from multiple sources.

Highlights

  • High-throughput sequencing data are widely collected and analyzed in the study of complex diseases in quest of improving human health

  • The results compared are based on (1) Pearson correlation matrix; (2) simple Bayesian network and (3) radial basis function kernel with a scaling parameter sigma that is determined by grid search using 5-fold cross validation in the training set

  • The reasons are as following: In our study (1) Spearman’s rank correlation matrix and Pearson correlation matrix are used as weight matrix in graph-based semi-supervised learning, graph sharpening integration, and composite association network, the negative elements in the two correlation

Read more

Summary

Introduction

High-throughput sequencing data are widely collected and analyzed in the study of complex diseases in quest of improving human health. In order to provide a holistic understanding of human health and diseases, it is necessary to integrate multiple data sources. Several algorithms have been proposed so far, a comprehensive comparison of data integration algorithms for classification of binary traits is currently lacking. Ma et al [3] have proposed an effective method for the integrative analysis of DNA-methylation and gene expression in epigenetic modules. Graph and kernel methods are common ways for integrating multiple data sources for the classification of binary traits. Multiple graph- and kernel-based data integration algorithms have been proposed, making the selection of appropriate tools difficult. There is a lack of empirical studies on how the graph- and kernel-based data integration algorithms perform on real data. We want to emphasize that the purpose of this paper is not to identify the best performing algorithm based on different combinations of data sources, but to compare the performance of data integration algorithms given a fixed number of data sources at hand

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call