Abstract

Background

Mass spectrometry based proteomics is a promising way to transform medicine and has made great progress in identifying disease biomarkers for clinical diagnosis and prognosis. However, effective feature selection methods that capture essential data behaviors well enough to achieve clinical-level disease diagnosis are lacking. Moreover, the field faces a data reproducibility challenge: no two independent studies have been found to produce the same proteomic patterns. This reproducibility issue makes the identified biomarker patterns unrepeatable and prevents their real clinical use.

Methods

In this work, we propose a novel machine-learning algorithm, derivative component analysis (DCA), for high-dimensional mass spectral proteomic profiles. As an implicit feature selection algorithm, DCA examines input proteomics data in a multi-resolution approach, seeking its derivatives to capture latent data characteristics and conduct de-noising. We further demonstrate DCA's advantages in disease diagnosis by treating the input proteomics data as a profile biomarker and integrating DCA with support vector machines to tackle the reproducibility issue, and we compare it with state-of-the-art peers.

Results

Our results show that high-dimensional proteomics data are actually linearly separable under the proposed DCA. As a novel multi-resolution feature selection algorithm, DCA not only overcomes the weakness of traditional methods in discovering subtle data behavior, but also suggests an effective way to overcome the reproducibility problem of proteomics data, and it provides new techniques and insights for translational bioinformatics and machine learning. The DCA-based profile biomarker diagnosis makes clinical-level diagnostic performance reproducible across different proteomic data, and is more robust and systematic than existing diagnosis based on biomarker discovery.

Conclusions

Our findings demonstrate the feasibility and power of the proposed DCA-based profile biomarker diagnosis in achieving high sensitivity and overcoming the data reproducibility issue in serum proteomics. Furthermore, DCA suggests that gleaning subtle data characteristics and de-noising are essential for separating true signals from red herrings in high-dimensional proteomic profiles, and can be more important than conventional feature selection or dimension reduction. In particular, our profile biomarker diagnosis can be generalized to other omics data because DCA is a generic data analysis method.
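
The abstract describes examining a profile at multiple resolutions and de-noising it, but gives no algorithmic details. The sketch below is therefore only an illustration of the general idea, not the authors' DCA: it assumes a plain Haar wavelet transform as the multi-resolution tool, and the function names (`decompose`, `reconstruct`, `denoise`) and the toy spectrum are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One Haar level: scaled pairwise sums (coarse) and differences (fine)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level, interleaving the recovered sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def decompose(signal, levels):
    """Return (coarsest approximation, details ordered finest to coarsest)."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def reconstruct(approx, details):
    """Rebuild the signal from the coarsest approximation and all details."""
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


def denoise(signal, levels, drop_finest=1):
    """Zero the finest detail levels (assumed here to be noise) and rebuild."""
    approx, details = decompose(signal, levels)
    for k in range(min(drop_finest, levels)):
        details[k] = [0.0] * len(details[k])
    return reconstruct(approx, details)


# Toy "spectrum" (made-up intensities, not real proteomic data).
spectrum = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
smoothed = denoise(spectrum, levels=2, drop_finest=1)
```

With `drop_finest=1`, each adjacent pair of intensities is replaced by its mean, so the small point-to-point jitter disappears while the large step between the two plateaus is kept. Treating the finest level as noise is an assumption of this sketch; which levels carry signal would have to be decided from the data.
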

Highlights

  • Mass spectrometry based proteomics is a promising way to transform medicine and has made great progress in identifying disease biomarkers for clinical diagnosis and prognosis

  • High-performance disease phenotype discrimination remains a challenge in translational bioinformatics because of the special characteristics of serum proteomics data and its well-known data reproducibility issue: no two independent studies have been found to produce the same proteomic patterns [3,4,5]

  • We propose novel derivative component analysis based support vector machines (DCA-SVM) to handle serum proteomic disease diagnosis, which is equivalent to a binary or multi-class classification problem
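
To make the framing of diagnosis as binary classification concrete, here is a minimal sketch under stated assumptions. It is not DCA-SVM: a perceptron is swapped in for the support vector machine to keep the example dependency-free, and the feature vectors are made-up, zero-centered toy values. The perceptron is a convenient stand-in because it reaches an error-free pass over the training set only when the data are linearly separable, which is the property the abstract reports for DCA-processed profiles. The names `train_perceptron` and `predict` are hypothetical.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Train a perceptron on +1/-1 labels.

    Returns (weights, bias, converged). `converged` is True only if a full
    pass over the training set produced no mistakes.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified or on the boundary: update
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return weights, bias, True
    return weights, bias, False


def predict(weights, bias, x):
    """Return +1 (case) or -1 (control) for one feature vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Toy feature vectors (made up): two controls (-1) and two cases (+1).
profiles = [[-1.0, -0.8], [-1.2, -1.0], [1.0, 0.9], [0.8, 1.1]]
diagnosis = [-1, -1, 1, 1]
weights, bias, separable = train_perceptron(profiles, diagnosis)
```

A multi-class diagnosis could be handled the same way by training one such classifier per class (one-vs-rest); that extension is a common convention, not something the source specifies.
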


Summary

Introduction

Mass spectrometry based proteomics is a promising way to transform medicine and has made great progress in identifying disease biomarkers for clinical diagnosis and prognosis. However, effective feature selection methods that capture essential data behaviors well enough to achieve clinical-level disease diagnosis are lacking. The field also faces a data reproducibility challenge: no two independent studies have been found to produce the same proteomic patterns. Although the data contain a large number of m/z ratios (peptides or proteins), only a few of them (e.g., peaks) contribute meaningfully to disease diagnosis and data variations. Such data are not noise-free, because normalization methods cannot remove the built-in system noise of mass spectrometry technology itself [6,7]. The high dimensionality directly prevents conventional classification algorithms from achieving clinical-level disease diagnosis, limits their generalization capability, and can even cause regularity problems in classification [7].

