Abstract

Background: High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional data-sets. In a clinical setting one is often interested in how mass spectra differ between patients of different classes, for example spectra from healthy patients vs. spectra from patients having a particular disease. Machine learning algorithms are needed to (a) identify these discriminating features and (b) classify unknown spectra based on this feature set. Since the acquired data is usually noisy, the algorithms should be robust against noise and outliers, while the identified feature set should be as small as possible.

Results: We present a new algorithm, Sparse Proteomics Analysis (SPA), based on the theory of compressed sensing that allows us to identify a minimal discriminating set of features from mass spectrometry data-sets. We show (1) how our method performs on artificial and real-world data-sets, (2) that its performance is competitive with standard (and widely used) algorithms for analyzing proteomics data, and (3) that it is robust against random and systematic noise. We further demonstrate the applicability of our algorithm to two previously published clinical data-sets.

Highlights

  • High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional data-sets

  • Feature selection from simulated data-sets: we assess our framework of Sparse Proteomics Analysis (SPA) in a typical mass-spectrometry setting, namely extracting the features of MS data that discriminate between two groups

  • Workflows for analyzing high-dimensional data often contain a step where discriminating features between two groups need to be identified. This matters for applications such as classification and clustering, and it is essential for understanding biological differences, e.g. between two phenotypes
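
To make the two-group feature-selection task in the highlights concrete, here is a minimal sketch in plain Python. It is not the SPA algorithm: it uses a generic stand-in (soft-thresholding the label-weighted mean spectrum), chosen only because it yields a sparse weight vector. The function names, the Gaussian noise model, the peak positions and the threshold value are all assumptions made for illustration.

```python
import math
import random


def simulate_spectra(n_per_group, n_features, informative, shift, seed=0):
    """Simulate two groups of noisy spectra (hypothetical toy model).

    Every intensity is standard Gaussian noise; spectra in group +1 get
    `shift` added at the `informative` feature indices.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (-1, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in informative:
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def sparse_direction(X, y, threshold):
    """Soft-threshold the label-weighted mean spectrum, then normalise.

    Features whose label-weighted mean is at most `threshold` in absolute
    value get weight zero, so the returned vector is sparse.
    """
    n, d = len(X), len(X[0])
    corr = [sum(y[i] * X[i][j] for i in range(n)) / n for j in range(d)]
    w = [math.copysign(max(abs(c) - threshold, 0.0), c) for c in corr]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w] if norm > 0 else w


def selected_features(w):
    """Indices of features with non-zero weight."""
    return [j for j, v in enumerate(w) if v != 0]


# Toy usage: 30 spectra per group, 200 features, 3 planted peaks.
X, y = simulate_spectra(30, 200, informative=[5, 50, 120], shift=3.0)
w = sparse_direction(X, y, threshold=0.75)
print(selected_features(w))
```

In this toy setting the planted peaks have a label-weighted mean near 1.5 while pure-noise features stay near 0, so a threshold between the two separates them. Real spectra would need preprocessing, and the threshold would have to be chosen from the data.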


Summary

Introduction

High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional data-sets. High-throughput assay systems for measuring a variety of biological sources have become standard in modern laboratories. They allow very large data-sets to be created quickly and cheaply; such data-sets characterize, for example, the status of a cell by its billions of constituents (e.g. nucleotides, RNAs, proteins, or metabolites). Many disease-relevant mechanisms are controlled by proteins (e.g. hormones), which can be detected in biological samples (blood, urine, etc.) using mass spectrometry (MS). This technique potentially allows monitoring of the entire set of proteins, the so-called proteome. Because MS is widely available in hospitals, MS-based proteomics can drive progress in diagnostics, since even subtle changes in the proteome can be detected and linked to disease onset and progression [1,2,3,4].
