A predictive model for vertebrate bone identification from collagen using proteomic mass spectrometry

Heyi Yang,David Fenyö,Erin R Butler,Donald Siegel,Jennifer Teubl,Samantha A Monier,Beatrix Ueberheide

doi:10.1038/s41598-021-90231-5

Open Access

Abstract

Proteogenomics is an increasingly common method for species identification, as it allows for rapid and inexpensive interrogation of an unknown organism’s proteome, even when the proteome is partially degraded. The proteomic method typically uses tandem mass spectrometry to survey all peptides detectable in a sample that frequently contains hundreds or thousands of proteins. Species identification is based on detection of a small number of species-specific peptides. Genetic analysis of proteins by mass spectrometry, however, is a developing field, and the bone proteome, typically consisting of only two proteins, pushes the limits of this technology. Nearly 20% of highly confident spectra from modern human bone samples identify non-human species when searched against a vertebrate database, as would be necessary with a fragment of unknown bone. These non-human peptides are often the result of current limitations in mass spectrometry or algorithm interpretation errors. Consequently, it is difficult to know if a “species-specific” peptide used to identify a sample is actually present in that sample. Here we evaluate the causes of peptide sequence errors and propose an unbiased, probabilistic approach to determine the likelihood that a species is correctly identified from bone without relying on species-specific peptides.

Highlights

Proteogenomics is an increasingly common method for species identification as it allows for rapid and inexpensive interrogation of an unknown organism’s proteome—even when the proteome is partially degraded
The problem of determining species from an unknown bone is compounded as large protein databases must be searched which inevitably leads to the identification of false positives resulting from the limitations of mass spectrometry and interpreting algorithms[7,8]
Identifying vertebrates from bone samples using proteins pushes against the limitations of all these criteria

Summary

Introduction

Proteogenomics is an increasingly common method for species identification, as it allows for rapid and inexpensive interrogation of an unknown organism’s proteome, even when the proteome is partially degraded. The problem of determining species from an unknown bone is compounded because large protein databases must be searched (e.g. vertebrate or mammal), which inevitably leads to the identification of false positives resulting from the limitations of mass spectrometry (e.g. fragmentation efficiency) and interpreting algorithms[7,8]. Data presented here demonstrate that an unbiased analysis of all highly confident collagen spectra (i.e. not weighting them for species specificity), using a logistic regression classifier, represents a new method for vertebrate bone taxonomic identification: it does not rely on comparing individual species databases[1], but rather uses empirical data, including the large numbers of false positives that are inevitably detected from single-source samples. As databases become more complete, accurate identification of more taxa to the species level with only limited peptide coverage will improve (Supplemental Table S5).
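To make the idea of a logistic regression classifier over unweighted spectra concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single hypothetical feature per candidate taxon (the fraction of highly confident collagen spectra assigned to that taxon) and a binary label indicating whether the taxon was the true source of the sample. The feature choice, the function names (`fit_logistic`, `predict_proba`) and all numbers are illustrative assumptions; the training values are made up and are not data from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    # Estimated probability that the candidate taxon is the true source.
    return sigmoid(w * x + b)


# Hypothetical toy data (NOT from the paper): for each (sample, candidate taxon)
# pair, the fraction of confident spectra assigned to the candidate, and whether
# that candidate was the known source of the sample.
train_fraction = [0.82, 0.78, 0.85, 0.80, 0.10, 0.05, 0.12, 0.08, 0.15, 0.03]
train_is_source = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

w, b = fit_logistic(train_fraction, train_is_source)

# Score two hypothetical candidates for an unknown bone sample.
for frac in (0.80, 0.10):
    print(f"fraction={frac:.2f} -> P(source)={predict_proba(frac, w, b):.3f}")
```

In this sketch the false positives are not filtered out; they simply lower the fraction observed for the true source and raise it for other taxa, and the fitted model learns from known single-source samples how large a fraction is typical of a correct identification. A real classifier would likely use more features and a maintained library rather than hand-written gradient descent.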

Methods

Results

Conclusion