Hierarchical Clustering of Shotgun Proteomics Data

Ville R Koskinen, David M Creasy, Patrick A Emery, John S Cottrell

doi:10.1074/mcp.m110.003822


Open Access

https://doi.org/10.1074/mcp.m110.003822


Abstract

A new result report for Mascot search results is described. A greedy set cover algorithm is used to create a minimal set of proteins, which is then grouped into families on the basis of shared peptide matches. Protein families with multiple members are represented by dendrograms, generated by hierarchical clustering using the score of the nonshared peptide matches as a distance metric. The peptide matches to the proteins in a family can be compared side by side to assess the experimental evidence for each protein. If the evidence for a particular family member is considered inadequate, the dendrogram can be cut to reduce the number of distinct family members.
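The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the report itself is a Perl script); the function names, the tie-breaking rule, and the input layout (a mapping from protein accession to a set of matched peptides) are assumptions made for illustration. In particular, `nonshared_score` takes one possible reading of "the score of the nonshared peptide matches", namely the summed score of peptides matched to exactly one of the two proteins.

```python
def greedy_set_cover(protein_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the most
    peptides not yet covered. Ties are broken by accession (an assumption)."""
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        # max() returns the first maximum, so sorting makes ties deterministic.
        best = max(sorted(protein_peptides),
                   key=lambda p: len(protein_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


def group_families(chosen, protein_peptides):
    """Group the chosen proteins into families: proteins linked, directly or
    through other members, by shared peptide matches end up together."""
    families = []  # list of (member accessions, union of their peptides)
    for prot in chosen:
        members, peps = {prot}, set(protein_peptides[prot])
        untouched = []
        for fam_members, fam_peps in families:
            if fam_peps & peps:
                # The new protein shares a peptide with this family: merge.
                members |= fam_members
                peps |= fam_peps
            else:
                untouched.append((fam_members, fam_peps))
        families = untouched + [(members, peps)]
    return [sorted(m) for m, _ in families]


def nonshared_score(a, b, protein_peptides, peptide_score):
    """Hypothetical distance: summed score of the peptide matches that
    proteins a and b do NOT have in common."""
    nonshared = protein_peptides[a] ^ protein_peptides[b]
    return sum(peptide_score[p] for p in nonshared)


# Toy input (invented for illustration, not taken from the paper).
matches = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},      # every peptide also matches P1
    "P3": {"c", "d"},
    "P4": {"e", "f"},
}
minimal = greedy_set_cover(matches)
families = group_families(minimal, matches)
```

In this toy input, P2 is never selected because P1 already explains all of its peptides, while P1 and P3 fall into one family through the shared peptide c. A family's pairwise distances could then be passed to any standard agglomerative clustering routine to build the dendrogram.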

Highlights

A comprehensive description of the “Protein Inference Problem” can be found in the review by Nesvizhskii and Aebersold [1].
Computational tools for protein inference and estimation of protein false discovery rate (FDR) have been reviewed by Li et al. [2], who observed that they can be categorized as deterministic approaches (DBParser [3], Mass Sieve [4], EPIR [5], Isoform Resolver [6], DTASelect [7], ProteinScape [8], IDPicker [9, 10], PROVALT [11]) or probabilistic approaches (Qscore [12], PRISM [13], ProteinProphet [14], PRO_PROBE [15], PANORAMICS [16], and EBP [17]).
Other approaches include the use of protein interaction network information as a basis for accepting protein identifications that might otherwise be rejected as unsafe, such as proteins identified by a single peptide [21]; spectral networks, in which overlapping uninterpreted MS/MS spectra are combined into longer chains and mapped directly to protein sequences [22]; the classification of peptides according to a fully characterized gene model [23, 24]; and MAYU analysis to estimate the FDR for an existing set of protein identifications [25].

Summary

EXPERIMENTAL PROCEDURES

Searches of a public domain data set distributed for the Association of Biomolecular Resource Facilities iPRG2008 study [30] are used to illustrate points in the discussion. Peak lists were generated by iPRG committee members in a variety of formats. The Mascot Generic Format peak list set used here was downloaded from https://www.abrf.org/index.cfm/group. Automatic decoy mode was used, which generates and searches a separate database of random sequences in which the number of entries and the length of each entry are the same as in the target database. Search parameters were:

Enzyme : Trypsin/P
Fixed modifications : iTRAQ4plex (K), iTRAQ4plex (N-term), Methylthio (C)
Variable modifications : Acetyl (Protein N-term), Gln->pyro-Glu (N-term Q), Oxidation (M)
Mass values : Monoisotopic
Peptide Mass Tolerance : ± 0.9 Da
Fragment Mass Tolerance : ± 0.6 Da
Max Missed Cleavages : 1
Instrument type : ESI-TRAP
Number of queries : 33,191

Modification names and compositions are taken from Unimod (http://www.unimod.org). The report is generated by a Perl script that calls the Mascot Parser library to read data from the Mascot result file.
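The decoy database described above, with the same number of entries and the same entry lengths as the target, can be mimicked with a short Python sketch. This is an illustration only, not Mascot's own generator: the `DECOY_` accession prefix, the fixed seed, and the uniform sampling over the 20 standard amino acids are all assumptions.

```python
import random

# The 20 standard amino acids; uniform sampling is an assumption made here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_decoy_database(target, seed=0):
    """Return one random sequence per target entry, each with the same
    length as the target sequence it replaces."""
    rng = random.Random(seed)  # seeded so the decoy set is reproducible
    return {
        "DECOY_" + accession: "".join(rng.choice(AMINO_ACIDS) for _ in sequence)
        for accession, sequence in target.items()
    }


# Toy target database (invented sequences, for illustration only).
target_db = {"T1": "MKTAYIAKQR", "T2": "GSHMLEDPVK" * 3}
decoy_db = make_decoy_database(target_db)
```

Searching the same spectra against both sets gives a count of decoy matches that can serve as an estimate of the number of false matches in the target results.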

RESULTS AND DISCUSSION

If at least one of p’s peptides is contained by a protein in S1

CONCLUSIONS