PRIDE: Quality control in a proteomics data repository

A Csordas, R Wang, H Hermjakob, D Rios, D Ovelleiro, J M Foster, J A Vizcaino

doi:10.1093/database/bas004

Open Access

Journal: Database
Publication Date: Mar 20, 2012
Citations: 38
License type: cc-by

Affiliation: European Bioinformatics Institute, Wellcome Trust

Abstract

The PRoteomics IDEntifications (PRIDE) database is a large public proteomics data repository, containing over 270 million mass spectra (as of November 2011). PRIDE is an archival database, providing the proteomics data supporting specific scientific publications in a computationally accessible manner. While PRIDE faces rapid increases in both the size and the number of data depositions, the major challenge is to ensure high-quality data depositions in the context of highly diverse proteomics workflows and data representations. Here, we describe the PRIDE curation pipeline and its practical application in quality control of complex data depositions.

Database URL: http://www.ebi.ac.uk/pride/

Highlights

Proteomics can be defined as ‘the study of the subsets of proteins present in different parts of the organism and how they change with time and varying conditions’ (1).
The situation is already improving significantly as a result of the Human Proteome Organization Proteomics Standards Initiative (PSI) developing the standard formats mzML (2) and mzIdentML (3), which are increasingly being implemented by instrument and search engine producers.
While the generation and public availability of proteomics data are still several orders of magnitude smaller than those of, e.g., genomics data, both the quantity and the complexity of proteomics data sets deposited in the PRoteomics IDEntifications (PRIDE) database are rapidly increasing.

Summary

Original article

Introduction

The PRIDE curation pipeline

Frequent data quality issues

PRIDE curation snippets

Conclusion