Abstract

We present the Proteome Quality Index (PQI; http://pqi-list.org), a much-needed resource for users of bacterial and eukaryotic proteomes. Completely sequenced genomes for which a set of protein sequences (the proteome) is available are given a one- to five-star rating supported by 11 different metrics of quality. The database indexes over 3000 proteomes at the time of writing and is provided via a website for browsing, filtering and downloading. Prior to this work, there was no systematic way to account for the large variability in quality among the thousands of proteomes, and this is likely to have profoundly influenced the outcome of many published studies, in particular large-scale comparative analyses. The lack of a measure of proteome quality is likely due to the difficulty of producing one, a problem that we have approached by integrating multiple metrics. The continued development and improvement of the index will require the contribution of additional metrics by us and by others; the PQI provides a useful point of reference for the scientific community, but it is only the first step towards a 'standard' for the field.

Highlights

  • There is a strong need in the scientific community for ways to quantify the quality of protein sequence datasets deduced from the sequenced genomes

  • The quality and consistency of proteomes vary enormously, both in the individual sequences of each protein and in the completeness of the protein collection and how representative it is of the proteins in the complete genome (Chothia and Gough, 2009). In other fields, such as nucleic acid sequencing, 3D protein structure determination or collection of gene expression data, there have been community-wide agreements settled among journals, data repositories [e.g. the International Nucleotide Sequence Database Collaboration (Nakamura et al, 2013), Protein Data Bank (Rose et al, 2013) or Gene Expression Omnibus (Barrett et al, 2013)], funding bodies and scientists

  • We propose a concrete starting point: the Proteome Quality Index (PQI) database, which is largely based on our SUPERFAMILY database

Introduction

There is a strong need in the scientific community for ways to quantify the quality of protein sequence datasets deduced from sequenced genomes. This need arises because there is an enormous variability in the quality and consistency of proteomes, both in terms of the individual sequences of each protein and in terms of the completeness of the protein collection and how representative it is of the proteins in the complete genome (Chothia and Gough, 2009). In other fields, such as nucleic acid sequencing, 3D protein structure determination or collection of gene expression data, there have been community-wide agreements settled among journals, data repositories [e.g. the International Nucleotide Sequence Database Collaboration (Nakamura et al, 2013), Protein Data Bank (Rose et al, 2013) or Gene Expression Omnibus (Barrett et al, 2013)], funding bodies and scientists. No comparable agreement exists for proteomes, largely because of a lack of metrics by which the quality of a ‘complete’ proteome can be systematically assessed.
