Dynamics of domain coverage of the protein sequence universe

Bhanu Rekapalli, Gregory D Peterson, Kristin Wuichet, Igor B Zhulin

doi:10.1186/1471-2164-13-634

Abstract

Background: The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”.

Results: Here we suggest that the true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain.

Conclusions: Recent improvements in computational domain modeling result in a slow decrease in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data.

Highlights

The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding.
Currently defined “dark matter” of the protein sequence universe includes protein sequences that cannot be matched to any known protein family [1].
We argue that while the Conserved Domain Database (CDD) is superior in overall computational coverage, it may not be the best choice for defining protein domains.

Summary

Introduction

The currently known protein space, which is the part of the protein universe that has been revealed by DNA sequencing, consists of more than 16 million protein sequences in a non-redundant (nr) database (December 8, 2011), and its size is rapidly increasing due to recent technological advances [4,5]. Only a small fraction of the current protein space can be analyzed by traditional experimental techniques, so computational classification of protein sequences and their assignment to known biological functions is critical [6,7].
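The approach stated in the abstract, subtracting regions that are unlikely to contain any domain before measuring “dark matter”, can be illustrated at the residue level. The following is a minimal sketch, not the authors' method: the function names, the half-open interval convention, and the choice to report “dark matter” as a fraction of the full sequence length are all assumptions made for illustration, and how the excluded regions would be identified is left open.

```python
from typing import List, Tuple

# Hypothetical convention: half-open [start, end) residue coordinates.
Interval = Tuple[int, int]


def clip(intervals: List[Interval], seq_len: int) -> List[Interval]:
    """Restrict intervals to the sequence bounds and drop empty ones."""
    clipped = [(max(0, s), min(seq_len, e)) for s, e in intervals]
    return [(s, e) for s, e in clipped if s < e]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no residue is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def dark_fraction(seq_len: int,
                  domains: List[Interval],
                  excluded: List[Interval]) -> float:
    """Fraction of residues that are neither covered by a domain hit
    nor inside a region assumed unlikely to contain a domain."""
    accounted = merge(clip(domains + excluded, seq_len))
    accounted_len = sum(end - start for start, end in accounted)
    return (seq_len - accounted_len) / seq_len


if __name__ == "__main__":
    # Illustrative numbers only: a 100-residue protein with one domain hit.
    domains = [(10, 40)]
    excluded = [(35, 50), (90, 100)]  # assumed domain-free regions
    print(dark_fraction(100, domains, []))        # before subtraction
    print(dark_fraction(100, domains, excluded))  # after subtraction
```

Merging intervals before summing matters here because domain hits and excluded regions can overlap, and counting the overlap twice would understate the remaining “dark matter”.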

Journal: BMC Genomics
Publication Date: Nov 16, 2012
Citations: 40
License type: cc-by
