Abstract
Background: The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”.
Results: Here we suggest that the true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain.
Conclusions: Recent improvements in computational domain modeling are decreasing the relative size of “dark matter”, albeit slowly; however, its absolute size increases substantially with the growth of sequence data.
Highlights
The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding
Currently defined “dark matter” of the protein sequence universe includes protein sequences that cannot be matched to any known protein family [1]
We argue that while the Conserved Domain Database (CDD) is superior in overall computational coverage, it may not be the best choice for defining protein domains
Summary
The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. The currently known protein space, which is the part of the protein universe that has been revealed by DNA sequencing, consists of more than 16 million protein sequences in a non-redundant (nr) database (December 8, 2011), and its size is rapidly increasing due to recent technological advances [4,5]. Because only a small fraction of the current protein space can be analyzed by traditional experimental techniques, computational classification of protein sequences and their assignment to known biological functions is critical [6,7].
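The subtraction idea can be illustrated with a minimal sketch that counts "dark" residues in a single sequence before and after removing regions unlikely to contain a domain. This is not the authors' pipeline: the function names, the half-open interval representation, and the example coordinates are all assumptions made for illustration, and how the excluded regions are predicted is left outside the sketch.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) residue ranges, 0-based.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: List[Interval]) -> int:
    """Number of residues covered by the union of the intervals."""
    return sum(end - start for start, end in merge(intervals))


def dark_residues(length: int,
                  domains: List[Interval],
                  excluded: List[Interval]) -> Tuple[int, int]:
    """Return (dark_before, dark_after) for one sequence.

    dark_before: residues not matched by any domain model.
    dark_after:  the same count once regions assumed unlikely to hold a
                 domain (the `excluded` list) are also subtracted.
    Intervals are assumed to lie within [0, length).
    """
    dark_before = length - covered(domains)
    dark_after = length - covered(domains + excluded)
    return dark_before, dark_after


# Made-up example: a 300-residue protein with one domain hit and two
# excluded regions, one of which overlaps the domain hit.
before, after = dark_residues(300, [(10, 120)], [(100, 150), (250, 300)])
print(before, after)
```

Taking the union of domain hits and excluded regions before counting keeps residues that fall in both from being subtracted twice.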