Unique function words characterize genomic proteins

Andrea Scaiewicz, Michael Levitt

doi:10.1073/pnas.1801182115

Abstract

Between 2009 and 2016, the number of protein sequences from known species increased 10-fold, from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function, but CDART often matches the same region of a protein with two or more profiles. Such synonyms complicate estimates of functional complexity. We perform full-linkage clustering of redundant profiles by finding maximum disjoint cliques; each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly, by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs (sequences with multiple domain architectures, MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of "words" or UFWs (57% shared), the "sentences" (MDAs) are different (1.3% shared).
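The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that two profiles count as redundant when they match overlapping regions of the same protein (the `min_overlap` threshold is hypothetical), that "maximum disjoint cliques" can be approximated by repeatedly removing the largest remaining clique, and that the representative of each cluster is the profile with the most matches. All function names and the toy data are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def redundancy_graph(hits, min_overlap=0.5):
    """Link two profiles that hit overlapping regions of the same protein.

    hits: iterable of (protein_id, profile_id, start, end), end inclusive.
    min_overlap: hypothetical threshold on overlap / shorter match length.
    """
    by_protein = defaultdict(list)
    graph = defaultdict(set)
    for protein, profile, start, end in hits:
        by_protein[protein].append((profile, start, end))
        graph[profile]  # register the node even if it never gets an edge
    for matches in by_protein.values():
        for (p, s1, e1), (q, s2, e2) in combinations(matches, 2):
            if p == q:
                continue
            shared = min(e1, e2) - max(s1, s2) + 1
            shorter = min(e1 - s1, e2 - s2) + 1
            if shared >= min_overlap * shorter:
                graph[p].add(q)
                graph[q].add(p)
    return graph


def largest_clique(graph, nodes):
    """Largest clique among `nodes`, found with Bron-Kerbosch plus pivoting."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = sorted(clique)
            return
        pivot = max(sorted(candidates | excluded),
                    key=lambda n: len(graph[n] & candidates))
        for node in sorted(candidates - graph[pivot]):
            expand(clique | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(nodes), set())
    return set(best)


def cluster_profiles(graph):
    """Greedy partition into disjoint cliques (full linkage: all pairs linked)."""
    remaining = set(graph)
    clusters = []
    while remaining:
        clique = largest_clique(graph, remaining)
        clusters.append(clique)
        remaining -= clique
    return clusters


# Toy data: A, B and C cover the same region of P1; E overlaps only C (on P2).
hits = [
    ("P1", "A", 10, 100), ("P1", "B", 15, 105), ("P1", "C", 12, 98),
    ("P1", "D", 200, 300),
    ("P2", "C", 5, 90), ("P2", "E", 10, 85),
]
graph = redundancy_graph(hits)
clusters = cluster_profiles(graph)

# Assumed rule: the representative is the profile with the most matches.
hit_counts = defaultdict(int)
for _, profile, _, _ in hits:
    hit_counts[profile] += 1
reps = {min(c, key=lambda p: (-hit_counts[p], p)): sorted(c) for c in clusters}
print(reps)
```

In this toy case, E is linked to C but not to A or B, so full linkage keeps it out of the {A, B, C} cluster. A single-linkage scheme would have merged all four profiles into one cluster.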


Journal: Proceedings of the National Academy of Sciences of the United States of America
Publication Date: Jun 12, 2018
Citations: 10
License type: CC BY-NC-ND 4.0


