Automatic protein clustering as a basis of automatic annotation

Naoki Sato

doi:10.1038/npre.2010.5086.1

Abstract

Development of new-generation sequencers has made genome sequencing feasible for every organism in a laboratory. A typical data flow of de novo sequencing includes (1) assembly of sequence reads, (2) estimation of open reading frames, (3) annotation of proteins, and (4) finding RNA genes. The annotation is normally performed by BLASTP searches against several different databases. However, it is usually hard to find a plausible annotation by just looking at the results of BLASTP searches.

Here I propose a potentially automatic method of annotation that exploits automatic protein clustering using the software GCLUST, which estimates a proper similarity threshold for each list of homologs using the ‘entropy-optimized organism count’ method (Sato 2009). The software has been used to construct a homolog database including both prokaryotic and eukaryotic proteins (http://gclust.c.u-tokyo.ac.jp/). For use in genome annotation, we need de novo clustering that includes many genomes of related organisms as well as genomes of representative organisms. Application of protein clustering to the annotation of Arthrospira platensis was the first successful case (Fujisawa et al. 2010). I present here the results of clustering the total predicted proteins of two draft cyanobacterial genomes together with the total predicted proteins of 41 cyanobacteria available at NCBI. For each of the resultant protein clusters, an alignment and a phylogenetic tree were also prepared to assist functional annotation. The quality of the alignments was evaluated by counting ill-aligned proteins (missing N- or C-terminus, or insertion/deletion), which accounted for 4-13% of the total predicted proteins in most cyanobacterial genomes. Annotation may be automated by extracting significant key words already assigned to member proteins of clusters or by comparison with reference protein clusters.
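The alignment quality check described above (counting proteins with a missing N- or C-terminus or an internal insertion/deletion) could be sketched roughly as follows. This is only an illustration, not the procedure actually used in the study: the function names, the gap-length thresholds, and the input format (a mapping from protein ID to a gapped alignment row) are all assumptions. It also flags only gaps within a row, so a long insertion in one protein would show up as a deletion in the other members.

```python
import re
from typing import Dict, List


def classify_aligned_sequence(row: str, max_end_gap: int = 10,
                              max_internal_gap: int = 10) -> List[str]:
    """Return labels explaining why an aligned row looks ill-aligned.

    An empty list means the row passed all checks. The thresholds are
    hypothetical defaults, not values taken from the paper.
    """
    labels = []
    without_leading = row.lstrip("-")
    leading = len(row) - len(without_leading)
    core = without_leading.rstrip("-")
    trailing = len(without_leading) - len(core)

    if leading > max_end_gap:
        labels.append("missing_N_terminus")
    if trailing > max_end_gap:
        labels.append("missing_C_terminus")

    # Longest run of gap characters strictly inside the aligned region.
    longest_internal = max((len(g) for g in re.findall(r"-+", core)), default=0)
    if longest_internal > max_internal_gap:
        labels.append("internal_deletion")
    return labels


def ill_aligned_fraction(alignment: Dict[str, str], **thresholds: int) -> float:
    """Fraction of rows in one alignment that receive at least one label."""
    if not alignment:
        return 0.0
    flagged = sum(
        1 for row in alignment.values()
        if classify_aligned_sequence(row, **thresholds)
    )
    return flagged / len(alignment)


# Toy alignment with made-up sequences, for illustration only.
toy_alignment = {
    "protein_a": "MKTAYIAKQR",
    "protein_b": "---AYIAKQR",
    "protein_c": "MKTA---KQR",
    "protein_d": "MKTAYIAK--",
}
toy_fraction = ill_aligned_fraction(toy_alignment, max_end_gap=2, max_internal_gap=2)
```

Summing the flagged proteins over all clusters and dividing by the number of predicted proteins in a genome would give a per-genome percentage comparable in spirit to the 4-13% figure, although the exact criteria behind that figure are not stated here.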

Highlights

Current way of genome sequencing: DNA isolation from bacterial cells, library construction, sequencing (454 etc.).
The annotation is normally performed by BLASTP searches against several different databases.
It is usually hard to find a plausible annotation by just looking at the results of BLASTP searches.

Summary

Development of new-generation sequencers has made genome sequencing feasible for every organism in a laboratory. A typical data flow of de novo sequencing includes (1) assembly of sequence reads, (2) estimation of open reading frames, (3) annotation of proteins, and (4) finding RNA genes. The annotation is normally performed by BLASTP searches against several different databases. It is usually hard to find a plausible annotation by just looking at the results of BLASTP searches. I propose a potentially automatic method of annotation that exploits automatic protein clustering using the software GCLUST, which estimates a proper similarity threshold for each list of homologs using the ‘entropy-optimized organism count’ method (Sato 2009). The software has been used to construct a homolog database including both prokaryotic and eukaryotic proteins (http://gclust.c.u-tokyo.ac.jp/). For each of the resultant protein clusters, an alignment and a phylogenetic tree were prepared to assist functional annotation. Annotation may be automated by extracting significant key words already assigned to member proteins of clusters or by comparison with reference protein clusters.
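To make the last idea concrete, the sketch below picks key words that recur across the existing descriptions of the proteins in one cluster. It is a minimal illustration under stated assumptions, not the method implemented in GCLUST: "significant" is approximated here by a simple frequency cutoff (`min_fraction`), and the stop-word list, the function name, and the example descriptions are all hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical list of words that carry little functional information.
UNINFORMATIVE = {
    "hypothetical", "protein", "putative", "unknown", "conserved",
    "family", "domain", "predicted", "probable", "uncharacterized",
    "like", "of", "and", "the",
}


def cluster_keywords(descriptions: Iterable[str],
                     min_fraction: float = 0.5) -> List[Tuple[str, float]]:
    """Return (keyword, fraction of members using it), most frequent first.

    Each description is counted once per keyword, so a word repeated
    within one description does not inflate its score.
    """
    word_sets = [
        set(re.findall(r"[a-z0-9]+", text.lower())) - UNINFORMATIVE
        for text in descriptions
    ]
    if not word_sets:
        return []
    counts = Counter(word for words in word_sets for word in words)
    total = len(word_sets)
    kept = [
        (word, count / total)
        for word, count in counts.items()
        if count / total >= min_fraction
    ]
    # Sort by descending fraction, then alphabetically for stable output.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up descriptions for the members of a single cluster.
example_descriptions = [
    "chaperone protein DnaJ",
    "DnaJ domain protein",
    "hypothetical protein",
    "molecular chaperone DnaJ",
]
example_keywords = cluster_keywords(example_descriptions)
```

With these toy inputs, a member annotated only as "hypothetical protein" would inherit the key words shared by the other members of its cluster. A real pipeline would likely need a curated stop-word list and some check against reference clusters before transferring an annotation.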

Current way of genome sequencing

Maximal S is defined by

New functional categories of proteins

Main category cellular structure unclassified Total

Findings

DnaJ protein

Journal: Nature Precedings
Publication Date: Oct 22, 2010
License type: CC BY 3.0
