Abstract
The development of new-generation sequencers has made genome sequencing feasible for every organism in a laboratory. A typical data flow of de novo sequencing includes (1) assembly of sequence reads, (2) estimation of open reading frames, (3) annotation of proteins, and (4) finding RNA genes. The annotation is normally performed by BLASTP searches against several different databases. However, it is usually hard to find a plausible annotation by just looking at the results of BLASTP searches.
Here I propose a potentially automatic method of annotation that exploits automatic protein clustering using the software GCLUST, which estimates a proper similarity threshold for each list of homologs using the ‘entropy-optimized organism count’ method (Sato 2009). The software has been used to construct a homolog database including both prokaryotic and eukaryotic proteins (http://gclust.c.u-tokyo.ac.jp/). For use in genome annotation, de novo clustering is needed that includes many genomes of related organisms as well as genomes of representative organisms. The application of protein clustering to the annotation of Arthrospira platensis was the first successful case (Fujisawa et al. 2010). I present here the results of clustering the total predicted proteins of two draft cyanobacterial genomes together with the total predicted proteins of 41 cyanobacteria available at NCBI. For each of the resultant protein clusters, an alignment and a phylogenetic tree were also prepared to assist in functional annotation. The quality of the alignments was evaluated by counting ill-aligned proteins (missing N- or C-terminus, or insertion/deletion), which made up 4-13% of the total predicted proteins in most cyanobacterial genomes. Annotation may be automated by extracting significant key words already assigned to member proteins of clusters or by comparison with reference protein clusters.
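The count of ill-aligned proteins mentioned above could be approximated along the following lines. This is only a minimal sketch, not the procedure used in the study: the function names, the `min_run` threshold, and the majority rule for deciding which alignment columns are "core" are all assumptions made for illustration.

```python
import re

def find_alignment_problems(alignment, min_run=10):
    """Map each protein name to the set of problems seen in its aligned row.

    alignment: dict of name -> gapped sequence ('-' for gaps), all of equal length.
    min_run:   assumed minimum run length (in columns) that counts as a problem.
    """
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned rows must have the same length")

    # Assumption: a column is "core" when more than half of the proteins
    # have a residue there.
    core = [2 * sum(r[i] != "-" for r in rows) > len(rows) for i in range(length)]

    long_gap = re.compile("D{%d,}" % min_run)
    long_insert = re.compile("I{%d,}" % min_run)
    report = {}
    for name, row in alignment.items():
        # 'D' marks a gap in a core column; non-core columns are skipped.
        gaps = "".join("D" if row[i] == "-" else "."
                       for i in range(length) if core[i])
        # 'I' marks a residue in a non-core column.
        extra = "".join("I" if (not core[i] and row[i] != "-") else "."
                        for i in range(length))
        problems = set()
        if len(gaps) - len(gaps.lstrip("D")) >= min_run:
            problems.add("missing_N_terminus")
        if len(gaps) - len(gaps.rstrip("D")) >= min_run:
            problems.add("missing_C_terminus")
        if long_gap.search(gaps.strip("D")):
            problems.add("deletion")
        if long_insert.search(extra):
            problems.add("insertion")
        report[name] = problems
    return report

def ill_aligned_fraction(report):
    """Fraction of proteins with at least one flagged problem."""
    if not report:
        return 0.0
    return sum(1 for p in report.values() if p) / len(report)

if __name__ == "__main__":
    # Toy alignment; sequences are invented for illustration only.
    full = "ACDEFGHIKLMNPQRSTVWY"
    toy = {
        "ref1": full,
        "ref2": full,
        "ref3": full,
        "ntrunc": "-----" + full[5:],
        "ctrunc": full[:15] + "-----",
        "dele": full[:5] + "-----" + full[10:],
    }
    result = find_alignment_problems(toy, min_run=5)
    for name, problems in result.items():
        print(name, sorted(problems))
    print("ill-aligned fraction:", ill_aligned_fraction(result))
```

Summing such flags per genome, rather than per cluster, would give a percentage comparable in spirit to the one quoted above, although the actual criteria behind that figure are not stated here.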
Highlights
Current way of genome sequencing: DNA isolation from bacterial cells, library construction, sequencing (454 etc.)
The annotation is normally performed by BLASTP searches against several different databases
It is usually hard to find a plausible annotation by just looking at the results of BLASTP searches
Summary
The development of new-generation sequencers has made genome sequencing feasible for every organism in a laboratory. A typical data flow of de novo sequencing includes (1) assembly of sequence reads, (2) estimation of open reading frames, (3) annotation of proteins, and (4) finding RNA genes. The annotation is normally performed by BLASTP searches against several different databases. It is usually hard to find a plausible annotation by just looking at the results of BLASTP searches. I propose a potentially automatic method of annotation that exploits automatic protein clustering using the software GCLUST, which estimates a proper similarity threshold for each list of homologs using the ‘entropy-optimized organism count’ method (Sato 2009). The software has been used to construct a homolog database including both prokaryotic and eukaryotic proteins (http://gclust.c.u-tokyo.ac.jp/). For each of the resultant protein clusters, an alignment and a phylogenetic tree were prepared to assist in functional annotation. Annotation may be automated by extracting significant key words already assigned to member proteins of clusters or by comparison with reference protein clusters.
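The key-word idea in the last sentence can be illustrated with a small sketch. The source does not define what makes a key word "significant", so the criterion below (a word must occur in at least a given fraction of the member descriptions) is an assumption, as are the function name `cluster_keywords`, the `min_fraction` parameter, and the list of uninformative words.

```python
import re
from collections import Counter

# Assumed list of words that carry little functional information.
UNINFORMATIVE = {
    "protein", "hypothetical", "putative", "probable", "conserved",
    "unknown", "family", "domain", "like", "of", "and", "the",
}

def cluster_keywords(descriptions, min_fraction=0.5):
    """Return (word, count) pairs for words found in at least
    min_fraction of the member descriptions of one cluster."""
    counts = Counter()
    for text in descriptions:
        # Count each word once per description, ignoring case.
        words = set(re.findall(r"[a-z0-9]+", text.lower())) - UNINFORMATIVE
        counts.update(words)
    needed = min_fraction * len(descriptions)
    kept = [(w, c) for w, c in counts.items() if c >= needed]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(kept, key=lambda wc: (-wc[1], wc[0]))

if __name__ == "__main__":
    # Invented descriptions for one hypothetical cluster.
    members = [
        "photosystem II D1 protein",
        "photosystem II protein D1",
        "hypothetical protein",
        "Photosystem II reaction center D1",
    ]
    print(cluster_keywords(members))
```

A draft protein that falls into this cluster and has no annotation of its own could then inherit the top-ranked words as a candidate description, subject to manual review.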