Abstract

Evolutionary study strategies that can compare vast numbers of genome sequences are becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established an alignment-free clustering method, “BLSOM”, for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence features (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, we observed a correlation between their phylogenetic classification and their classification on the BLSOMs, and visualized oligonucleotides diagnostic for species-specific clustering.

Highlights

  • Genome sequences, even protein-noncoding sequences, contain a wealth of information

  • This alignment-free clustering method was successfully applied to the phylogenetic classification of genome sequence fragments [15] and to the analysis of a large number of microbial sequences obtained by metagenome studies of environmental and clinical samples [16]

  • To investigate the clustering capacity of Batch-Learning Self-Organizing Map (BLSOM) for sequences derived from a wide range of eukaryotes, we first analyzed tetra- and pentanucleotide frequencies in ca. 1,800,000 nonoverlapping 5 kb sequences as well as ca. 900,000 nonoverlapping 10 kb sequences and overlapping 100 kb sequences with a 10 kb sliding step from 101 eukaryotic genomes, most of which were completely sequenced
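The fragmenting and counting step described in the last highlight can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `windows` and `kmer_frequencies` are hypothetical, k-mers containing ambiguous bases such as N are simply skipped (an assumption), and the BLSOM training itself is not shown.

```python
from collections import Counter
from itertools import product


def windows(seq, size, step=None):
    """Yield fragments of `size` bases; `step` defaults to `size` (nonoverlapping)."""
    step = step or size
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def kmer_frequencies(fragment, k=4):
    """Return relative frequencies of all 4**k k-mers, in lexicographic order.

    Assumption: k-mers containing characters other than A, C, G, T are ignored.
    """
    fragment = fragment.upper()
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy usage: 10-base windows instead of the 5 kb or 10 kb fragments in the text.
toy_genome = "ACGTACGTAC" * 3
vectors = [kmer_frequencies(w, k=4) for w in windows(toy_genome, size=10)]
```

Passing `size=100_000, step=10_000` to `windows` would give overlapping 100 kb fragments with a 10 kb sliding step, and `k=5` gives pentanucleotide vectors of length 1,024. Each resulting vector is the kind of input a self-organizing map would be trained on.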


Summary

Introduction

Genome sequences, even protein-noncoding sequences, contain a wealth of information. The G + C content (%GC) is a fundamental characteristic of individual genomes and has long been used as a basic phylogenetic parameter to characterize individual genomes and genomic portions. BLSOM could recognize and visualize species-specific characteristics of codon or oligonucleotide frequencies in individual genomes, permitting clustering of genes or genome fragments according to species without the need for species information during BLSOM learning. Various high-performance supercomputers are available for biological studies, and BLSOM is well suited to high-performance parallel computing on such machines. This alignment-free clustering method was successfully applied to the phylogenetic classification of genome sequence fragments [15] and to the analysis of a large number of microbial sequences obtained by metagenome studies of environmental and clinical samples [16]. We constructed BLSOMs with tetra- and pentanucleotide compositions in most (if not all) of the Drosophila genomes available and focused on species-specific characteristics of oligonucleotide frequencies in each Drosophila genome (genome signature), in connection with their phylogenetic classification.
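For comparison with the oligonucleotide-based signatures, %GC is a single number per sequence. The sketch below shows one common way to compute it; the function name `gc_percent` is hypothetical, and excluding ambiguous bases from the denominator is an assumption rather than something stated in the text.

```python
def gc_percent(seq):
    """Return G + C content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Assumption: sequences with no unambiguous bases report 0.0.
    return 100.0 * gc / total if total else 0.0
```

Because two genomes with the same %GC can still differ in their tetra- or pentanucleotide frequencies, a composition vector carries more discriminating information than this scalar.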
