A novel bioinformatics method for efficient knowledge discovery by BLSOM from big genomic sequence data.

Yu Bai,Shigehiko Kanaya,Toshimichi Ikemura,Yuki Iwasaki,Yue Zhao

doi:10.1155/2014/765648

Abstract

With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of big sequence data. The Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition, on a single map. By modifying the conventional SOM, we previously developed the Batch-Learning SOM (BLSOM), which classifies sequence fragments according to species solely on the basis of oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used to characterize vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes, and then the compositions in the human and mouse genomes, in order to investigate an efficient method for detecting differences between closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a “genome signature,” as well as specific regions enriched in transcription-factor-binding sequences. Because its classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
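The input to such an analysis is one pentanucleotide frequency vector per 100 kb fragment. The following is a minimal sketch of that vectorization, not the authors' implementation. The function names, the skipping of windows that contain ambiguous bases such as N, the use of non-overlapping windows, and the decision not to merge complementary strands are all assumptions made for illustration.

```python
from itertools import product

K = 5  # pentanucleotides, as in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**5 = 1024 words
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(fragment):
    """Return the normalized 5-mer frequency vector of one fragment.

    Words containing anything other than A, C, G, T are skipped
    (an assumption; the source does not say how they are handled).
    """
    counts = [0] * len(KMERS)
    seq = fragment.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def fragment_vectors(genome, window=100_000):
    """Cut a sequence into non-overlapping windows and vectorize each one."""
    return [
        kmer_frequencies(genome[start:start + window])
        for start in range(0, len(genome) - window + 1, window)
    ]
```

Each fragment becomes a point in a 1024-dimensional frequency space, which is the kind of high-dimensional data a SOM is meant to project onto a two-dimensional map.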

Highlights

Genome sequences, in both their protein-coding and non-coding parts, contain a wealth of information
In Kohonen’s original Self-Organizing Map (SOM), the initial vectors are set to random values, whereas in the Batch-Learning SOM (BLSOM) they are set by principal component analysis (PCA), based on the widest scale of the sequence distribution in the oligonucleotide frequency space [13]
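To make the idea of PCA-based initialization concrete, here is a minimal sketch in pure Python. It places the initial vectors on the plane spanned by the first two principal components of the data, so the starting map is deterministic rather than random. The function names, the power-iteration PCA, and the `span` scaling (how many standard deviations the lattice covers along each axis) are assumptions for illustration; the exact lattice size and scaling used by BLSOM are given in [13] and are not reproduced here.

```python
import math


def principal_axes(data, n_axes=2, iters=200):
    """Return the mean, the leading unit axes, and their standard deviations."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    axes, sigmas = [], []
    for _axis in range(n_axes):
        v = [1.0 / (k + 1) for k in range(d)]  # fixed, deterministic start
        eigval = 0.0
        for _step in range(iters):
            # Covariance times v, computed as X^T (X v) / n.
            proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
            w = [sum(p * row[k] for p, row in zip(proj, centered)) / n
                 for k in range(d)]
            # Deflation: remove the axes that were already found.
            for u in axes:
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-15:  # no variance left in this direction
                eigval = 0.0
                break
            v = [a / norm for a in w]
            eigval = norm
        axes.append(v)
        sigmas.append(math.sqrt(eigval))
    return mean, axes, sigmas


def pca_initial_weights(data, rows, cols, span=2.0):
    """Place rows x cols initial vectors on the PC1/PC2 plane (rows, cols >= 2)."""
    mean, (b1, b2), (s1, s2) = principal_axes(data)
    weights = []
    for i in range(rows):
        a = span * s1 * (2.0 * i / (rows - 1) - 1.0)  # offset along PC1
        row = []
        for j in range(cols):
            b = span * s2 * (2.0 * j / (cols - 1) - 1.0)  # offset along PC2
            row.append([m + a * x + b * y for m, x, y in zip(mean, b1, b2)])
        weights.append(row)
    return weights
```

Because nothing here depends on a random seed, two runs on the same data start from the same map, which is one ingredient of a reproducible result.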

Summary

Introduction

Genome sequences, in both their protein-coding and non-coding parts, contain a wealth of information. Various linguistic tools for analyzing DNA sequences have been developed [8, 9]. An unsupervised neural network algorithm, Kohonen’s Self-Organizing Map (SOM), is a powerful tool for clustering and visualizing high-dimensional complex data on a two-dimensional map [10, 11, 12]. On the basis of batch-learning SOM, we previously developed a modification of the conventional SOM for genome and gene sequence analyses that makes the learning process and the resulting map independent of the order of data input: BLSOM [13, 14, 15]. BLSOM is well suited to high-performance parallel computing and can analyze big sequence data, such as millions of genomic sequences, simultaneously [16].
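The order independence of batch learning can be illustrated with a generic batch SOM update. This is a sketch under stated assumptions, not the published BLSOM procedure: the function names, the square neighborhood of size `radius`, and the fixed learning rate `alpha` are hypothetical choices, and the actual schedules are described in [13, 14, 15]. The key point is that all inputs are assigned to their closest lattice node while the map is frozen, and the nodes are updated only afterwards.

```python
def best_matching_unit(weights, x):
    """Return (i, j) of the lattice node whose vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (i, j), d
    return best


def batch_epoch(weights, data, radius=1, alpha=0.5):
    """One batch-learning pass; the map stays frozen while inputs are assigned."""
    rows, cols, dim = len(weights), len(weights[0]), len(data[0])
    sums = [[[0.0] * dim for _ in range(cols)] for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]
    for x in data:
        bi, bj = best_matching_unit(weights, x)
        # Every node within `radius` of the winner collects this input.
        for i in range(max(0, bi - radius), min(rows, bi + radius + 1)):
            for j in range(max(0, bj - radius), min(cols, bj + radius + 1)):
                counts[i][j] += 1
                for k in range(dim):
                    sums[i][j][k] += x[k]
    updated = []
    for i in range(rows):
        new_row = []
        for j in range(cols):
            w = weights[i][j]
            if counts[i][j]:
                mean = [s / counts[i][j] for s in sums[i][j]]
                # Move the node part of the way toward the mean of its inputs.
                w = [a + alpha * (m - a) for a, m in zip(w, mean)]
            new_row.append(list(w))
        updated.append(new_row)
    return updated
```

Since each node's update depends only on sums and counts over the whole data set, shuffling the inputs leaves the result unchanged up to floating-point rounding. The assignment loop also has no dependency between inputs, so it can be split across workers and the partial sums merged, which is consistent with the parallel-computing suitability noted above.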

Methods

Results

Discussion

Conclusion