Alignment-free genomic sequence comparison using FCGR and signal processing

Daniel Lichtblau

doi:10.1186/s12859-019-3330-3

Open Access

Journal: BMC Bioinformatics
Publication Date: Dec 1, 2019
Citations: 21
License type: open-access

Affiliation: Wolfram Research (United States)

Abstract

Background: Alignment-free methods of genomic comparison offer the possibility of scaling to large data sets of nucleotide sequences comprising several thousand or more base pairs. Such methods can be used to deduce “nearby” species in a reference data set, or to construct phylogenetic trees.

Results: We describe one such method that gives quite strong results. We use the Frequency Chaos Game Representation (FCGR) to create images from such sequences. We then reduce dimension, first with a Fourier trigonometric transform and then with a Singular Values Decomposition (SVD). This gives vectors of modest length. These in turn are used for fast sequence lookup, construction of phylogenetic trees, and classification of virus genomic data. We illustrate the accuracy and scalability of this approach on several benchmark test sets.

Conclusions: The tandem of FCGR and dimension reduction using Fourier-type transforms and SVD provides a powerful approach for alignment-free genomic comparison. Results compare favorably with, and often surpass, the best results reported in prior literature. Good scalability is also observed.
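To make the first stage of the pipeline concrete, the following is a minimal sketch of building an FCGR image as a grid of k-mer counts. It is an illustration under stated assumptions, not the author's implementation: the function name `fcgr`, the `CORNER` layout, and the choice to skip k-mers containing ambiguous bases are all assumptions, and corner conventions differ between papers.

```python
# Assumed corner layout: each base contributes one bit to the column (x)
# and one bit to the row (y). Other papers assign the corners differently.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out as a chaos game image."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # assumption: k-mers with ambiguous bases (e.g. N) are skipped
        x = y = 0
        # In the chaos game the most recent base picks the coarsest quadrant,
        # so the last base of the k-mer supplies the most significant bit.
        for base in reversed(kmer):
            bx, by = CORNER[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
    return grid
```

With `k = 7` this yields a 128 × 128 grid of counts, since a level-k image has 2^k cells per side.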

Highlights

Alignment-free methods of genomic comparison offer the possibility of scaling to large data sets of nucleotide sequences comprised of several thousand or more base pairs
The method scales to large sets; the most computationally intensive step is the Singular Values Decomposition (SVD), and its cost is mitigated because the number of columns is limited by the number of Discrete Fourier Cosine Transform (DCT) components retained and at most a few dozen singular values are computed
As in the microbe set, we create FCGR images at a pixelation level of 7, we keep a 30 × 30 matrix at the DCT step, and we reduce to dimension 40 at the SVD step
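The DCT truncation mentioned in these highlights can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes an unnormalized type-II DCT (the source does not state which DCT variant or normalization is used), and the names `dct_lowfreq` and `flatten` are hypothetical. Only the low-frequency block is computed, so the full transform is never formed.

```python
import math


def dct_lowfreq(image, keep):
    """2-D DCT-II of a square image, returning only the keep x keep low-frequency block.

    Assumption: unnormalized DCT-II; the source does not specify the variant.
    """
    n = len(image)
    # basis[u][i] = cos(pi * (2i + 1) * u / (2n)), the DCT-II kernel
    basis = [[math.cos(math.pi * (2 * i + 1) * u / (2 * n)) for i in range(n)]
             for u in range(keep)]
    # Transform along each row: rows[r][v] = sum_j image[r][j] * basis[v][j]
    rows = [[sum(p * c for p, c in zip(row, basis[v])) for v in range(keep)]
            for row in image]
    # Transform along columns: out[u][v] = sum_r basis[u][r] * rows[r][v]
    return [[sum(basis[u][r] * rows[r][v] for r in range(n)) for v in range(keep)]
            for u in range(keep)]


def flatten(block):
    """Concatenate the rows of the retained block into one feature vector."""
    return [value for row in block for value in row]
```

For a level-7 image (128 × 128) and `keep = 30`, `flatten(dct_lowfreq(image, 30))` gives a vector of length 900, which is the input that the SVD step would then reduce to dimension 40.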

Summary

Results

Microbial genomes: The first test of this method was on fragments of length 20000 bp from the training and test sets of microbial species in [37]. There are 23384 training sequences and 14339 test sequences. It took 16 min to read in and process all training and test genomes through the DCT step, 7 s to do the SVD step on the training vectors, and 3 s to use the resulting right multiplier matrix to project the test vectors to the reduced dimension and compute the nearest neighbors for all the test vectors. The best method in [28] found the correct identification for just over 67% of the test sequences ([28] reports that the top subsequence BLAST hit correctly identified the species for roughly 83% of the test sequences).
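The SVD and lookup steps described above can be sketched as below. This is a dependency-free illustration under several assumptions, not the author's implementation: the leading right singular vectors are approximated by power iteration with deflation (a real pipeline would more likely call a library SVD routine), the nearest neighbor is taken under Euclidean distance (the metric is not stated in this summary), and all function names are hypothetical.

```python
import math
import random


def _orthonormalize(w, basis):
    """Remove components along `basis` and scale to unit length (None if zero)."""
    for u in basis:
        dot = sum(a * b for a, b in zip(w, u))
        w = [a - dot * b for a, b in zip(w, u)]
    norm = math.sqrt(sum(a * a for a in w))
    return [a / norm for a in w] if norm > 0.0 else None


def top_right_singular_vectors(rows, dim, iters=200, seed=0):
    """Approximate the leading `dim` right singular vectors of the matrix `rows`
    by power iteration on A^T A with deflation. Assumes dim <= rank of the matrix."""
    rng = random.Random(seed)
    ncols = len(rows[0])
    basis = []
    for _ in range(dim):
        v = _orthonormalize([rng.gauss(0.0, 1.0) for _ in range(ncols)], basis)
        for _ in range(iters):
            av = [sum(a * b for a, b in zip(row, v)) for row in rows]   # A v
            w = [sum(row[j] * s for row, s in zip(rows, av))            # A^T (A v)
                 for j in range(ncols)]
            nxt = _orthonormalize(w, basis)
            if nxt is None:
                break  # no remaining variance in this direction
            v = nxt
        basis.append(v)
    return basis  # each entry plays the role of one column of the right multiplier


def project(vec, basis):
    """Map a full-length vector to the reduced dimension."""
    return [sum(a * b for a, b in zip(vec, u)) for u in basis]


def nearest_label(query, train_reduced, labels):
    """Label of the training vector closest to `query` (Euclidean, an assumption)."""
    best = min(range(len(train_reduced)),
               key=lambda i: math.dist(query, train_reduced[i]))
    return labels[best]
```

The basis is computed once from the training vectors; each test vector is then only projected and compared, which is consistent with the lookup stage being much cheaper than the preprocessing reported above.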

Conclusions

Method