SWeeP: representing large biological sequences datasets in compact vectors

Camilla Reginatto De Pierri, Mariane Gonçalves Kulik, J Miguel Ortega, Jeroniza Nunes Marchaukoski, Antonio Camilo Da Silva Filho, Ricardo Voyceik, Fabio O Pedrosa, Bruno Thiago De Lima Nichio, Roberto Tadeu Raittz, Josué Oliveira Camargo, Aryel Marlus Repula De Oliveira, Dieval Guizelini, Letícia Graziela Costa Santos De Mattos

doi:10.1038/s41598-019-55627-4


Open Access



Abstract

Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP scans the sequences with spaced words and generates indices to build a high-dimensional vector, which is then projected onto a smaller, randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms at low computational cost. We compared SWeeP to other alignment-free methods, and SWeeP was 10 to 100 times faster than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
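The pipeline described in the abstract (scan with spaced words, turn each word into an index, fill a high-dimensional vector, project it onto a small random base) can be illustrated with a toy sketch. This is not the authors' tool: the mask "101", the binary presence encoding, the projected length v = 32, the Gaussian random matrix standing in for the randomly oriented base, and all function names are assumptions chosen to keep the example small and dependency-free.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def spaced_word_indices(seq, mask="101"):
    """Yield one integer index per spaced word found in seq.

    Positions marked '1' in the (hypothetical) mask are read, positions
    marked '0' are skipped. The residues read are treated as base-20 digits.
    """
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for start in range(len(seq) - len(mask) + 1):
        window = seq[start:start + len(mask)]
        try:
            digits = [AA_INDEX[window[i]] for i in keep]
        except KeyError:
            continue  # skip windows that read a non-standard residue
        index = 0
        for d in digits:
            index = index * len(AMINO_ACIDS) + d
        yield index


def high_dim_vector(seq, mask="101"):
    """Presence vector of length u = 20 ** (number of '1's in the mask).

    Assumption: entries are binary (word present or absent).
    """
    u = len(AMINO_ACIDS) ** mask.count("1")
    w = [0.0] * u
    for idx in spaced_word_indices(seq, mask):
        w[idx] = 1.0
    return w


def random_basis(u, v, seed=0):
    """u x v Gaussian matrix scaled by 1/sqrt(v).

    Assumption: a stand-in for the randomly oriented base; its columns are
    only approximately orthogonal, and the scaling keeps squared lengths
    unchanged on average.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(v)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(v)] for _ in range(u)]


def project(w, basis):
    """Project the length-u vector w onto the v columns of basis."""
    v = len(basis[0])
    out = [0.0] * v
    for i, wi in enumerate(w):
        if wi:  # only non-zero entries contribute
            row = basis[i]
            for j in range(v):
                out[j] += wi * row[j]
    return out


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


basis = random_basis(u=400, v=32, seed=42)
vec_a = project(high_dim_vector("MKVLAAGIVGLLLAQ"), basis)
vec_b = project(high_dim_vector("MKVLAAGIVGLCLAQ"), basis)
distance = euclidean(vec_a, vec_b)
```

All sequences must be projected with the same basis (here, the same seed) for their compact vectors to be comparable; the distance between `vec_a` and `vec_b` then approximates the distance between the corresponding high-dimensional vectors.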

Highlights

Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data.
We propose SWeeP, a method that handles large data sets, reducing computational costs while ensuring the quality of gene product analysis. It represents protein sequences as compact vectors by projecting k-mer sets onto a randomly oriented quasi-orthonormal base, with a sufficient number of coordinates to maintain intersequence comparisons.
Note that we propose v ≪ u (u = 160,000 and v = 600 in the cases studied in this paper), so that the Singular Value Decomposition (SVD) of B is computationally simpler than that of a set of vectors of length u, e.g. W.
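To give a sense of what v ≪ u means in practice, the arithmetic below compares the dense storage needed for vectors of length u = 160,000 and v = 600. The values of u and v come from the text; the float64 storage and the data set size of 10,000 sequences are assumptions made only for illustration. Note also that 160,000 = 20^4, which is consistent with spaced words that read four amino-acid positions, although that interpretation is an assumption here.

```python
U = 160_000  # high-dimensional vector length reported in the text
V = 600      # projected vector length reported in the text
BYTES_PER_FLOAT64 = 8  # assumption: dense float64 storage


def dense_bytes(n_sequences, length, bytes_per_value=BYTES_PER_FLOAT64):
    """Bytes needed to store n_sequences dense vectors of the given length."""
    return n_sequences * length * bytes_per_value


n = 10_000  # hypothetical data set size, chosen only for illustration

w_bytes = dense_bytes(n, U)  # vectors of length u
b_bytes = dense_bytes(n, V)  # vectors of length v
reduction = U / V            # per-sequence size reduction factor

print(f"length-u vectors: {w_bytes / 1e9:.1f} GB")
print(f"length-v vectors: {b_bytes / 1e6:.1f} MB")
print(f"reduction factor: {reduction:.1f}x")
```

Under these assumptions the compact vectors are roughly 267 times smaller, which is why any downstream matrix decomposition or distance computation on them is correspondingly cheaper.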

Summary

Introduction

Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Several studies have successfully used alignment-free methods for the comparative analysis of complete genomes and other large biological sequence data sets [4,5,6,7,8,9,10,11,12,13], but further investigation of these techniques is still needed to ascertain their effectiveness. SWeeP represents protein sequences as compact vectors by projecting k-mer sets onto a randomly oriented quasi-orthonormal base, with a sufficient number of coordinates to maintain intersequence comparisons.

Results

Conclusion