GRASP2: Fast and memory-efficient gene-centric assembly and homolog search

Cuncong Zhong, Shibu Yooseph, Youngik Yang

doi:10.1109/iccabs.2017.8114296

Abstract

A crucial task in metagenomic analysis is to annotate the function and taxonomy of the sequencing reads generated from a microbiome sample. In general, the reads can either be assembled into contigs and searched against reference databases, or searched individually without assembly. The first approach may suffer from the fragmentary and incomplete nature of nucleotide sequence assembly, while the second is hampered by the limited functional signal that a short read can carry. To tackle these issues, we previously developed GRASP (Guided Reference-based Assembly of Short Peptides), which accepts a reference protein sequence as input and aims to assemble its homologs from a database of fragmentary protein sequences. In addition to being a gene-centric assembly tool, GRASP also serves as a homolog search tool by using the assembled protein sequences as templates to recruit reads. GRASP has significantly higher sensitivity (60–80% vs. 30–40%) than other homolog search tools such as BLAST. However, GRASP is time- and space-consuming compared to these tools, and does not scale to large datasets. Subsequently, we developed GRASPx, which is 30-fold faster than GRASP. Here, we present a completely redesigned algorithm, GRASP2, for this computational problem. GRASP2 uses the Burrows-Wheeler Transform (BWT) to assist with assembly graph generation, and employs a fast ungapped alignment strategy to avoid unnecessary traversal of non-homologous paths in the assembly graph, thereby reducing the search space. GRASP2 is 8-fold faster than GRASPx (and 250-fold faster than GRASP) and uses 8-fold less memory while maintaining the original high sensitivity of GRASP, which makes GRASP2 a useful tool for metagenomic data analysis. GRASP2 is implemented in C++ and is freely available from http://www.sourceforge.net/projects/grasp2.
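To make the role of the BWT more concrete, the following is a minimal Python sketch of the general technique: building the transform of a short peptide string and counting exact occurrences of a query with backward search. It is an illustration only and not the GRASP2 implementation. The abstract does not describe how GRASP2 builds or queries its index, so the function names `bwt` and `count_occurrences`, the `$` sentinel, and the example peptide are all assumptions made for this sketch, and the naive construction shown here would not scale to real read sets.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (toy version).

    Assumes the sentinel does not occur in the text and sorts before
    every other character (true for '$' versus uppercase amino acid codes).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text.

    Uses backward search over the BWT. Rank queries are answered by a
    naive scan here; practical indexes use precomputed rank tables.
    """
    # first[c] = number of characters in the text strictly smaller than c
    counts = Counter(bwt_str)
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:
        # Number of occurrences of c in bwt_str[:i]
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    # Hypothetical peptide fragment, used only to exercise the functions.
    peptides = "MKVLAAGMKVL"
    index = bwt(peptides)
    print(count_occurrences(index, "MKVL"))
```

In a gene-centric assembler, exact-match queries of this kind are one way to find reads that share a suffix-prefix overlap, which is the basic operation needed to extend a path in an assembly graph. Whether GRASP2 uses the index in exactly this way is not stated in the abstract.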

Full Text