Comparison of Short-Read Sequence Aligners Indicates Strengths and Weaknesses for Biologists to Consider.

Ryan Musich,Michael V Osier,Lance Cadle-Davidson

doi:10.3389/fpls.2021.657240


Open Access

https://doi.org/10.3389/fpls.2021.657240


Abstract

Aligning short-read sequences is the foundational step to most genomic and transcriptomic analyses, but not all tools perform equally, and choosing among the growing body of available tools can be daunting. Here, to increase awareness in the research community, we discuss the merits of common algorithms and programs in a way that should be approachable to biologists with limited experience in bioinformatics. We consider only in passing the effects of data cleanup, a precursor analysis to most alignment tools, and give no consideration to downstream processing of the aligned fragments. To compare aligners [Bowtie2, Burrows-Wheeler Aligner (BWA), HISAT2, MUMmer4, STAR, and TopHat2], an RNA-seq dataset was used containing data from 48 geographically distinct samples of the grapevine powdery mildew fungus Erysiphe necator. Based on alignment rate and gene coverage, all aligners performed well with the exception of TopHat2, which HISAT2 superseded. BWA perhaps had the best performance in these metrics, except for longer transcripts (>500 bp), for which HISAT2 and STAR performed well. HISAT2 was ~3-fold faster than the next fastest aligner in runtime, which we consider a secondary factor in most alignments. Ultimately, this direct comparison of commonly used aligners illustrates key considerations when choosing which tool to use for the specific sequencing data and objectives. No single tool meets all needs for every user, and there are many quality aligners available.

Highlights

Sequence aligning tools, which determine where small sequence fragments align to a larger “reference” genome or transcriptome sequence, are an essential part of any toolkit for modern whole-genome and transcriptome analyses
Determining a fragment’s location in the reference allows for diverse applications, ranging from agricultural benefits like identifying how abiotic stresses can protect a crop from a fungus (Weldon et al, 2019) to discovering vulnerabilities and susceptibilities in a novel human virus such as COVID-19 (Kim et al, 2020)
For each of the 48 samples, the alignment rate (the percentage of sequenced reads that were successfully mapped to the reference genome) was tracked for every aligner used
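
The alignment rate described in the last highlight is a simple ratio, and a minimal Python sketch may make the definition concrete. The function name `alignment_rate` and the read counts in the usage lines are hypothetical illustrations, not values or code from the study; in practice the two counts would come from each aligner's own summary report.

```python
def alignment_rate(mapped_reads: int, total_reads: int) -> float:
    """Return the percentage of sequenced reads that mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# Hypothetical counts for one sample and one aligner (not data from the paper).
rate = alignment_rate(mapped_reads=900, total_reads=1000)
print(f"alignment rate: {rate:.1f}%")
```

Computing this value per sample and per aligner gives the kind of table on which a comparison across the 48 samples could be based.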

Summary

Introduction

Sequence aligning tools, which determine where small sequence fragments align to a larger “reference” genome or transcriptome sequence, are an essential part of any toolkit for modern whole-genome and transcriptome analyses. Although effective for indexing, suffix trees are known in the computing world to require a large amount of memory to build, with the human genome needing roughly 45 GB of space in suffix tree form (Kurtz et al, 2004). This large memory usage was a major drawback for early aligners, as these tools would struggle to run even on today’s computers and would be reserved for use on research servers only. Reducing memory usage therefore became the major goal of later tools and led to the FM-Index becoming the main data structure used by most of today’s aligners.
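
To illustrate the idea behind the FM-Index mentioned above, the following is a minimal Python sketch of counting exact pattern matches by backward search over a Burrows-Wheeler transform. It is a teaching example under stated assumptions, not the implementation used by any of the aligners compared here: all function names are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so this version does not achieve the memory savings that production tools obtain through sampling and compression.

```python
def build_bwt(text: str) -> str:
    """Return the Burrows-Wheeler transform of text with a '$' sentinel appended."""
    text = text + "$"  # '$' sorts before A, C, G, T in ASCII
    # Naive suffix array: fine for short strings, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    return "".join(text[i - 1] for i in suffix_array)


def build_fm_tables(bwt: str):
    """Build the two lookup tables an FM-Index needs for backward search."""
    alphabet = sorted(set(bwt))
    # first[c]: number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i] (uncompressed here).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ


def count_matches(text: str, pattern: str) -> int:
    """Count exact occurrences of pattern in text using backward search."""
    bwt = build_bwt(text)
    first, occ = build_fm_tables(bwt)
    lo, hi = 0, len(bwt)  # half-open range of matching rows in the sorted suffixes
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Toy reference and read, chosen only for illustration.
print(count_matches("GATTACAGATTACA", "GATTACA"))
```

The search cost depends on the length of the read rather than the length of the reference, which is one reason this family of indexes suits mapping many short reads against a single large genome.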

Methods

Results

Conclusion