Abstract
Alignment-free methods, which are more time- and memory-efficient than alignment-based methods, have been widely used for comparing genome sequences or raw sequencing samples without assembly. However, in this study, we show that alignment-free dissimilarity calculated from sequencing samples can be overestimated compared with the dissimilarity calculated from the corresponding genomes, and that this bias can significantly decrease the performance of alignment-free analysis. Here, we introduce a new alignment-free tool, Alignment-Free methods Adjusted by Neural Network (Afann), that successfully adjusts for this bias and achieves excellent performance on various independent datasets. Afann is freely available at https://github.com/GeniusTang/Afann.
Highlights
With the advent of next-generation sequencing (NGS) technologies, enormous amounts of sequence data are emerging rapidly.
Because background-adjusted dissimilarity measures have been shown to outperform other methods on problems ranging from evolutionary distance estimation [14] to virus-host interaction prediction [15], geographic location prediction [12], horizontal gene transfer detection [16], and metagenome and metatranscriptome comparison [10, 17], we focused in this study on bias adjustment for two background-adjusted dissimilarity measures, d2S and d2*.
We evaluated the performance of Skmer [8], a recent alignment-free method that corrects the Mash distance formula for NGS samples by estimating the sequencing depth and sequencing error rate, on the same primate dataset using kmer length K = 21 and sketch size s = 10^7.
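For context on the Mash distance mentioned above, the following is a minimal sketch of the standard, uncorrected Mash formula, which converts a Jaccard index j of two kmer sets into a distance D = -(1/k) ln(2j / (1 + j)). It does not reproduce Skmer's corrections for sequencing depth or error rate. The function names are hypothetical, the Jaccard index is computed exactly rather than estimated from a MinHash sketch, and returning 1 when no kmers are shared is an assumed convention.

```python
from math import log


def kmer_set(seq, k):
    """Return the set of overlapping kmers in seq (hypothetical helper)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two sets; Mash estimates this from sketches."""
    union = a | b
    if not union:
        raise ValueError("both kmer sets are empty")
    return len(a & b) / len(union)


def mash_distance(j, k):
    """Uncorrected Mash distance for Jaccard index j and kmer length k."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard index must lie in [0, 1]")
    if j == 0.0:
        # Assumed convention: no shared kmers maps to the maximum distance.
        return 1.0
    return -log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    x = kmer_set("ACGTACGTTTGACCAGT", 4)
    y = kmer_set("ACGTACGATTGACCAGT", 4)
    print(mash_distance(jaccard(x, y), 4))
```

On low-coverage or error-prone reads the observed Jaccard index drops, so this uncorrected distance grows, which is the kind of bias that sample-aware corrections aim to remove.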
Summary
With the advent of next-generation sequencing (NGS) technologies, enormous amounts of sequence data are emerging rapidly. Although alignment-based approaches for sequence comparison are generally accurate and powerful, their application is being challenged by the size of sequence data, which is increasing at an exponential rate. Alignment-free methods, especially kmer-based approaches that use the frequencies of kmers (k-words or k-grams) for sequence comparison, can be naturally adapted to shotgun NGS data without assembly [4, 5, 8, 9, 10, 11, 12]. Zielezinski et al. [9] published a comprehensive comparison of 74 alignment-free methods for 5 research applications: cis-regulatory module detection, protein sequence classification, gene tree inference, genome-based phylogeny, and reconstruction of species trees under sequence rearrangements.
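To make the kmer-frequency idea concrete, here is a minimal sketch of a background-adjusted dissimilarity in the style of d2S. It counts overlapping kmers, subtracts the counts expected under a background model, and compares the centered count vectors. This is an illustration and not Afann's implementation: the function names are hypothetical, the background is assumed to be an i.i.d. (order-0) model estimated from each sequence's own base frequencies, and no neural-network bias adjustment is applied.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping kmers that contain only A, C, G, T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(BASES):
            counts[word] += 1
    return counts


def centered_counts(seq, k):
    """Observed minus expected kmer counts under an order-0 background."""
    seq = seq.upper()
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    base_counts = Counter(c for c in seq if c in BASES)
    n_bases = sum(base_counts.values())
    if total == 0 or n_bases == 0:
        raise ValueError("sequence is too short or has no valid kmers")
    freq = {b: base_counts[b] / n_bases for b in BASES}
    centered = {}
    for word in map("".join, product(BASES, repeat=k)):
        prob = 1.0
        for b in word:
            prob *= freq[b]  # i.i.d. assumption: multiply base frequencies
        centered[word] = counts[word] - total * prob
    return centered


def d2s_dissimilarity(seq_x, seq_y, k):
    """d2S-style dissimilarity in [0, 1]; 0 means identical centered profiles."""
    x = centered_counts(seq_x, k)
    y = centered_counts(seq_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        scale = sqrt(x[word] ** 2 + y[word] ** 2)
        if scale == 0.0:
            continue  # this kmer carries no signal in either sequence
        num += x[word] * y[word] / scale
        norm_x += x[word] ** 2 / scale
        norm_y += y[word] ** 2 / scale
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("centered counts are all zero")
    return 0.5 * (1.0 - num / (sqrt(norm_x) * sqrt(norm_y)))


if __name__ == "__main__":
    print(d2s_dissimilarity("ACGTACGTTTGACCAGTA", "TTGACCAACGTAGGCATA", 2))
```

The same functions could be fed concatenated reads instead of an assembled genome, which is the setting where sample-based and genome-based dissimilarities can diverge.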