Abstract

Alignment-free methods, which are more time- and memory-efficient than alignment-based methods, have been widely used for comparing genome sequences or raw sequencing samples without assembly. However, in this study, we show that alignment-free dissimilarity calculated from sequencing samples can be overestimated compared with the dissimilarity calculated from their genomes, and that this bias can significantly decrease the performance of alignment-free analysis. Here, we introduce a new alignment-free tool, Alignment-Free methods Adjusted by Neural Network (Afann), that successfully adjusts this bias and achieves excellent performance on various independent datasets. Afann is freely available at https://github.com/GeniusTang/Afann.

Highlights

  • With the advent of next-generation sequencing (NGS) technologies, enormous amounts of sequence data are emerging rapidly

  • Since background-adjusted dissimilarity measures have been shown to outperform other methods for solving different problems ranging from evolutionary distance estimation [14] to virus-host interaction prediction [15], geographic location prediction [12], horizontal gene transfer detection [16], and metagenome and metatranscriptome comparison [10, 17], we focused on the bias adjustment for two background-adjusted dissimilarity measures d2s and d2∗ in this study

  • We evaluated the performance of Skmer [8], a recent alignment-free method that corrects the formula of Mash distance based on NGS samples by estimating the sequencing depth and sequencing error rate, on the same primate dataset using kmer length K = 21 and sketch size s = 10^7


Introduction

With the advent of next-generation sequencing (NGS) technologies, enormous amounts of sequence data are emerging rapidly. Although alignment-based approaches for sequence comparison are generally accurate and powerful, their applications are being challenged by the size of sequence data, which increases at an exponential rate. Alignment-free methods, especially kmer-based approaches that use the frequencies of kmers (k-words or k-grams) for sequence comparison, can be naturally adapted to shotgun NGS sequencing data without assembly [4, 5, 8, 9, 10, 11, 12]. Zielezinski et al. [9] published a comprehensive comparison of 74 alignment-free methods for 5 research applications, including cis-regulatory module detection, protein sequence classification, gene tree inference, genome-based phylogeny, and reconstruction of species trees under sequence rearrangements.
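
To make the idea of kmer-based comparison concrete, the following is a minimal sketch in Python. It pools kmer counts over a collection of sequences (a single genome or a set of unassembled reads), converts them to relative frequencies, and compares two frequency profiles. The function names `kmer_frequencies` and `cosine_dissimilarity` are hypothetical, and the cosine dissimilarity is used only as a simple stand-in: it is not the background-adjusted d2s or d2∗ measure studied in the paper, and it is not taken from Afann.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable

VALID_BASES = set("ACGT")


def kmer_frequencies(seqs: Iterable[str], k: int) -> Dict[str, float]:
    """Pool kmer counts over all sequences and return relative frequencies.

    `seqs` may hold one genome or many reads; no assembly is required.
    """
    counts: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip kmers containing ambiguous symbols such as N.
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_dissimilarity(f: Dict[str, float], g: Dict[str, float]) -> float:
    """Illustrative dissimilarity: 1 minus the cosine of two kmer profiles."""
    dot = sum(value * g.get(kmer, 0.0) for kmer, value in f.items())
    norm_f = sqrt(sum(v * v for v in f.values()))
    norm_g = sqrt(sum(v * v for v in g.values()))
    if norm_f == 0.0 or norm_g == 0.0:
        return 1.0  # Assumption: an empty profile is treated as maximally dissimilar.
    return 1.0 - dot / (norm_f * norm_g)


if __name__ == "__main__":
    genome = ["ACGTACGTTGCA"]
    reads = ["ACGTAC", "GTACGT", "CGTTGC", "GTTGCA"]  # toy reads from the genome
    k = 3
    print(cosine_dissimilarity(kmer_frequencies(genome, k),
                               kmer_frequencies(reads, k)))
```

Because the reads cover the genome unevenly, the two profiles generally differ even though they come from the same sequence, which is the kind of sample-versus-genome discrepancy the abstract describes.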

