The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome.

Meznah Almutairy, Eric Torng

doi:10.1371/journal.pone.0179046

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a tool built on a k-mer index, such as BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing both the space needed and the query time is sampling, in which only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at the cost of reduced query accuracy. We focus on finding highly similar local alignments (HSLAs) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.

Highlights

We study the problem of finding the best sampling strategy for creating k-mer indexes that are simultaneously efficient and accurate.
We focus on continuous seeds because they minimize the number of false positives without compromising retention rate when searching for highly similar local alignments (HSLAs).
We study the effects of using BLAST with soft sampling when searching for HSLAs and mapping ESTs onto the human genome.

Summary

Introduction

We study the problem of finding the best sampling strategy for creating k-mer indexes that are simultaneously efficient and accurate. More specifically, we study the process of searching for all highly similar local alignments (HSLAs) between a query sequence and a database of sequences. This is a fundamental operation for a wide variety of biological applications, including homology search [1,2,3,4], detection of single nucleotide polymorphisms (SNPs) [5,6,7], and mapping cDNA sequences against the corresponding genome [8,9,10]. HSLAs are commonly used in applications that compare sequences within the same species or between closely related species, and we restrict our study's database to the human genome.
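The trade-off behind sampling can be illustrated with a minimal sketch. It is not how BLAST builds its lookup table and it is not the sampling scheme evaluated in this study. It assumes fixed-step sampling of database positions and exact (mismatch-free) matches, and the function names `build_sampled_index`, `find_seed_hits`, and `index_size` are hypothetical. Under those assumptions, an exact match of length L contains L - k + 1 consecutive k-mer start positions in the database, so any step of at most L - k + 1 is guaranteed to keep at least one of them (hard sampling). A larger step shrinks the index further but can miss the match (soft sampling).

```python
from collections import defaultdict


def build_sampled_index(db: str, k: int, step: int) -> dict:
    """Index only the k-mers that start at positions 0, step, 2*step, ... of db.

    step == 1 stores every k-mer occurrence (no sampling).
    Fixed-step sampling is an assumption made for this illustration.
    """
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, step):
        index[db[pos:pos + k]].append(pos)
    return dict(index)


def find_seed_hits(index: dict, query: str, k: int) -> list:
    """Look up every k-mer of the query and return (query_pos, db_pos) seed hits."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, dpos))
    return hits


def index_size(index: dict) -> int:
    """Number of stored k-mer occurrences, a proxy for index space."""
    return sum(len(positions) for positions in index.values())


if __name__ == "__main__":
    db = "TTGACCATGGCATCGATTACGGA"   # toy database sequence
    query = db[5:11]                 # exact match of length L = 6
    k = 4                            # hard sampling needs step <= L - k + 1 = 3
    for step in (1, 3, 8):
        idx = build_sampled_index(db, k, step)
        print(step, index_size(idx), find_seed_hits(idx, query, k))
```

In this toy case, steps 1 and 3 both yield at least one seed hit for the query, while step 8 stores fewer occurrences and yields none. Real HSLAs may contain mismatches and gaps, which this sketch does not model.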

Objectives

Results

Conclusion