Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models

Mustafa Abdallah, Ashraf Mahgoub, Hany Ahmed, Somali Chaterji

doi:10.1038/s41598-019-52196-4


Open Access

https://doi.org/10.1038/s41598-019-52196-4


Abstract

The performance of most error-correction (EC) algorithms that operate on genomic reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We do this adaptively for each dataset and EC tool, because the optimal configuration parameters differ across datasets (i.e., across sequencing platforms and species) and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with very high accuracy for 7 real datasets using 3 different k-mer based EC algorithms: Lighter, Blue, and Racer.
The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute-force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With the best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
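The search described in the abstract can be illustrated with a minimal sketch. It is not Athena's implementation: it assumes a character-level N-Gram model with add-one smoothing over the alphabet A, C, G, T, and a plain one-step hill climb over k. The names `train_ngram`, `perplexity`, `hill_climb_k`, and the `score` callable are hypothetical and introduced only for this illustration.

```python
import math
from collections import Counter

ALPHABET_SIZE = 4  # assumption: reads contain only A, C, G, T


def train_ngram(reads, n=3):
    """Count n-grams and their (n-1)-character contexts over the reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def perplexity(reads, model, n=3):
    """Perplexity of the reads under the model (add-one smoothing)."""
    ngrams, contexts = model
    log_sum, total = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + ALPHABET_SIZE)
            log_sum += math.log2(p)
            total += 1
    return 2 ** (-log_sum / total) if total else float("inf")


def hill_climb_k(score, k_min, k_max, start):
    """Move to the neighboring k with the lower score until none improves.

    `score(k)` is a hypothetical callable returning the perplexity obtained
    with parameter k; lower is treated as better.
    """
    cache = {}

    def cached(k):
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    k = start
    while True:
        neighbors = [c for c in (k - 1, k + 1) if k_min <= c <= k_max]
        if not neighbors:
            return k
        best = min(neighbors, key=cached)
        if cached(best) >= cached(k):
            return k  # local optimum: no neighbor lowers the perplexity
        k = best
```

In a real pipeline, `score(k)` would presumably run an EC tool with parameter k on a sample of reads and return the perplexity of the corrected sample; that wiring is tool-specific and is left out here. Like any hill climb, this sketch can stop at a local optimum.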

Highlights

Introduction: The majority of error correction tools share the following intuition: high-fidelity sequences (or, solid sequences) can be used to correct errors in low-fidelity sequences (or, in-solid sequences).
Error correction and evaluation: The majority of error correction tools share the following intuition: high-fidelity sequences can be used to correct errors in low-fidelity sequences.
We start by randomly collecting 100K short reads from the reference genome for two organisms used in the real datasets: E. coli (D1, D2) and Acinetobacter (D3).

Summary

Introduction

The majority of error correction tools share the following intuition: high-fidelity sequences (or, solid sequences) can be used to correct errors in low-fidelity sequences (or, in-solid sequences). They vary significantly in the way they differentiate between solid and in-solid sequences. For example, the tool in [4] corrects genomic reads containing in-solid k-mers using a minimum number of edit operations such that these reads contain only solid k-mers after correction. The evaluation of de novo sequencing techniques relies on likelihood-based metrics such as ALE [15] and CGAL [16], without relying on the availability of a reference genome. Comparative sequencing or re-sequencing, such as to study structural variations between two genomes, does have reference genomes available.
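To make the solid versus in-solid distinction concrete, here is a minimal sketch of one common convention: a k-mer is treated as solid if it occurs at least a threshold number of times in the read set. This is an assumption made for illustration and not necessarily the rule used by the tool in [4] or by any other specific tool; the function names and the threshold value are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_solid(counts, threshold):
    """Split k-mers into solid (count >= threshold) and in-solid sets."""
    solid = {kmer for kmer, c in counts.items() if c >= threshold}
    insolid = set(counts) - solid
    return solid, insolid


# Toy example: the third read differs from the others by one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
solid, insolid = split_solid(kmer_counts(reads, k=4), threshold=2)
```

Here the k-mers that appear only in the third read fall into the in-solid set, which is the signal a corrector would act on. The choice of k changes which k-mers clear the threshold, which is why the parameter is performance-sensitive.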

Objectives

Results

Discussion

Conclusion