Optimizing k-mer size using a variant grid search to enhance de novo genome assembly.

Soyeon Cha, David Mckbird

doi:10.6026/97320630012036

Abstract

Largely driven by huge reductions in per-base costs, sequencing nucleic acids has become a near-ubiquitous technique in laboratories performing biological and biomedical research. Most of the effort goes to re-sequencing, but assembly of de novo-generated, raw sequence reads into contigs that span as much of the genome as possible is central to many projects. Although truly complete coverage is not realistically attainable, maximizing the amount of sequence that can be correctly assembled into contigs contributes to coverage. Here we compare three commonly used assembly algorithms (ABySS, Velvet and SOAPdenovo2), and show that empirical optimization of k-mer values has a disproportionate influence on de novo assembly of a eukaryotic genome, the nematode parasite Meloidogyne chitwoodi. Each assembler was challenged with about 40 million Illumina II paired-end reads, and assemblies were performed under a range of k-mer sizes. In each instance, the optimal k-mer size was 127, although based on N50 values, ABySS was more efficient than the others. That the assembly was not spurious was established using the “Core Eukaryotic Gene Mapping Approach”, which indicated that 98.79% of the M. chitwoodi genome was accounted for by the assembly. Subsequent gene finding and annotation are consistent with this and suggest that k-mer optimization contributes to the robustness of assembly.

Highlights

The progression of technology from Sanger sequencing to the current “next-generation” platforms has heralded striking reductions in the cost of generating data
FASTQ files were obtained from the European Nucleotide Archive (ENA) and were de novo assembled by ABySS using k-mer sizes chosen by KmerGenie and Velvet advisor as well as by our empirical methods
Results & Discussion: Illumina sequencing yielded a total of 42,011,068 paired-end sequence reads (21,005,534 from each end), occupying 27.5 gigabytes in FASTQ format

Summary

Introduction

The progression of technology from Sanger sequencing to the current “next-generation” platforms has heralded striking reductions in the cost of generating data. Sequencing comes in two forms, distinguished by their needs for assembly into a contiguous reconstruction of a larger molecule. Most prevalent are various forms of “re-sequencing”, in which the sequencing reads are aligned with a reference genome to reveal bases that are polymorphic between samples. The other mode is the assembly of de novo-generated, raw sequence reads into contigs that are, as closely as possible, a full accounting of the genome of the organism in question. Reference-free assembly is based on stacking overlapping sequences of genomic fragments of a defined size (the k-mer), generated by breaking each read into overlapping fragments of length k. We examined three commonly used assembly platforms, and showed that optimization of k-mer values has a disproportionate influence on de novo assembly of a eukaryotic genome.
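As a rough illustration of what "breaking each read into k-mers" means, the sketch below slides a window of length k along a read and tallies the resulting fragments. It is a minimal example, not the procedure used by ABySS, Velvet or SOAPdenovo2; the function names `kmers` and `count_kmers` and the toy reads are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, Iterator


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k from a read.

    A read of length L yields L - k + 1 k-mers; a read shorter than k
    yields none.
    """
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Tally how often each k-mer occurs across a collection of reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    # Hypothetical toy reads, far shorter than real Illumina reads.
    toy_reads = ["ACGTAC", "CGTACG"]
    print(list(kmers(toy_reads[0], 3)))
    print(count_kmers(toy_reads, 3))
```

Because each read of length L contributes L - k + 1 fragments, a larger k produces fewer but more specific k-mers per read, which is one reason the choice of k can affect an assembly so strongly.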

Objectives

Methods

Results

Conclusion