Abstract

Motivation: Next Generation Sequencing (NGS) is frequently applied to detect sequence variations between highly related genomes. Recent large-scale re-sequencing studies such as the Human 1000 Genomes Project use low-coverage NGS data to afford sequencing of hundreds of individuals. Here, SNPs and micro-indels can be detected by applying an alignment-consensus approach. However, computational methods capable of discovering other variations, such as novel insertions or highly diverged sequence, from low-coverage NGS data are still lacking.

Results: We present LOCAS, a new NGS assembler designed specifically for low-coverage assembly of eukaryotic genomes using a mismatch-sensitive overlap-layout-consensus approach. LOCAS assembles homologous regions in a homology-guided manner, while it performs de novo assemblies of insertions and highly polymorphic target regions subsequent to an alignment-consensus approach. LOCAS has been evaluated in homology-guided assembly scenarios with low sequence coverage of Arabidopsis thaliana strains sequenced as part of the Arabidopsis 1001 Genomes Project. While assembling the same amount of long insertions as state-of-the-art NGS assemblers, LOCAS showed the best results regarding contig size, error rate and runtime.

Conclusion: LOCAS produces excellent results for homology-guided assembly of eukaryotic genomes with short reads and low sequencing depth, and therefore appears to be the assembly tool of choice for the detection of novel sequence variations in this scenario.

Introduction

Since the introduction of the first Next Generation Sequencing (NGS) technology in 2005, the throughput and cost-efficiency of sequencing have greatly increased and continue to do so. To afford genome sequencing of hundreds to thousands of individuals, large-scale re-sequencing projects like the Human 1000 Genomes Project utilize low-coverage sequencing to a depth of less than 5× [1], followed by mapping of reads to a known reference genome from the same species. This alignment-consensus approach, used by e.g. SOAPsnp [2], VAAL [3], MAQ [4], Pyrobayes [5], SHORE [6] or SHRiMP [7], is capable of detecting sequence variants like single-nucleotide polymorphisms (SNPs) or small insertions or deletions (micro-indels) [8,9]. Various approaches to estimate copy number variants and other large rearrangements (commonly referred to as structural variants) [4,6,10] from read quantity or mate-pair data have been introduced, but these strategies do not reveal additional sequence information.
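To make the alignment-consensus idea concrete, the following is a minimal sketch of how SNPs could be called from reads already mapped to a reference. It is not the method of any tool cited above: the function name `call_snps`, the `min_depth` parameter, the simple majority rule, and the toy data are all assumptions chosen for illustration, and alignments are assumed to be ungapped.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=2):
    """Call SNPs by majority vote over a pileup (illustrative sketch only).

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first base of an ungapped alignment.
    """
    # Count the bases observed at every reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to call anything at low coverage
        base, support = counts.most_common(1)[0]
        # Report a SNP when a strict majority of reads disagrees with the reference.
        if base != reference[pos] and support * 2 > depth:
            snps.append((pos, reference[pos], base))
    return snps


# Hypothetical toy data: three of the four reads covering position 4 carry T instead of A.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTAC"), (4, "TCGTAC")]
snps = call_snps(reference, reads)
print(snps)
```

The sketch also shows the limitation described above: a read drawn from a novel insertion has no position on the reference, so it never enters the pileup and its sequence cannot be recovered by consensus calling alone.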
