Assembly and Data Quality

Christoph Bleidorn

doi:10.1007/978-3-319-54064-1_5

Abstract

This chapter describes methods for assembling sequence reads into larger pieces. In many cases, the raw data produced by sequencing machines are images, which are translated into sequence reads in a subsequent analysis step (base calling). Each position of a sequence read receives a quality score indicating the probability of a sequencing error. After quality filtering and trimming of adapter regions or barcoding indices, these reads can be assembled de novo into larger pieces. Three basic types of assembly strategies are in use: greedy algorithms, overlap-layout-consensus assemblers and methods relying on k-mer graphs. The contiguous sequences produced from overlapping reads are called contigs. Positional information from paired-end reads or mate pairs can be used to order contigs into scaffolds. In the ideal case of genome sequencing, the number of scaffolds would equal the number of expected chromosomes. Several statistics can be used to describe or compare different sequence assemblies. Generally, a range of programs and parameter settings should be explored to find the best assembly. Different strategies are used for genome, transcriptome and metagenome assemblies, and all of them benefit greatly from the inclusion of long reads. Assembly methods are becoming an increasingly important tool for everybody working with sequence data, since the vast majority of published sequence data in NCBI GenBank is deposited as short reads in the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/). These data are usually not directly searchable by methods like BLAST and need to be assembled for subsequent analysis.
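Two of the quantities mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: quality scores are taken to be Phred scores (Q = -10 log10 p) stored as ASCII characters with an offset of 33, as in current FASTQ files, and N50 is used as one example of an assembly statistic. The function names are hypothetical and chosen for illustration only.

```python
def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string into Phred scores.

    Assumes the Phred+33 ASCII encoding; older files may use an offset of 64.
    """
    return [ord(char) - offset for char in qual]


def phred_to_error_probability(q: int) -> float:
    """Return the error probability p for a Phred score Q, using Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def n50(lengths: list[int]) -> int:
    """Return the N50 of a set of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    scores = decode_quality_string("I5!")
    for q in scores:
        print(q, phred_to_error_probability(q))
    print(n50([100, 80, 50, 30, 20, 10]))
```

Under these assumptions, a Phred score of 20 corresponds to an error probability of 1 in 100 and a score of 40 to 1 in 10,000. N50 alone says nothing about the correctness of an assembly, so it is best read alongside other statistics.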

Full Text