Estimation and Modelling of Errors in the Library Preparation Stage of Next Generation Sequencing

Nathan Beka

doi:10.18745/th.23909

Abstract

Next-generation sequencing has empowered genomics by making it possible to sequence genomes at lower cost and in less time than the traditional Sanger method. However, these improvements are accompanied by reduced accuracy compared with the Sanger method. During the library preparation stage of sequencing, artefacts can be introduced that affect the reliability of a read. These artefacts can arise from biases due to the structure of the genome, such as preferential splitting of DNA between specific nucleotides, bias of adapter ligation towards certain base pair identities, and temperature-dependent denaturation due to nucleotide composition. To investigate these issues, a library preparation model was developed to simulate the occurrence of such artefacts and investigate their effects. The implemented model simulates the DNA fragmentation, adapter ligation and PCR amplification stages of the library preparation process. A set of parameters characterizing these steps and a DNA sequence are used as input, and the output is an array of values representing the number of DNA fragments that cover each position of the input sequence (“coverage”). To validate the model, a Genetic Algorithm (GA) was used to find parameters that lead to coverage values closely matching those found in empirical sequencing data. The GA was able to find such parameters for a subsection of the Mycobacterium tuberculosis and Plasmodium falciparum genomes but failed when applied to the TP53 gene of the Homo sapiens genome. From this it was deduced that the model was better at predicting coverage when applied to genomes with subregions of nucleotide repeats. To find the effects of the parameters representing each step of the library preparation process, the model was applied to a set of in silico generated DNA sequences representing different sequence structures (GC-rich, AT-rich, neutral composition, and a sequence with specific areas of GC-rich and AT-rich repeats).
My study found that the parameters for the fragmentation, adapter ligation and PCR steps affected coverage. I also found that combinations of parameters between consecutive steps further affected coverage. In the fragmentation step, large fragment size had a negative effect on coverage (p = 0.0); in the adapter ligation step, coverage of AT-rich sequences was affected by a terminal bias (p = 0.0). Modifying parameters for the PCR step affected the coverage of both GC-rich and AT-rich sequences due to a temperature-dependent bias. Finally, an interaction between the parameters of fragmentation and those of other steps was found to further reduce coverage. The simulation was thus able to suggest parameters that need to be fine-tuned to improve coverage.
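
As an illustration of the coverage output defined in the abstract (the number of DNA fragments covering each position of the input sequence), the following is a minimal Python sketch. It is not the thesis implementation: the function name `coverage_from_fragments` and the representation of fragments as half-open `(start, end)` index pairs are assumptions made only for this example, and the fragmentation, adapter ligation and PCR steps are not modelled here.

```python
from typing import List, Tuple


def coverage_from_fragments(seq_len: int,
                            fragments: List[Tuple[int, int]]) -> List[int]:
    """Return, for each position of a sequence of length seq_len,
    the number of fragments that cover it.

    Assumption for this sketch: each fragment is a half-open
    interval (start, end) in 0-based sequence coordinates.
    """
    # Difference array: +1 where a fragment starts, -1 just past its end.
    diff = [0] * (seq_len + 1)
    for start, end in fragments:
        if not (0 <= start < end <= seq_len):
            raise ValueError(f"fragment {(start, end)} is outside the sequence")
        diff[start] += 1
        diff[end] -= 1

    # A running sum over the difference array gives per-position coverage.
    coverage = []
    running = 0
    for delta in diff[:seq_len]:
        running += delta
        coverage.append(running)
    return coverage


# Two hypothetical fragments on a 6-base sequence, overlapping at position 2.
example = coverage_from_fragments(6, [(0, 3), (2, 6)])
```

With these two hypothetical fragments, `example` is `[1, 1, 2, 1, 1, 1]`: only position 2 is covered by both. The difference-array form is chosen so that the cost grows with the number of fragments plus the sequence length rather than with their product.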

Full Text