Real-time analysis for Nanopore sequencing data

Hoang Son Nguyen

doi:10.14264/uql.2019.642

Abstract

The introduction of third-generation sequencing technologies presents many challenges to the traditional methods in which bioinformatics is applied to genomics data. The MinION thumb-drive sequencing device has gained great attention from researchers worldwide. Despite its relatively high error rates compared to second-generation sequencing technologies, the nanopore approach is widely considered to be a turning point due to its ability to decode much longer reads (tens or even hundreds of kilobase pairs) in a real-time fashion.

Prior to the work presented in this thesis, there was no available method that could scaffold and finish assemblies in real time while the nanopore sequencing run is still in progress. Such a method is desirable because it offers the opportunity to obtain analysis results as soon as sufficient data are generated. With real-time analysis, answers to questions of interest could be obtained in situ, in an automated manner that saves a considerable amount of time and resources compared to the conventional approach of sending the sample to a sequencing center, waiting for bulk data, and conducting a batch analysis. On top of that, streaming analysis can help to avoid both under-sequencing, which yields a low-quality assembly because insufficient data are generated, and over-sequencing, which generates more sequence data than required at greater cost.

For the aforementioned reasons, the motivation of this thesis project is to develop methodologies for streaming data analysis of long reads for real-time finishing of genome sequences. As an initial result, in Chapter 2, I introduce npScarf, which can scaffold and complete short-read assemblies alongside the long-read sequencing run.
This tool takes contigs as input, attempts to bridge them together using long-read data, and reports assembly metrics in real time so that the sequencing run can be terminated once an assembly of sufficient quality is obtained.

It is also desirable to extend the pipeline to multiple samples at the same time, through a parallel mechanism known as barcoded sequencing. In Chapter 3, npBarcode, a tool supporting real-time demultiplexing of nanopore sequencing data, is employed for that purpose. Depending on requirements, users can choose to run the dedicated demultiplexer from the command line or use it as part of npReader's GUI. The tool gives practitioners a flexible option to monitor a barcoded sequencing run as well as to integrate pooled sequencing into a streaming analysis pipeline. For example, in combination with npScarf, we can complete multiple genomes in parallel.

Users can also provide an underlying assembly graph structure from short-read assemblers for better quality. This approach is described in Chapter 4, in which a streaming algorithm is implemented together with a graphical user interface (GUI) in npGraph. The benefit of using an assembly graph stems from the fact that, by traversing the graph of the contigs' building blocks, we can reduce the number of mis-assemblies and errors in the final sequences.

Chapter 5 discusses another application of nanopore sequencing for decoding small genomes. By employing rolling circle amplification, long reads containing multiple copies of a given viral genome can be obtained. The repeated patterns can be identified by a detection module that can work even with raw signal data.
The developed modules, which allow for single-molecule genome assembly of small genomes, can be integrated in a streaming pipeline for real-time analyses as well.

In summary, my thesis project aims to develop and apply in-house tools that aid genome assembly and analysis in real time, in order to facilitate the application of nanopore sequencing to various use cases, including but not limited to microbial genomics.
