Pacific Biosciences Data Research Articles

The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, which is the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation. For this study, we leveraged publicly available data from sample NA12878 generated on Oxford Nanopore and sample NA24385 on Pacific Biosciences platforms. We employed state-of-the-art sequence alignment tools including GraphMap2, long-read aligner (LRA), Minimap2, CoNvex Gap-cost alignMents for Long Reads (NGMLR), and Winnowmap2. Minimap2 and Winnowmap2 were computationally lightweight enough for use at scale, while GraphMap2 was not. NGMLR took a long time and required many resources, but produced alignments each time. LRA was fast, but only worked on Pacific Biosciences data. The tools disagreed widely on which reads to leave unaligned, affecting the end genome coverage and the number of discoverable breakpoints. No alignment tool independently resolved all large structural variants (1,001-100,000 base pairs) present in the Database of Genomic Variants (DGV) for sample NA12878 or the truthset for NA24385. These results suggest a combined approach is needed for long-read sequencing (LRS) alignments for human genomics. Specifically, leveraging alignments from three tools will be more effective in generating a complete picture of genomic variability. It should be best practice to use an analysis pipeline that generates alignments with both Minimap2 and Winnowmap2, as they are lightweight and yield different views of the genome. Depending on the question at hand, the data available, and the time constraints, NGMLR and LRA are good options for a third tool. If computational resources and time are not a factor for a given case or experiment, NGMLR will provide another view, and another chance to resolve a case. LRA, while fast, did not work on the nanopore data on our cluster, but PacBio results were promising in that those computations completed faster than Minimap2. Due to its significant burden on computational resources and slow run time, GraphMap2 is not an ideal tool for exploration of a whole human genome generated on a long-read sequencing platform.
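The observation that aligners disagree on which reads to leave unaligned can be checked on one's own data with a small comparison like the one below. This is a minimal sketch, not the benchmark's actual pipeline: it assumes each tool's output is available as SAM text, relies only on the SAM FLAG bit 0x4 (segment unmapped), and uses hypothetical helper names (`unaligned_read_ids`, `summarize_disagreement`, `sam_line`) and toy records invented for illustration.

```python
from typing import Dict, Iterable, Set

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unaligned_read_ids(sam_lines: Iterable[str]) -> Set[str]:
    """Collect names of reads whose SAM FLAG has the unmapped bit set."""
    ids: Set[str] = set()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            ids.add(qname)
    return ids


def summarize_disagreement(unaligned: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Report reads unaligned by every tool and reads unaligned by one tool only."""
    tools = list(unaligned)
    summary = {"all_tools": set.intersection(*(unaligned[t] for t in tools))}
    for tool in tools:
        others = set().union(*(unaligned[o] for o in tools if o != tool))
        summary[f"only_{tool}"] = unaligned[tool] - others
    return summary


def sam_line(name: str, flag: int) -> str:
    """Build a minimal 11-column SAM record for the toy example (hypothetical data)."""
    if flag & UNMAPPED_FLAG:
        cols = [name, str(flag), "*", "0", "0", "*", "*", "0", "0", "ACGT", "*"]
    else:
        cols = [name, str(flag), "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "*"]
    return "\t".join(cols)


# Toy outputs for two aligners; the read names and flags are made up.
toy_outputs = {
    "minimap2": ["@HD\tVN:1.6", sam_line("r1", 0), sam_line("r2", 4), sam_line("r3", 4)],
    "winnowmap2": ["@HD\tVN:1.6", sam_line("r1", 4), sam_line("r2", 4), sam_line("r3", 0)],
}
unaligned = {tool: unaligned_read_ids(lines) for tool, lines in toy_outputs.items()}
summary = summarize_disagreement(unaligned)
```

Reads that appear under `only_<tool>` are the ones a second aligner may still place, which is the practical argument for running more than one tool.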

Several state-of-the-art, easy-to-use tools are available both for short-variant detection (e.g., GATK, FREEBAYES) and for structural variant (SV) detection (e.g., LUMPY, MANTA, DELLY), but these tools often produce divergent variant calls, especially INDELs, and it is very difficult to reconcile such variants into a single, accurate set. Furthermore, while it would be highly desirable to also detect larger SVs, existing SV detector packages are typically difficult to integrate, highly resource-intensive to run, and result in call sets that require expert manual review to reduce the false positive detection rate. Our algorithm, GRAPHITE (https://github.com/dillonl/graphite), requires as input a collection of variant calls made by one or more short-variant or SV detection tools. Typically, this starting set has high sensitivity (i.e., it is inclusive) but low specificity (i.e., it has a high false discovery rate). We then apply a novel “variant adjudication” procedure to discard false positives while keeping true positive calls. This is accomplished by constructing a graph from these variants (the Variant Graph), representing allelic variants as graph branches in addition to the branches formed by the current, linear genome reference sequence. Using a graph mapping algorithm we developed earlier (GSSW, a graph extension of the Smith-Waterman alignment algorithm), we re-map all reads from each of the samples contributing to the candidate calls. We retain candidate variants confirmed by mappings to those branches in the graph that represent the corresponding variant allele, and discard those candidates that were not confirmed by such mappings. This procedure results in a highly specific callset that also maintains the high sensitivity of the inclusive starting callset constructed by multiple primary variant calling methods.
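The retain-or-discard step of the adjudication procedure can be illustrated with a short sketch. This is not GRAPHITE's implementation: it assumes the graph remapping has already been done and reduced to per-candidate counts of reads mapped to the variant-allele branch and to the reference branch. The `Candidate` type, the `adjudicate` function, and the `min_alt_reads` threshold are hypothetical, and the abstract does not state what level of support counts as confirmation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A candidate call with read counts from a (hypothetical) graph remapping step."""
    variant_id: str
    alt_branch_reads: int  # reads mapped to the branch representing the variant allele
    ref_branch_reads: int  # reads mapped to the reference branch at the same site


def adjudicate(
    candidates: List[Candidate], min_alt_reads: int = 2
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (retained, discarded) by variant-branch read support.

    min_alt_reads is an assumed confirmation threshold chosen for illustration.
    """
    retained = [c for c in candidates if c.alt_branch_reads >= min_alt_reads]
    discarded = [c for c in candidates if c.alt_branch_reads < min_alt_reads]
    return retained, discarded


# Toy candidate set pooled from several callers; all values are made up.
candidates = [
    Candidate("del_chr1_1000", alt_branch_reads=12, ref_branch_reads=15),
    Candidate("ins_chr2_5000", alt_branch_reads=0, ref_branch_reads=30),
    Candidate("snv_chr3_42", alt_branch_reads=1, ref_branch_reads=28),
]
retained, discarded = adjudicate(candidates)
```

Starting from an inclusive candidate set and filtering only on variant-branch support is what lets the approach keep the sensitivity of the pooled callers while raising specificity.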
Because the graph construction and mapping approach works for most types of SVs in addition to all short variants, variants of all different types can be integrated in a single step. Here we present the application of this method for cross-validating structural variant calls from Pacific Biosciences data by remapping deep Illumina WGS read sets to Variant Graphs constructed using the candidate Pacific Biosciences variants, as part of the Human Genome Structural Variation Consortium (HGSVC) data analysis project. We also present GRAPHITE's application to improving the accuracy of allele frequency measurement in tumor sequencing data, which is essential for the accurate reconstruction of subclonal evolution in longitudinal tumor samples.

Citation Format: Dillon Lee, Yi Qiao, Gabor Marth. A graph remapping framework for in silico adjudication of SNVs, INDELs, and structural variants from genetic sequencing data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 3272.

Articles published on Pacific Biosciences Data

Benchmarking long-read genome sequence alignment tools for human genomics applications.

A chromosome-level genome assembly of the beet armyworm Spodoptera exigua

Accurate isoform discovery with IsoQuant using long reads

A chromosome-level genome assembly for the eastern fence lizard (Sceloporus undulatus), a reptile model for physiological and evolutionary ecology.

Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing

HLA*LA-HLA typing from linearly projected graph alignments.

Abstract 3272: A graph remapping framework for in silico adjudication of SNVs, INDELs, and structural variants from genetic sequencing data

Comparison of genome sequencing technology and assembly methods for the analysis of a GC-rich bacterial genome.

ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies

A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers

Pacific Biosciences sequencing technology for genotyping and variation discovery in human data
