Abstract

Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.

Highlights

  • Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome

  • Alignments of single-molecule sequencing (SMS) reads suffer from reference bias, which can cause a single-nucleotide variant (SNV) allele to be obscured by gaps in the alignments (Supplementary Fig. 1)

  • Our results demonstrate that highly accurate detection of SNVs is feasible even from long-read sequence data with high error rates


Summary

Introduction

Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Compared with short-read sequencing technologies such as Illumina, the per-base accuracy of SMS reads is low, with an error rate exceeding 10% (primarily due to insertion/deletion errors)[9]. This high error rate makes the detection of small sequence variants such as SNVs, especially heterozygous variants, difficult. Current benchmarks for variant calling in human genomes, developed by the Genome in a Bottle (GIAB) Consortium[14,15], are based on short-read sequence data and cover ~90.8% of the reference human genome sequence. These high-confidence variant calls are immensely valuable for developing new variant calling methods and sequencing technologies. The ability to call variants in repetitive regions that are inaccessible to short-read sequencing technologies can advance the use of SMS technologies for detection of disease-causing mutations in duplicated genes via whole-genome or targeted sequencing[17].
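To make the effect of the error rate concrete, the following is a minimal illustrative sketch, not Longshot's algorithm. It assumes a textbook binomial pileup model for a biallelic site in which every sequencing error flips the observed allele. That is a simplification, since SMS errors are mostly insertions and deletions. The function names and the read counts (21 reference reads, 9 alternate reads) are hypothetical and chosen only for illustration.

```python
from math import comb, log10


def genotype_log10_likelihoods(ref_count, alt_count, error_rate):
    """Log10 likelihoods of genotypes 0/0, 0/1 and 1/1 at one biallelic site.

    Assumed model (a simplification, not the paper's method): each read
    independently shows the alternate allele with probability p, where
    p = error_rate for 0/0, 0.5 for 0/1, and 1 - error_rate for 1/1.
    """
    n = ref_count + alt_count
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    likelihoods = {}
    for genotype, p in p_alt.items():
        likelihoods[genotype] = (
            log10(comb(n, alt_count))
            + alt_count * log10(p)
            + ref_count * log10(1.0 - p)
        )
    return likelihoods


def het_margin(ref_count, alt_count, error_rate):
    """Log10 likelihood ratio of heterozygous (0/1) over homozygous reference (0/0)."""
    ll = genotype_log10_likelihoods(ref_count, alt_count, error_rate)
    return ll["0/1"] - ll["0/0"]


if __name__ == "__main__":
    # Hypothetical pileup: 30 reads, 9 of which support the alternate allele.
    for rate in (0.01, 0.10):
        print(f"error rate {rate:.2f}: het margin = {het_margin(21, 9, rate):.2f}")
```

Under these assumptions, the same pileup supports the heterozygous genotype far less strongly at a 10% error rate than at a 1% error rate, which is the difficulty the paragraph above describes.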
