Abstract

Whole-genome sequencing resolves many clinical cases where standard diagnostic methods have failed. However, at least half of these cases remain unresolved after whole-genome sequencing. Structural variants (SVs; genomic variants larger than 50 base pairs) of uncertain significance are the genetic cause of a portion of these unresolved cases. As sequencing methods using long or linked reads become more accessible and SV detection algorithms improve, clinicians and researchers are gaining access to thousands of reliable SVs of unknown disease relevance. Methods to predict the pathogenicity of these SVs are required to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE to distinguish pathogenic SVs from benign SVs that overlap exons. In a random forest classifier, we integrated features that capture gene importance, coding region, conservation, expression, and exon structure. We found that features such as expression and conservation are important but are absent from SV classification guidelines. We leveraged multiple resources to construct a size-matched training set of rare, putatively benign and pathogenic SVs. StrVCTVRE performs accurately across a wide SV size range on independent test sets, which will allow clinicians and researchers to eliminate about half of SVs from consideration while retaining 90% sensitivity. We anticipate clinicians and researchers will use StrVCTVRE to prioritize SVs in probands where no SV is immediately compelling, empowering deeper investigation into novel SVs to resolve cases and understand new mechanisms of disease. StrVCTVRE runs rapidly and is publicly available.

Highlights

  • Whole-genome sequencing (WGS) can identify causative variants in clinical cases that elude other diagnostic methods.[1]

  • To model coding sequence (CDS) disruptions, we used three coding features: the percentage of the CDS overlapped by the structural variant (SV), the distance from the CDS start to the nearest position in the SV, and the distance from the CDS end to the nearest position in the SV.

  • We evaluated the predictive ability of the transcript consequence reported by Variant Effect Predictor (VEP) (AUC = 0.47; 95% confidence interval [CI]: 0.42–0.52), and we found it performed no better than random.
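
The three coding features named in the highlights above can be illustrated with a minimal sketch. This is not the StrVCTVRE implementation: the function names, the half-open 0-based genomic coordinates, the plus-strand orientation, and the choice to measure distances in base pairs are all assumptions made for illustration, and the intervals in the usage note are invented.

```python
from typing import Dict, List, Tuple

# Assumption: half-open [start, end) intervals in 0-based genomic coordinates.
Interval = Tuple[int, int]


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def distance_to_interval(pos: int, iv: Interval) -> int:
    """Distance in bp from one position to the nearest base of iv (0 if inside)."""
    start, end = iv
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - (end - 1)  # end - 1 is the last base covered by iv
    return 0


def coding_features(cds_segments: List[Interval], sv: Interval) -> Dict[str, float]:
    """Hypothetical helper: three CDS-disruption features for one transcript.

    cds_segments are the non-overlapping coding portions of the exons.
    Assumption: plus-strand gene, so the CDS start is the lowest coordinate;
    for a minus-strand gene the two distance features would swap.
    """
    cds_len = sum(end - start for start, end in cds_segments)
    overlapped = sum(overlap_len(seg, sv) for seg in cds_segments)
    first_coding = min(start for start, _ in cds_segments)
    last_coding = max(end for _, end in cds_segments) - 1
    return {
        "pct_cds_overlapped": 100.0 * overlapped / cds_len,
        "dist_cds_start_to_sv": distance_to_interval(first_coding, sv),
        "dist_cds_end_to_sv": distance_to_interval(last_coding, sv),
    }
```

For example, with invented coding segments `[(100, 200), (300, 400), (500, 600)]` and an SV spanning `(350, 550)`, the SV covers 100 of the 300 coding bases, lies 250 bp from the CDS start, and lies 50 bp from the CDS end. An SV that spans the whole gene yields 100% overlap and both distances equal to zero.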

Introduction

Whole-genome sequencing (WGS) can identify causative variants in clinical cases that elude other diagnostic methods.[1] As the price of WGS falls and it is used more frequently, researchers and clinicians will increasingly observe structural variants (SVs) of unknown significance. SVs are a heterogeneous class of genomic variants that include copy-number variants such as duplications and deletions, rearrangements such as inversions, and mobile element insertions. While a typical short-read WGS study finds 5,000–10,000 SVs per human genome, long-read WGS is able to identify more than 20,000 with much greater reliability.[2,3,4] This is two orders of magnitude fewer than the ~3 million single-nucleotide variants (SNVs) identified in a typical WGS study. Despite their relatively small number, SVs play a disproportionately large role in genetic disease and are of great interest to clinical geneticists and researchers.[5,6]
