PRINCESS: comprehensive detection of haplotype resolved SNVs, SVs, and methylation

Medhat Mahmoud, Winston Timp, Harshavardhan Doddapaneni, Fritz J Sedlazeck

doi:10.1186/s13059-021-02486-w

Abstract

Long-read sequencing has been shown to have advantages in structural variation (SV) detection and methylation calling. Many studies focus on only one of SV detection, methylation calling, or SNV phasing; however, only the combination of these variant types provides comprehensive insight into a sample and thus enables novel findings in biology or medicine. PRINCESS is a structured workflow that takes raw sequence reads and generates a fully phased SNV, SV, and methylation call set within a few hours. PRINCESS achieves high accuracy and long phasing even on low-coverage datasets and can resolve repetitive, complex, medically relevant genes that often escape detection. PRINCESS is publicly available at https://github.com/MeHelmy/princess under the MIT license.

Highlights

Long-read sequencing (LRS) is becoming more broadly available across sequencing centers and smaller academic institutions [1].
PRINCESS consists of multiple stages: (i) initial data quality control, (ii) alignment of the reads, (iii) identification of SNVs and indels, (iv) identification of structural variation (SV), (v) filtering of variants, (vi) joint phasing of SNVs, indels, and SVs, and (vii) reporting of the results.
To ease the use of PRINCESS, we have incorporated preset parameters that optimize the analysis for the three major long-read technologies: PacBio continuous long reads (CLR), PacBio High Fidelity (HiFi), and Oxford Nanopore (ONT).
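
The stage order and the three platform choices listed above can be sketched as a small data structure. This is only an illustration of the workflow's shape, not PRINCESS's actual code or command-line interface: the stage identifiers, the `Platform` enum, and the `plan` helper are hypothetical names introduced here, and no real preset parameter values are shown.

```python
from enum import Enum

# Stage identifiers paraphrase stages (i)-(vii) from the text.
# The names themselves are hypothetical, not taken from PRINCESS.
STAGES = (
    "quality_control",
    "alignment",
    "snv_indel_calling",
    "sv_calling",
    "variant_filtering",
    "joint_phasing",
    "reporting",
)


class Platform(Enum):
    """The three long-read technologies for which presets are provided."""
    CLR = "PacBio CLR"
    HIFI = "PacBio HiFi"
    ONT = "Oxford Nanopore"


def plan(platform, stop_after="reporting"):
    """Return (platform label, stage) pairs in execution order.

    The plan ends at `stop_after`, which must be one of STAGES.
    """
    if stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after!r}")
    end = STAGES.index(stop_after) + 1
    return [(platform.value, stage) for stage in STAGES[:end]]


if __name__ == "__main__":
    for label, stage in plan(Platform.ONT):
        print(f"{label}: {stage}")
```

Keeping the stages in one ordered tuple makes the dependency explicit: phasing comes after both small-variant and SV calling because it consumes the filtered output of both.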

Summary

Background

Long-read sequencing (LRS) is becoming more broadly available across sequencing centers and smaller academic institutions [1]. The detection of small variants (SNVs and indels, typically 1–50 bp), SVs (50+ bp: deletions, duplications, insertions, inversions, and translocations), and methylation differences provides important insights into genomics and genetics [20, 21, 22]. Each of these types of genomic variation has been shown to be an important driver of evolution, diversity, and disease or phenotypic change [6, 23, 24]. We highlight PRINCESS’s capability to improve variant identification across 193 medical regions that are difficult to assess with short-read technology and often escape NGS-based detection [38].
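
The size-based split between small variants and SVs described above can be written as a one-line rule. The sketch below is a minimal illustration, not part of PRINCESS: the function name and labels are hypothetical, and because the text gives overlapping ranges (1–50 bp and 50+ bp), it assumes that a variant of exactly 50 bp is counted as an SV.

```python
# Assumption: the boundary case (exactly 50 bp) is treated as an SV.
SV_MIN_LENGTH_BP = 50


def classify_by_size(length_bp):
    """Label a variant as 'small variant' or 'SV' from its length in bp."""
    if length_bp < 1:
        raise ValueError("variant length must be at least 1 bp")
    return "SV" if length_bp >= SV_MIN_LENGTH_BP else "small variant"


if __name__ == "__main__":
    for length in (1, 49, 50, 10_000):
        print(length, classify_by_size(length))
```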

Results

Discussion

Methods