Abstract

Next Generation Sequencing (NGS) technology has taken a central role in the diagnosis of genetically determined cardiovascular diseases. Differentiating sequencing errors from real variants is a key point of genetic testing. For novel variants or variants of uncertain significance that may affect clinical decisions (e.g. ICD implantation in primary cardiomyopathies or preventive surgery in heritable aneurysmal diseases), it is crucial to exclude false positive (FP) and false negative (FN) sequencing errors. To date, Sanger sequencing is the gold standard used to confirm and validate NGS-identified variants. FPs are excluded by Sanger confirmation, so the damage is limited to wasted cost and time. FNs, in contrast, are non-resolvable errors: they go undetected and are therefore never submitted to Sanger confirmation, with the risk of missing genetic diagnoses.

Purpose: This project aimed to reduce NGS errors by introducing a bioinformatic tool into the analysis step of the genetic testing process. iEVA is a tool that enhances NGS-derived informative features so that they can be used in a filtering process based on a Machine Learning (ML) algorithm. It considers sequencing features (e.g. technical errors, PCR duplicates) together with nucleotide sequence characteristics.

Methods: To demonstrate the effectiveness of iEVA in eliminating FP and FN errors from the NGS bioinformatic pipeline, we developed two ML-based filtering algorithms. The training dataset consisted of 7968 single nucleotide variants (SNVs) and 306 insertions and deletions (InDels), validated by Sanger sequencing performed by expert molecular biologists. The variants were derived from 800 sequences obtained with the Illumina TruSight Cardio panel, which contains 174 genes related to cardiovascular diseases. Two Random Forest classifiers were trained to discriminate between sequencing errors and real variants: the first using attributes derived from the most common variant caller (GATK v3.8), and the second using iEVA results. We evaluated the ML models with 3-fold cross-validation and validated the results on an independent validation dataset consisting of 3415 SNVs and 132 InDels.

Results: With iEVA attributes, we obtained 1 fewer FP (excluded by Sanger) and 3 fewer FNs (confirmed by Sanger) than with the common variant caller attributes. In the independent validation dataset, the iEVA-trained classification model identified 1 Sanger-confirmed variant that was missed by the variant caller-trained model.

Conclusions: Variant filtering is crucial to exclude sequencing errors and to recognize true variants. Even a single filtering error may harm the patient when a genetic diagnosis is missed. A certain genetic diagnosis requires a 0% error probability. Introducing iEVA into the pipeline is an easy, time- and cost-saving way to reduce errors and to improve the precision of the genetic data.

Acknowledgement/Funding: Italian Ministry of Health Research Funding to the IRCCS Foundation University Hospital Policlinico San Matteo of Pavia
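
The comparison described in Methods (two Random Forest classifiers trained on different attribute sets and scored with 3-fold cross-validation) can be illustrated with a minimal sketch. It is not the authors' pipeline. The data are synthetic, generated with scikit-learn rather than taken from GATK or iEVA output. The column split into a "caller" subset and an "extended" set, the helper name `count_errors`, and all hyperparameters are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for Sanger-labelled calls:
# 1 = real variant, 0 = sequencing error. Not real NGS data.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_redundant=0,
    shuffle=False, random_state=0,
)

# Hypothetical attribute sets: with shuffle=False the first six columns
# are the informative ones, so the "caller" subset sees only part of the signal.
caller_features = X[:, :3]
extended_features = X


def count_errors(features, labels, seed=0):
    """Return (FP, FN) from 3-fold cross-validated Random Forest predictions."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    # Each sample is predicted by a model that never saw it during training.
    predicted = cross_val_predict(model, features, labels, cv=folds)
    tn, fp, fn, tp = confusion_matrix(labels, predicted, labels=[0, 1]).ravel()
    return int(fp), int(fn)


for name, features in [("caller", caller_features), ("extended", extended_features)]:
    fp, fn = count_errors(features, y)
    print(f"{name}: FP={fp} FN={fn}")
```

Counting FP and FN separately, rather than reporting accuracy alone, mirrors the distinction the abstract draws: an FP costs a Sanger confirmation, whereas an FN is never re-examined.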
