Abstract

Predicting the effect of a single-nucleotide variant (SNV) in an intronic region on aberrant pre-mRNA splicing is challenging, except for an SNV affecting the canonical GU/AG splice sites (ss). To predict the pathogenicity of SNVs at intronic positions −50 (Int-50) to −3 (Int-3) close to the 3’ ss, we developed light gradient boosting machine (LightGBM)-based IntSplice2 models using pathogenic SNVs in the Human Gene Mutation Database (HGMD) and ClinVar and common SNVs in dbSNP with 0.01 ≤ minor allele frequency (MAF) < 0.50. The LightGBM models were generated using features representing splicing cis-elements. The average recall/sensitivity and specificity of IntSplice2 by fivefold cross-validation (CV) of the training dataset were 0.764 and 0.884, respectively. The recall/sensitivity of IntSplice2 was lower than the average recall/sensitivity of 0.800 of IntSplice, which we previously built with support vector machine (SVM) modeling for the same intronic positions. In contrast, the specificity of IntSplice2 was higher than the average specificity of 0.849 of IntSplice. For benchmarking (BM) of IntSplice2 against IntSplice, we made a test dataset that was not used to train IntSplice. After excluding the test dataset from the training dataset, we generated IntSplice2-BM and compared it with IntSplice using the test dataset. IntSplice2-BM was superior to IntSplice in all seven statistical measures: accuracy, precision, recall/sensitivity, specificity, F1 score, negative predictive value (NPV), and Matthews correlation coefficient (MCC). The IntSplice2 web service is available at https://www.med.nagoya-u.ac.jp/neurogenetics/IntSplice2.
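
The selection criteria stated in the abstract (intronic positions Int-50 to Int-3, and common SNVs with 0.01 ≤ MAF < 0.50) can be sketched as a small labeling function. This is only an illustration under assumptions: the `Snv` record, its field names, and the treatment of common SNVs as the negative class are hypothetical and are not taken from the IntSplice2 pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snv:
    """Hypothetical SNV record; field names are illustrative only."""
    intron_pos: int          # position relative to the 3' ss, e.g. -50 .. -3
    maf: Optional[float]     # minor allele frequency, None if unknown
    pathogenic: bool         # True if annotated as pathogenic (e.g. HGMD/ClinVar)


def in_window(pos: int) -> bool:
    """True for intronic positions Int-50 through Int-3 (inclusive)."""
    return -50 <= pos <= -3


def training_label(snv: Snv) -> Optional[int]:
    """Return 1 (pathogenic), 0 (common SNV), or None (excluded).

    Assumption: common SNVs with 0.01 <= MAF < 0.50 form the negative class.
    """
    if not in_window(snv.intron_pos):
        return None
    if snv.pathogenic:
        return 1
    if snv.maf is not None and 0.01 <= snv.maf < 0.50:
        return 0
    return None
```

Note that the MAF interval is half-open, so an SNV with MAF exactly 0.50 is excluded while one with MAF exactly 0.01 is kept.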

Highlights

  • RNA splicing is an essential process to generate mature mRNAs from precursor mRNAs, especially in higher eukaryotes (Crick, 1979)

  • We evaluated the performance of IntSplice2 models by fivefold cross-validation (CV) with the area under the receiver operating characteristic curve (AUROC) and the area under the precision/recall curve (AUPR), as well as with seven statistical measures (accuracy, precision, recall/sensitivity, specificity, F1 score, negative predictive value (NPV), and Matthews correlation coefficient (MCC)), which were recommended in the Human Mutation guidelines (Vihinen, 2013; Grimm et al, 2015)

  • We found that common single-nucleotide variants (SNVs) with 0.01 ≤ minor allele frequency (MAF) < 0.50 gave rise to better scores in seven out of nine statistical measures than those with 0.01 ≤ MAF < 0.99 (Supplementary Table 3)
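
For reference, the seven statistical measures named above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name and the returned dictionary keys are illustrative choices, not part of IntSplice2.

```python
import math


def seven_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the seven measures from confusion-matrix counts.

    Ratios with a zero denominator are returned as NaN rather than raising.
    """
    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),            # also called sensitivity
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of precision and recall
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

As a worked example, `seven_measures(tp=8, fp=2, tn=18, fn=2)` gives a precision, recall, and F1 score of 0.8, a specificity and NPV of 0.9, and an MCC of (8·18 − 2·2)/√(10·10·20·20) = 0.7.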

Summary

Introduction

RNA splicing is an essential process to generate mature mRNAs from precursor mRNAs, especially in higher eukaryotes (Crick, 1979). In the spliceosomal E complex at the first stage of splicing, U1 snRNP binds to the 5’ splice site (ss) spanning the “GU” dinucleotide; SF1 binds to the branch point sequence (BPS); U2AF65 binds to the polypyrimidine tract (PPT); U2AF35 binds to the intron/exon boundary spanning the “AG” dinucleotide; and accessory splicing factors like serine–arginine-rich splicing factors (SRSFs) and heterogeneous nuclear ribonucleoproteins (hnRNPs) bind to their cognate exonic/intronic sequences (Ohno et al, 2018). We developed IntSplice2 using newly available SNV datasets and light gradient boosting machine (LightGBM) (Ke et al, 2017), which is a free and open-source distributed gradient boosting framework that uses tree-based learning algorithms.
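
To make the idea of tree-based gradient boosting concrete, here is a deliberately tiny sketch that boosts one-split trees (stumps) on the log loss for a binary label. It is not LightGBM and not the IntSplice2 model: LightGBM adds histogram-based split finding, leaf-wise tree growth, second-order gradient information, and regularization, none of which appear here. All function names, the feature layout, and the default hyperparameters are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split minimizing squared error on residuals.

    Returns (feature, threshold, left_value, right_value), or None if no
    feature has two distinct values to split on.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return None if best is None else best[1:]


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.3):
    """Gradient boosting on log loss; y must contain both 0 and 1."""
    pos = sum(y) / len(y)
    base = math.log(pos / (1.0 - pos))       # initial raw score: log-odds of the prior
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to the raw score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, thr, lv, rv = stump
        stumps.append(stump)
        scores = [s + learning_rate * (lv if row[j] <= thr else rv)
                  for row, s in zip(X, scores)]
    return base, learning_rate, stumps


def predict_proba(model, row) -> float:
    """Probability of the positive class for one feature vector."""
    base, lr, stumps = model
    score = base + sum(lr * (lv if row[j] <= thr else rv) for j, thr, lv, rv in stumps)
    return sigmoid(score)
```

With a single hypothetical feature, `fit_boosted_stumps([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1])` learns a split between 0.2 and 0.8, so `predict_proba` returns a value below 0.5 for `[0.15]` and above 0.5 for `[0.85]`.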

