Abstract

In silico tools have been developed to predict variants that may have an impact on pre-mRNA splicing. The major limitation of the application of these tools to basic research and clinical practice is the difficulty in interpreting the output. Most tools only predict potential splice sites given a DNA sequence without measuring splicing signal changes caused by a variant. Another limitation is the lack of large-scale evaluation studies of these tools. We compared eight in silico tools on 2959 single nucleotide variants within splicing consensus regions (scSNVs) using receiver operating characteristic analysis. The Position Weight Matrix model and MaxEntScan outperformed other methods. Two ensemble learning methods, adaptive boosting and random forests, were used to construct models that take advantage of individual methods. Both models further improved prediction, with outputs of directly interpretable prediction scores. We applied our ensemble scores to scSNVs from the Catalogue of Somatic Mutations in Cancer database. Analysis showed that predicted splice-altering scSNVs are enriched in recurrent scSNVs and known cancer genes. We pre-computed our ensemble scores for all potential scSNVs across the human genome, providing a whole genome level resource for identifying splice-altering scSNVs discovered from large-scale sequencing studies.
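The receiver operating characteristic comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `roc_auc`, the tool names `tool_a` and `tool_b`, and all scores and labels are hypothetical toy values. The sketch assumes each tool emits one numeric score per variant, with higher scores indicating a stronger splice-altering prediction, and computes the area under the ROC curve through the rank-sum (Mann-Whitney U) identity.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 = splice-altering (positive), 0 = neutral (negative).
    Tied scores receive the average of the ranks they span.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative variant")

    # Indices of the variants sorted by ascending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to the end of the current group of tied scores.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: six variants scored by two made-up tools.
labels = [1, 1, 1, 0, 0, 0]
tool_scores = {
    "tool_a": [0.9, 0.8, 0.4, 0.5, 0.2, 0.1],
    "tool_b": [0.6, 0.3, 0.7, 0.65, 0.2, 0.4],
}

auc_by_tool = {name: roc_auc(s, labels) for name, s in tool_scores.items()}
for name, auc in sorted(auc_by_tool.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of positive and negative variants, so tools can be ranked by this single number on a shared variant set.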

Highlights

  • Since pre-mRNA splicing was first discovered in the 1970s [1,2], DNA variations that disrupt normal splicing have been linked to human genetic diseases [3,4,5]

  • Positive variants were downloaded from three databases: (i) the Human Gene Mutation Database (HGMD) Professional Version 2013.1, which contains more than 13 000 mutations with consequences for mRNA splicing [8]; most are located at invariant GT-AG sites, while the remaining sites are mostly exonic; (ii) the SpliceDisease database, which collects and curates experimentally supported data of RNA splicing mutations and disease [20]; this database integrated 2337 splicing mutation-disease entries, including 303 genes and 370 human diseases from 898 publications; and (iii) the Database for Aberrant Splice Sites (DBASS), which contains 577 and 307 records of mutation-induced and disease-causing aberrant 5′ and 3′ splice sites, respectively [21]

  • After filtering data according to our inclusion and exclusion criteria, 1164 unique splice-altering scSNVs within 408 genes from three databases constituted our positive group, among which 790 were from HGMD [8], 266 from the SpliceDisease database [20] and 108 from DBASS [21]


Summary

Introduction

Since pre-mRNA splicing was first discovered in the 1970s [1,2], DNA variations that disrupt normal splicing have been linked to human genetic diseases [3,4,5]. Unlike nonsynonymous mutations within coding regions, which directly alter amino acids by changing the codon, splice-altering mutations influence the normal process of removing introns from the pre-mRNA and rejoining the remaining exons. This normal process is regulated by complicated mechanisms that usually result in the production of different proteins by exon skipping, intron retention, use of different 5′ or 3′ splice sites, etc. A recent study sequenced the whole genomes of 962 individuals and identified more than 25 million genetic variants [11]. This provides opportunities to discover novel causal variants, but it also makes prioritizing these newly identified variants more challenging, because confirming each variant in vivo or in vitro is infeasible.

