Background
A central aim of precision medicine is to target treatments to the underlying causes of disease. To target treatments accurately, we must be able to recognize pathogenic genetic variants. Current methods prioritize variants that directly alter protein sequence (missense and loss of function) but not variants that may cause disease by changing the processing of final transcripts. Because this effect is difficult to capture, synonymous and intronic variants are overlooked when searching for disease risk in sequenced genomes.

Methods
The TraP score was constructed using three main components:
1) Information acquisition: details of the harboring gene and its transcripts are gathered for each variant.
2) Feature calculation: possible changes to sequence motifs are evaluated, including changes to exon-intron boundaries, creation of cryptic splice sites, creation and disruption of cis-acting binding sites for splicing regulatory proteins, and interactions between selected features, such as the original and new splice sites, among others. Overall, 42 features and 14 general properties (chromosome, strand, coordinate, etc.) are collected for each variant.
3) Modeling: selected features are incorporated into a random forest model. The model is trained on a set of 75 pathogenic synonymous variants and 402 benign variants. The pathogenic variants are strongly associated with rare disease, whereas the 402 benign variants are de novo mutations identified in healthy individuals.

Results
The Transcript-inferred Pathogenicity score (TraP) presented here was constructed to reliably identify non-coding mutations that cause disease. TraP is strongly negatively correlated with allele frequency in both synonymous and intronic regions, suggesting that the higher the TraP score, the stronger the selection against these variants in the population.
Moreover, synonymous variants with high TraP scores have significantly lower minor allele frequencies than even missense variants, indicating that TraP identifies a subset of synonymous variants under stronger purifying selection. TraP identifies known pathogenic variants in synonymous and intronic ClinVar datasets (AUC = 0.88 and 0.83, respectively) while dismissing benign variants with a specificity above 99%. Applied to the exomes of 281 epilepsy family trios, TraP pinpoints synonymous de novo variants in known epilepsy genes. With its high performance and specificity, TraP clearly outperforms existing methods and allows the prioritization of synonymous and intronic variants for use in gene discovery and the interpretation of personal genomes.

Discussion
Exome sequencing studies consider rare non-synonymous variants as disease candidates, while other variant types are mostly ignored. Some existing methods can prioritize synonymous and intronic variants, yet lack the specificity required to detect causal variants. TraP discards over 99% of non-coding variants as benign while still identifying true pathogenic variants. TraP identifies pathogenic variants that are not conserved yet are rare in the population. That it does so without prior population frequency information, and in contrast to the GERP++ and CADD scores, suggests that TraP identifies pathogenic events that were not selected against during vertebrate evolution but are selected against in the human population. This conclusion is supported by the finding that alternative splicing is most complex in primates and by the species-specific nature of splicing regulation.
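
As an illustration of the two evaluation quantities reported in Results (AUC and specificity), the following minimal Python sketch computes both from lists of scores. It is not the authors' evaluation code: the function names `roc_auc` and `specificity`, the example scores, and the threshold of 0.5 are all hypothetical and chosen only to show how the metrics are defined.

```python
from typing import Sequence


def roc_auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Rank-based AUC: the probability that a randomly chosen pathogenic
    variant scores higher than a randomly chosen benign one (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def specificity(neg: Sequence[float], threshold: float) -> float:
    """Fraction of benign variants scoring below the threshold,
    i.e. correctly dismissed as benign."""
    return sum(1 for n in neg if n < threshold) / len(neg)


# Made-up scores for illustration only; these are NOT TraP outputs.
pathogenic_scores = [0.95, 0.80, 0.40, 0.99]
benign_scores = [0.05, 0.10, 0.40, 0.02, 0.30]

auc = roc_auc(pathogenic_scores, benign_scores)
spec = specificity(benign_scores, threshold=0.5)  # 0.5 is an assumed cut-off
print(f"AUC = {auc:.3f}, specificity = {spec:.2%}")
```

The pairwise definition is quadratic in the number of variants, which is acceptable for sets of the size described in Methods; a sort-based rank computation would be the usual choice for larger datasets.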