Abstract

Disease- and trait-associated variants represent a tiny minority of all known genetic variation, so there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variants, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants face a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which results in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian diseases and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.

Highlights

  • To show the effectiveness of the proposed approach, we have performed an experimental comparison with state-of-the-art Machine Learning (ML) methods for two challenging and medically relevant ML prediction problems for imbalanced genomic data: the prediction of regulatory mutations underlying Mendelian diseases[10] and the prediction of non-coding variants associated with GWAS regulatory hits from the GWAS-Catalog[19]

  • Results show that imbalance-aware ML strategies can successfully counteract the bias toward the majority class of state-of-the-art ML methods, and are essential for predicting disease-associated non-coding variants in genomic contexts characterized by a large imbalance between the known small set of available rare and common disease-associated variants and the huge set of all known human genetic variation

  • Across the metrics used to compare hyperSMURF with state-of-the-art scoring methods (Area Under the Precision-Recall Curve (AUPRC), Area Under the Receiver Operating Characteristic curve (AUROC), precision, recall and F-measure at different score thresholds, and the distribution of top-ranked variants associated with traits or diseases), hyperSMURF achieves significantly better results than the other methods
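
To make the threshold-based metrics named above concrete, here is a minimal Python sketch that computes precision, recall and F-measure at a given score threshold on a small, made-up imbalanced data set. The scores, labels and function names are illustrative assumptions and are not taken from the paper; the point is only to show why plain accuracy can look excellent while precision and recall remain poor when positives are rare.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall and F-measure when scores >= threshold are called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Convention assumed here: a metric with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(scores, labels, threshold):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(1 for s, y in zip(scores, labels) if (s >= threshold) == (y == 1))
    return hits / len(labels)


# Hypothetical toy data: 2 positives (label 1) among 100 examples.
labels = [1, 1] + [0] * 98
scores = [0.9, 0.4] + [0.6, 0.5] + [0.1] * 96

for t in (0.5, 0.95):
    p, r, f = precision_recall_f1(scores, labels, t)
    print(f"threshold={t}: accuracy={accuracy(scores, labels, t):.2f} "
          f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

At the stricter threshold the toy scorer calls nothing positive, yet its accuracy is still 0.98, which is why imbalance-sensitive measures such as precision, recall and AUPRC are more informative in this setting.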


Summary

Introduction

Imprinting control elements[10]. A better understanding of regulatory variants will be necessary to unravel the functional architecture of rare and common disease. Computational algorithms for the analysis of non-coding deleterious variants face special challenges owing to the rarity of confirmed pathogenic mutations. In this setting, classical learning algorithms, such as support vector machines[16] or artificial neural networks[17], tend to generalize poorly, because they usually predict the minority class with very low sensitivity and precision[18]. In the context of predicting genetic variants associated with traits or diseases, this means that most disease-associated variants are wrongly predicted as non-disease-associated, which significantly limits the usefulness of ML methods for the prediction of novel, disease-associated non-coding variants. To address this problem we have developed hyperSMURF (hyper-ensemble of SMOTE Undersampled Random Forests), a method designed for the analysis of imbalanced genomic data. Results show that imbalance-aware ML strategies can successfully counteract the bias toward the majority class of state-of-the-art ML methods, and are essential for predicting disease-associated non-coding variants in genomic contexts characterized by a large imbalance between the known small set of available rare and common disease-associated variants and the huge set of all known human genetic variation.
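
The resampling idea suggested by the name "hyper-ensemble of SMOTE Undersampled Random Forests" can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the parameters (`n_parts`, `oversample_factor`, `undersample_ratio`, `k`) and the toy data are all hypothetical, points are plain tuples of floats, and the random forest training step is omitted. The sketch partitions the majority class, oversamples the minority class with SMOTE-style interpolation, and undersamples each majority partition so that every training set is far more balanced than the original data.

```python
import random


def partition(items, n_parts, rng):
    """Shuffle the majority class and split it into n_parts disjoint chunks."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    example and one of its k nearest minority neighbours (SMOTE-style).
    Assumes at least two minority examples."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


def balanced_training_sets(minority, majority, n_parts,
                           oversample_factor, undersample_ratio, k=3, seed=0):
    """Return one (positives, negatives) pair per majority partition."""
    rng = random.Random(seed)
    sets = []
    for chunk in partition(majority, n_parts, rng):
        pos = minority + smote(minority, oversample_factor * len(minority), k, rng)
        n_neg = min(len(chunk), undersample_ratio * len(pos))
        neg = rng.sample(chunk, n_neg)  # undersample this partition
        sets.append((pos, neg))
    return sets


# Hypothetical toy data: 4 positives against 200 negatives.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), float(i % 7) + 5.0) for i in range(200)]
sets = balanced_training_sets(minority, majority, n_parts=5,
                              oversample_factor=2, undersample_ratio=3)
print(len(sets), len(sets[0][0]), len(sets[0][1]))  # 5 12 36
```

In a full pipeline, each balanced set would presumably be used to train one base learner (a random forest, going by the method's name), and the per-learner scores would be combined, for example by averaging, to score a new variant.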
