Abstract

Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can produce inaccurate models, or models that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.

Results: We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier.

Conclusion: Harvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.

Highlights

  • Supervised learning from high-throughput sequencing data presents many challenges

  • To further show robustness of the method, we report classification accuracy obtained with three different classifier types: logistic regression (LR) with no regularization, random forest (RF) using 100 trees, and support vector machine (SVM) with a radial basis function kernel

  • We have introduced Harvestman, a new approach to supervised hierarchical feature selection, and demonstrated it on our knowledge graphs built from high-throughput sequence data
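The three classifier configurations named in the highlights could be set up roughly as follows. This is only a sketch: the source does not say which library was used, so scikit-learn is an assumption, the synthetic data from `make_classification` stands in for real selected features, and a very large `C` is used to approximate "no regularization" in logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a samples-by-selected-features matrix (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

classifiers = {
    # A very large C makes the penalty negligible (approximates no regularization).
    "LR": LogisticRegression(C=1e9, max_iter=1000),
    # Random forest with 100 trees, as stated in the highlight.
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # Support vector machine with a radial basis function kernel.
    "SVM": SVC(kernel="rbf"),
}

# Mean 5-fold cross-validated accuracy per classifier type.
accuracy = {
    name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}

for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Reporting the same feature subset under several classifier families is one way to check that a selection method's benefit does not depend on a single model type.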


Summary

Introduction

Supervised learning from high-throughput sequencing data presents many challenges [1, 2]. First among these is the curse of dimensionality, which predisposes learning algorithms to overfitting and imposes barriers to scalability [3]. Second, variant calls may not be the optimal encoding for a given learning task: the most informative and biologically relevant encoding of a given variant may be at a higher level of organization, such as a perturbation in a particular exon, transcript, or pathway. This paper addresses both challenges by introducing Harvestman, a method that automatically identifies a non-redundant set of relevant features chosen from a hierarchy of biological encodings of the raw variants. Our knowledge graph is derived from existing genomic annotations and ontologies to ensure that each putative encoding is biologically relevant, but the Harvestman framework can also incorporate alternative, user-defined knowledge graphs.
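To make the idea of choosing features from a hierarchy of encodings concrete, here is a minimal toy sketch. It is not Harvestman's actual formulation: the paper solves an integer linear program over a knowledge graph, whereas this example uses a simple recursion over a small tree. The hierarchy, the variant calls, the "any variant present" roll-up, the relevance score, and the per-feature penalty are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy hierarchy (parent -> children); leaves are variants.
HIERARCHY: Dict[str, List[str]] = {
    "pathway_P": ["gene_A", "gene_B"],
    "gene_A": ["var_1", "var_2"],
    "gene_B": ["var_3", "var_4"],
}

# Binary variant calls per sample (1 = variant present), invented for the example.
CALLS: Dict[str, List[int]] = {
    "var_1": [1, 0, 0, 0, 0, 0],
    "var_2": [0, 1, 0, 0, 0, 0],
    "var_3": [0, 0, 1, 1, 1, 1],
    "var_4": [1, 0, 1, 1, 0, 0],
}
LABELS: List[int] = [1, 1, 0, 0, 0, 0]  # case/control label per sample


def encode(node: str) -> List[int]:
    """Encode a node as 'perturbed' in a sample if any variant below it is present."""
    if node in CALLS:
        return CALLS[node]
    child_vectors = [encode(child) for child in HIERARCHY[node]]
    return [int(any(bits)) for bits in zip(*child_vectors)]


def relevance(vector: List[int], labels: List[int]) -> float:
    """Illustrative score: absolute difference in feature mean between the classes."""
    pos = [x for x, y in zip(vector, labels) if y == 1]
    neg = [x for x, y in zip(vector, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def select(node: str, labels: List[int], penalty: float = 0.1) -> Tuple[float, List[str]]:
    """Pick either this node or the best selection within its children, never both.

    Each selected feature pays a fixed penalty, which favours smaller subsets.
    No selected feature is an ancestor of another, so the result is non-redundant
    with respect to the hierarchy.
    """
    own = relevance(encode(node), labels) - penalty
    if node in CALLS:  # leaf: keep it only if it is worth its penalty
        return (own, [node]) if own > 0 else (0.0, [])
    child_score, child_nodes = 0.0, []
    for child in HIERARCHY[node]:
        score, nodes = select(child, labels, penalty)
        child_score += score
        child_nodes += nodes
    if own > 0 and own >= child_score:
        return own, [node]
    return child_score, child_nodes


score, selected = select("pathway_P", LABELS)
print(selected)
```

In this toy data the two variants under `gene_A` are individually weak but jointly separate the classes, so the gene-level encoding is chosen for them, while under `gene_B` a single variant is more informative than the gene-level roll-up. The selection therefore mixes levels of the hierarchy rather than committing to one representation.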

