Abstract
BackgroundSupervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.ResultsWe demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collection of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious—it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier.ConclusionHarvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program , Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.
Highlights
Supervised learning from high-throughput sequencing data presents many challenges
To further show robustness of the method, we report classification accuracy obtained with three different classifier types, logistic regression (LR) with no regularization, random forest (RF) using 100 trees, and support vector machine (SVM) with radial basis function kernel
We have introduced Harvestman, a new approach to supervised hierarchical feature selection, and demonstrated it on our knowledge graphs built from high-throughput sequence data
Summary
Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. Variant calls may not be the optimal encoding for a given learning task, which contributes to poor predictive capabilities To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. Introduction Supervised learning from high-throughput sequencing data presents many challenges [1, 2] First among these is the curse of dimensionality, which predisposes learning algorithms to overfitting and imposes barriers to scalability [3]. The most informative, and biologically relevant encoding of a given variant may be at a higher level of organization, such as a perturbation in a particular exon, transcript, or pathway This paper addresses both challenges by introducing Harvestman, a method that automatically identifies a non-redundant set of relevant features chosen from a hierarchy of biological encodings of the raw variants. Our knowledge graph is derived from existing genomic annotations and ontologies, to ensure that each putative encoding is biologically relevant, but the Harvestman framework can incorporate alternative, user-defined knowledge graphs
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.