Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data

Trevor S Frisby, Christopher J Langmead, Carl Kingsford, Shawn J Baker, Guillaume Marçais, Quang Minh Hoang

doi:10.1186/s12859-021-04096-6

Open Access

https://doi.org/10.1186/s12859-021-04096-6

Journal: BMC Bioinformatics
Publication Date: Apr 1, 2021
Citations: 1
License type: open-access

Affiliation: Carnegie Mellon University

Abstract

Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.

Results: We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier.

Conclusion: Harvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.
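To make the idea of selecting a non-redundant feature subset from a hierarchy concrete, the following is a minimal sketch and not the authors' formulation: the abstract does not state the actual integer linear program. It assumes one plausible constraint (a node and any of its ancestors are never selected together) and a per-feature penalty that rewards parsimony. The graph, the scores, the `penalty` value, and every function name are hypothetical, and exhaustive search stands in for an ILP solver on this toy instance.

```python
from itertools import combinations

# Hypothetical toy knowledge graph: child -> parent (None marks the root).
PARENT = {
    "pathway_P": None,
    "gene_A": "pathway_P",
    "gene_B": "pathway_P",
    "snp_1": "gene_A",
    "snp_2": "gene_A",
    "snp_3": "gene_B",
}

# Made-up relevance scores for each candidate encoding (illustration only).
SCORE = {
    "pathway_P": 0.50, "gene_A": 0.45, "gene_B": 0.10,
    "snp_1": 0.20, "snp_2": 0.15, "snp_3": 0.30,
}


def ancestors(node):
    """Return every node on the path from `node` up to the root."""
    found = set()
    while PARENT[node] is not None:
        node = PARENT[node]
        found.add(node)
    return found


def is_non_redundant(subset):
    """Assumed constraint: no selected node has a selected ancestor."""
    chosen = set(subset)
    return all(not (ancestors(node) & chosen) for node in chosen)


def select_features(penalty=0.05):
    """Exhaustively maximize sum(score - penalty) over feasible subsets."""
    best, best_value = (), 0.0
    nodes = sorted(PARENT)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if not is_non_redundant(subset):
                continue
            value = sum(SCORE[node] - penalty for node in subset)
            if value > best_value:
                best, best_value = subset, value
    return set(best), best_value


selected, value = select_features()
print(sorted(selected), round(value, 2))
```

On this toy instance the search mixes levels of the hierarchy, keeping one gene-level feature and one SNP-level feature, which is the kind of adapted, mixed representation the abstract describes. Exhaustive enumeration is only feasible for a handful of nodes; a real instance would need an ILP solver.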

Highlights

Supervised learning from high-throughput sequencing data presents many challenges.
To further show robustness of the method, we report classification accuracy obtained with three different classifier types: logistic regression (LR) with no regularization, random forest (RF) using 100 trees, and support vector machine (SVM) with a radial basis function kernel.
We have introduced Harvestman, a new approach to supervised hierarchical feature selection, and demonstrated it on our knowledge graphs built from high-throughput sequence data.

Summary

Introduction

Supervised learning from high-throughput sequencing data presents many challenges [1, 2]. First among these is the curse of dimensionality, which predisposes learning algorithms to overfitting and imposes barriers to scalability [3]. Second, variant calls may not be the optimal encoding for a given learning task, which contributes to poor predictive capabilities. The most informative and biologically relevant encoding of a given variant may be at a higher level of organization, such as a perturbation in a particular exon, transcript, or pathway.

This paper addresses both challenges by introducing Harvestman, a method that automatically identifies a non-redundant set of relevant features chosen from a hierarchy of biological encodings of the raw variants. Harvestman takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. Our knowledge graph is derived from existing genomic annotations and ontologies to ensure that each putative encoding is biologically relevant, but the Harvestman framework can incorporate alternative, user-defined knowledge graphs.
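As an illustration of what a higher-level encoding of a variant can look like, here is a small sketch under stated assumptions. It treats a gene as perturbed when any variant annotated to it is present, and a pathway as perturbed when any of its genes is. This logical-OR rule, the annotation tables, the sample data, and the `encode` function are all hypothetical; the text above does not specify how Harvestman derives its encodings.

```python
# Hypothetical annotations linking variants to genes and genes to a pathway.
GENE_TO_VARIANTS = {"gene_A": ["snp_1", "snp_2"], "gene_B": ["snp_3"]}
PATHWAY_TO_GENES = {"pathway_P": ["gene_A", "gene_B"]}

# Made-up binary variant calls per sample (1 = variant present).
CALLS = {
    "sample_1": {"snp_1": 1, "snp_2": 0, "snp_3": 0},
    "sample_2": {"snp_1": 0, "snp_2": 0, "snp_3": 1},
    "sample_3": {"snp_1": 0, "snp_2": 0, "snp_3": 0},
}


def encode(sample_calls):
    """Add gene- and pathway-level indicators to the raw variant calls."""
    features = dict(sample_calls)
    # Assumed rule: a gene is perturbed if any of its variants is present.
    for gene, variants in GENE_TO_VARIANTS.items():
        features[gene] = int(any(sample_calls[v] for v in variants))
    # Assumed rule: a pathway is perturbed if any of its genes is perturbed.
    for pathway, genes in PATHWAY_TO_GENES.items():
        features[pathway] = int(any(features[g] for g in genes))
    return features


ENCODED = {sample: encode(calls) for sample, calls in CALLS.items()}

for sample, features in ENCODED.items():
    print(sample, features)
```

In this toy data, sample_1 and sample_2 share no variant, yet both carry a perturbation in pathway_P. A classifier restricted to SNP-level features would treat them as unrelated, whereas the pathway-level encoding groups them together.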

