Abstract
Since its identification in 1983, HIV-1 has been the focus of a research effort unprecedented in scope and difficulty, whose ultimate goals, a cure and a vaccine, remain elusive. One of the fundamental challenges in accomplishing these goals is the tremendous genetic variability of the virus, with some genes differing at as many as 40% of nucleotide positions among circulating strains. Because of this, the genetic bases of many viral phenotypes, most notably the susceptibility to neutralization by a particular antibody, are difficult to identify computationally. Drawing upon open-source general-purpose machine learning algorithms and libraries, we have developed a software package, IDEPI (IDentify EPItopes), for learning genotype-to-phenotype predictive models from sequences with known phenotypes. IDEPI can apply learned models to classify sequences of unknown phenotypes, and can also identify specific sequence features that contribute to a particular phenotype. We demonstrate that IDEPI achieves performance similar to or better than that of previously published approaches on four well-studied problems: finding the epitopes of broadly neutralizing antibodies (bNab), determining coreceptor tropism of the virus, identifying compartment-specific genetic signatures of the virus, and deducing drug-resistance associated mutations. The cross-platform Python source code (released under the GPL 3.0 license), documentation, issue tracking, and a pre-configured virtual machine for IDEPI can be found at https://github.com/veg/idepi.
Highlights
The challenge of predicting a viral phenotype from sequence data has many motivating examples in HIV-1 research
IDEPI is customizable: different machine learning algorithms implemented in scikit-learn can be used; new sequence features can be defined using a well-specified application programming interface (API); various feature selection approaches can be used; performance can be optimized with respect to many metrics
Simulated data: to establish baseline performance of IDEPI where the true "phenotype" is known, we simulated the evolution of N~241 HIV-1 protein envelope sequences subject to directional selective pressure applied to sites in an epitope along a subset of terminal tree branches selected at random
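To make the idea of a "sequence feature" concrete, the sketch below turns a toy protein alignment into binary site/residue indicators, one common way to present a genotype to a classifier. It is an illustration only and does not use IDEPI's API: the function name `site_residue_features`, the `min_count` parameter, the "2K"-style feature labels, and the four toy sequences are all assumptions made for this example.

```python
from collections import Counter


def site_residue_features(alignment, min_count=1):
    """Return (feature_labels, matrix) of binary site/residue indicators.

    alignment: list of equal-length aligned sequences (strings).
    A feature such as "2K" is 1 for a sequence that carries residue K at
    1-based alignment column 2, and 0 otherwise.
    Columns where every sequence agrees carry no signal and are skipped.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("sequences must be aligned to equal length")

    selected = []  # (0-based column, residue) pairs kept as features
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        if len(counts) < 2:
            continue  # invariant column
        for residue in sorted(counts):
            if counts[residue] >= min_count:
                selected.append((col, residue))

    matrix = [[int(seq[col] == res) for col, res in selected] for seq in alignment]
    labels = [f"{col + 1}{res}" for col, res in selected]
    return labels, matrix


# Hypothetical toy alignment (not real HIV-1 data).
toy_alignment = ["MKNT", "MRNT", "MKDT", "MRDS"]
feature_names, X = site_residue_features(toy_alignment)
print(feature_names)
for seq, row in zip(toy_alignment, X):
    print(seq, row)
```

Each row of `X` can then be paired with a phenotype label and passed to any classifier that accepts a binary feature matrix.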
Summary
The challenge of predicting a viral phenotype from sequence data has many motivating examples in HIV-1 research. [2]) are well established and used both in research [3] and in clinical practice [4]. These algorithms have been developed based on large training sets using phenotypic assays, for example those measuring the half maximal inhibitory concentration (IC50) of an antiretroviral drug (ARV) [5], to label sequences as resistant or susceptible. As a byproduct of bNab characterization, large panels of phenotypic (IC50) data and matched envelope sequences have been generated, and several recent efforts [44,45,46,47,48] have been directed at applying machine learning techniques to these data in order to predict the resistance phenotypes of HIV-1 strains and to infer antibody epitopes. IDEPI is customizable: different machine learning algorithms implemented in scikit-learn can be used; new sequence features can be defined using a well-specified application programming interface (API); various feature selection approaches (e.g. forward or backward selection) can be used; performance can be optimized with respect to many metrics (e.g. sensitivity).
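The interplay of IC50-based labeling, forward feature selection, and a chosen performance metric can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not IDEPI's implementation: the toy "any selected feature present" classifier stands in for a scikit-learn estimator, the IC50 cutoff of 10 and the data are invented, and scores are computed on the training data itself, whereas a real analysis would use cross-validation.

```python
import math


def label_by_ic50(ic50_values, cutoff):
    """1 = resistant (IC50 above the cutoff), 0 = susceptible."""
    return [int(v > cutoff) for v in ic50_values]


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) counts for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when undefined."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def predict_any(X, selected):
    """Toy classifier (an assumption): resistant if any selected feature is present."""
    return [int(any(row[j] for j in selected)) for row in X]


def forward_select(X, y, metric, max_features):
    """Greedily add the feature that most improves `metric`; stop when none does."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        candidates = [
            (metric(y, predict_any(X, selected + [j])), j)
            for j in range(len(X[0]))
            if j not in selected
        ]
        if not candidates:
            break
        # Highest score wins; ties go to the lowest feature index.
        score, j = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best:
            break
        selected.append(j)
        best = score
    return selected, best


# Hypothetical data: 6 sequences, 4 binary features, invented IC50 values.
X = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
y = label_by_ic50([50.0, 32.0, 0.4, 45.0, 1.2, 0.08], cutoff=10.0)

chosen_mcc, score_mcc = forward_select(X, y, mcc, max_features=3)
chosen_sens, score_sens = forward_select(X, y, sensitivity, max_features=3)
print("MCC:", chosen_mcc, score_mcc)
print("sensitivity:", chosen_sens, score_sens)
```

Passing the metric as a function argument mirrors the idea that the same selection procedure can be tuned toward different performance criteria; backward selection would start from all features and remove them one at a time instead.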