Machine learning methods for HIV/AIDS diagnostics and therapy planning

Sandhya Prabhakaran

doi:10.5451/unibas-006230957

Abstract

This thesis develops Machine Learning methods and applies them to HIV/AIDS diagnostics and therapy planning. It addresses the domain from two facets. In Facet I, we analyse the genetically diverse HIV populations present in an infected patient's blood samples. Understanding genetic diversity is crucial for further insight into viral-host interactions, the evolution of drug-resistant viral lineages within an infected host, and personalised medication, where drugs are prescribed to a patient based on their viral lineage. Recent sequencing technologies generate short fragments of the viral strains, called reads, from infected blood samples, and these reads are used in genetic-diversity studies. The puzzle lies in matching every read to its parent strain, or haplotype, which at first sight is a standard clustering task. Because the reads are error-prone and of limited length, however, the main modelling challenge is that non-overlapping reads have no suitable a priori pairwise similarity measure; this makes the clustering problem non-standard. None of the previous approaches has provided a convincing strategy for this issue. In this work we overcome the problem by introducing a propagating Dirichlet Process Mixture Model.

In Facet II, we take the first steps towards identifying similarity patterns between drugs used in HIV/AIDS therapy and active chemical compounds. Currently only a small number of anti-HIV drugs is available for preparing drug cocktails. When a viral lineage becomes resistant to a particular drug, it tends to show resistance to other drugs in the same drug category, a property called cross-resistance. This situation demands the development of newer and more resilient drugs, and thus an in-depth understanding of the similarities between the current drugs and active chemical compounds is necessary. We gain this understanding by examining a landscape of active chemical compounds that also contains the drugs. To this end, we develop two models: one for Network Inference and another for Automatic Archetype Analysis. For network inference, we present a fully probabilistic approach that infers networks from the pairwise Euclidean distances of n objects, where the objects are active chemical compounds. For automatic archetype analysis, we develop a sparsity-inducing model based on a Group-Lasso formulation that identifies the representative (archetypal) objects among a set of n objects (the active chemical compounds). The model is equipped with a well-defined criterion, the Bayesian Information Criterion (BIC), that enables automatic model selection.
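To make the Group-Lasso idea concrete, the following is a minimal sketch of one generic way to pick representative objects with a group-sparse penalty. It is not the model developed in the thesis: the self-representation objective, the proximal-gradient solver, and the names `select_archetypes`, `group_soft_threshold` and `lambda_max` are all assumptions made for illustration only.

```python
import numpy as np


def group_soft_threshold(C, tau):
    """Shrink each column of C towards zero by tau (in Euclidean norm)."""
    norms = np.linalg.norm(C, axis=0)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return C * scale  # scale[j] multiplies column j


def lambda_max(X):
    """Smallest penalty for which the all-zero solution is optimal."""
    G = X @ X.T
    return np.linalg.norm(G, axis=0).max()


def select_archetypes(X, lam, n_iter=500, tol=1e-8):
    """Hypothetical illustration, not the thesis model.

    Approximately minimises
        0.5 * ||X - C X||_F^2 + lam * sum_j ||C[:, j]||_2
    by proximal gradient descent. Rows of X are the n objects. Column j
    of C holds the weights with which object j helps reconstruct all
    objects, so a column that is driven to zero means object j is not
    used as a representative.
    """
    G = X @ X.T                              # n x n Gram matrix
    step = 1.0 / np.linalg.eigvalsh(G)[-1]   # 1 / largest eigenvalue
    C = np.zeros_like(G)
    for _ in range(n_iter):
        grad = C @ G - G                     # gradient of the squared loss
        C = group_soft_threshold(C - step * grad, step * lam)
    active = np.flatnonzero(np.linalg.norm(C, axis=0) > tol)
    return active, C


# Toy usage on random data (assumed, not taken from the thesis).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
archetypes, _ = select_archetypes(X, lam=0.5 * lambda_max(X))
print("indices of selected representatives:", archetypes)
```

In this sketch the penalty weight `lam` is fixed by hand. The thesis instead reports using the BIC for automatic model selection, which this example does not reproduce.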
