Computational identification and characterization of genotype-phenotype associations

Simo Kitanovski

doi:10.17185/duepublico/73531

Abstract

The adaptive immune system is essential for defending the host against diverse and rapidly evolving pathogens and for controlling diseases such as cancer. To perform this function, adaptive immunity depends on enormously diverse repertoires of B- and T-cell receptors (BCRs and TCRs). Thanks to rapid advances in high-throughput sequencing (HTSeq) technologies, it is now possible to study the properties of these repertoires, which is central to the development of vaccines, new prognostic markers, and treatments for cancer and autoimmune diseases. One challenge in extracting biologically meaningful information from HTSeq data is that the data are both complex and massive. Further improvements in HTSeq technologies can be expected to generate even larger datasets, with hundreds of millions of sequenced reads from potentially hundreds or thousands of individuals. Meeting these challenges requires new computational methods. Furthermore, the biological processes that contribute to the diversity of BCR repertoires are stochastic in nature, which calls for probabilistic modeling to describe them accurately.

I begin this thesis with an introduction to the most relevant concepts of B-cell-mediated immunity (chapter 1), followed by a general introduction to probabilistic modeling for Bayesian inference (chapter 2). The main results of this thesis are computational methods, which are summarized in two publications (chapter 3).

In the first publication (section 3.1), I introduce IgGeneUsage, a computational tool for probabilistic detection of differential Ig gene usage between biological conditions (e.g. infected vs. healthy subjects). V(D)J recombination of different germline-encoded Ig genes is an important contributor to the enormous diversity of BCR repertoires. Disrupted Ig gene usage has previously been reported, e.g. in chronic lymphocytic leukemia, and specific Ig gene disruptions may be used as prognostic markers for different diseases. Despite the importance of this feature, most analyses of differential Ig gene usage are either qualitative or rely on inadequate statistical methods. IgGeneUsage employs a hierarchical probabilistic model for Bayesian inference and can cope with complex and noisy Ig gene usage data. The results reported by IgGeneUsage are statistically sound and easy for non-statisticians to interpret. The performance of IgGeneUsage was compared against methods commonly used to test for differential Ig gene usage, namely Welch’s t-test (t-test) and the Wilcoxon rank-sum test (U-test). The evaluation was based on publicly available human BCR repertoire data in which biological replicates were available for each repertoire. It revealed that IgGeneUsage generates consistent results across replicates, whereas the t- and U-tests produce divergent results.

In the second publication (section 3.2), I present the results of a collaborative project in which we examined the effects of chronic hepatitis C virus (HCV) infection on the human BCR repertoire. This involved diverse computational analyses of HTSeq data of human immunoglobulin heavy chain VDJ rearrangements, obtained from different B-cell populations in healthy and HCV-infected individuals. In patients infected with HCV, our analyses revealed large perturbations such as aberrant Ig gene usage, clonal expansions, and changes in CDR3 length. To perform these analyses, we developed numerous computational methods for the different stages of BCR repertoire profiling.
