A Theory of Genetic Analysis Using Transcriptomic Phenotypes

David Angeles-Albores

doi:10.7907/jrns-ns05.

Abstract

This thesis deals with the conceptual and computational framework required to use transcriptomes as effective phenotypes for genetic analysis. I demonstrate that there are powerful theoretical reasons why Batesonian epistasis should feature prominently in transcriptional phenotypes. I also show how to compute and interpret the aggregate statistics for transcriptome-wide epistasis and transcriptome-wide dominance using whole-organism transcriptomic profiles of C. elegans mutants. Finally, I developed the WormBase Enrichment Suite for enrichment analysis of genomic data.

RNA-seq has enormous potential as a tool because it relies on protocols that are fast, simple and increasingly cheap. Despite this potential, the use of transcriptomes has largely been limited to single-factor experiments. Even when many transcriptomes are collected, the main analytic approach is to apply clustering algorithms that correlate responses but have no power to identify causal mechanisms. I demonstrate that if a complete genetic experimental design is used (in the form of a full two-factor matrix), transcriptomes can establish genetic interactions between a pair of genes without the need for clustering algorithms. Surprisingly, when we performed epistasis analyses of hypoxia pathway mutants in C. elegans, we did not simply observe a generalized epistatic interaction between the mutants. In fact, the transcriptomes recapitulated the same Batesonian epistatic relationship that had been observed using classical phenotypes. In other words, we observed that the transcriptomic phenotype of one gene can be masked by the transcriptomic phenotype of a second gene, such that a double mutant of these two genes has exactly the same phenotype as a single mutant of the epistatic gene. Motivated by this observation, we developed methods to recognize and interpret Batesonian epistasis at the transcriptomic level.
This approach relies on the calculation of a single aggregate coefficient that we named the transcriptome-wide epistasis coefficient.

The observation that Batesonian epistasis could be reproduced at the transcriptomic level was surprising. To explain how transcriptome-wide epistasis can arise, I studied a simplified model of transcriptional regulation using statistical mechanics. These studies demonstrate that epistatic analysis is equivalent to a perturbative analysis of the partition function of a promoter. Moreover, these studies revealed that a sufficient condition for Batesonian epistasis is that the two genes encode variables that are transformed and multiplied together to form a single effective compound variable. Finally, these studies clearly demonstrate the connection between statistical (or generalized) epistasis and Batesonian epistasis and establish a physical basis for genetic logic.

Genetic analyses of gene functional units can also be carried out using allelic series in tandem with complementation (also known as dominance) tests. I developed a statistical coefficient known as transcriptome-wide dominance to enable analyses of allelic series using expression profiles. A crucial aspect of allelic series is the ability to enumerate the independent phenotypes associated with an arbitrary set of alleles. I developed the concept of phenotypic classes as a transcriptomic analogue of classical phenotypes for this purpose. Briefly, a phenotypic class is a set of transcripts that are differentially expressed in a specific set of genotypes. Thus, an allelic series consisting of two mutant alleles (and a wild-type) can result in at most 7 phenotypic classes. However, some of these phenotypic classes may be artifactual as a result of the significant false positive and false negative rates that are associated with RNA-seq.
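The count of 7 phenotypic classes can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that the three genotypes compared against the wild-type are the two homozygous mutants and the trans-heterozygote, so that the classes correspond to the 2^3 - 1 = 7 non-empty subsets of genotypes. The genotype labels, function names and example calls below are hypothetical and are not taken from the thesis.

```python
from itertools import combinations

# Hypothetical labels: two homozygous mutants and the trans-heterozygote,
# each compared against the wild-type (an assumption of this sketch).
GENOTYPES = ("a/a", "b/b", "a/b")


def enumerate_classes(genotypes):
    """Return every non-empty subset of genotypes; each is a candidate phenotypic class."""
    return [
        frozenset(subset)
        for size in range(1, len(genotypes) + 1)
        for subset in combinations(genotypes, size)
    ]


def assign_classes(de_calls):
    """Group transcripts by the exact set of genotypes in which they are differentially expressed.

    de_calls maps a transcript name to the set of genotypes in which it was
    called differentially expressed. Transcripts with no calls are skipped.
    """
    classes = {}
    for transcript, genotypes in de_calls.items():
        if genotypes:
            classes.setdefault(frozenset(genotypes), set()).add(transcript)
    return classes


ALL_CLASSES = enumerate_classes(GENOTYPES)

# Made-up differential expression calls, for illustration only.
EXAMPLE_CALLS = {
    "t1": {"a/a", "b/b", "a/b"},
    "t2": {"a/a"},
    "t3": {"a/a"},
    "t4": set(),
}
OBSERVED = assign_classes(EXAMPLE_CALLS)

print(len(ALL_CLASSES), "possible classes;", len(OBSERVED), "observed in the example")
```

Grouping by the exact set of genotypes, rather than by membership in any one genotype, is what makes the classes mutually exclusive. This sketch does not attempt to flag artifactual classes.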
I developed a simple algorithm that tries to identify phenotypic classes that are artifactual, though often these classes may also be identified through a critical evaluation of their biological implications. I applied these concepts to a small allelic series of the dpy-22 gene, which encodes a Mediator subunit in C. elegans, and identified 3–4 functional units along with their sequence requirements.

Finally, I developed the WormBase Enrichment Suite by implementing a hypergeometric test on the tissue, gene and phenotype ontologies for C. elegans. The importance of this tool derives mainly from its integration with WormBase, the repository of all C. elegans knowledge, which means that the databases that are tested will undergo continuous improvement and curation, and thus will yield the most accurate results.
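As a rough illustration of the hypergeometric test mentioned above, the following sketch computes an enrichment p-value for one ontology term using only the Python standard library. It is not the WormBase Enrichment Suite implementation; the function name, argument names and the numbers in the usage example are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, population, annotated, drawn):
    """Return P(X >= k) for a hypergeometric random variable X.

    population: number of genes in the background set
    annotated:  number of background genes annotated with the term
    drawn:      number of genes in the query list
    k:          number of query genes annotated with the term
    """
    lower = max(k, 0)
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    # Sum the upper tail; math.comb returns 0 when the second argument
    # exceeds the first, so impossible outcomes contribute nothing.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(lower, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 background genes, 3 annotated with the term,
# a query list of 4 genes, at least 1 of which carries the annotation.
p_value = hypergeom_enrichment_pvalue(1, population=10, annotated=3, drawn=4)
print(round(p_value, 4))
```

In practice one such test would be run per ontology term, so the resulting p-values would need to be corrected for multiple comparisons before interpretation.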

Full Text