Borrowing information across genes and experiments for improved error variance estimation in microarray data analysis and statistical inferences for gene expression heterosis

Tieming Ji

doi:10.31274/etd-180810-1548

Abstract

Advances in microarray technology enable the simultaneous measurement of expression levels of thousands of genes. However, due to the relatively high cost of making a replicate in a microarray experiment, the number of replicates in a single experiment is typically small. This results in the “small n, large p” problem for statistical inference, where there are gene expression measurements for many genes, but only a few biological replicates (or observations) for each gene. In this dissertation, we develop statistical models and methods for microarray data that borrow information across genes, and in some cases across experiments, to improve statistical inferences for specific biological questions. In Chapter 2, we develop statistical methods to improve the estimation of gene expression error variances. Good estimation of error variances is crucial for detecting differentially expressed genes (genes that differ in mean expression level across treatments or conditions of interest). Since the sample size available for each gene is often low, the usual unbiased estimator of the error variance can be unreliable. Shrinkage methods, including empirical Bayes approaches that borrow information across genes to produce more stable estimates, have been developed in recent years. Because the same microarray platform is often used for at least several experiments to study similar biological systems, there is an opportunity to improve variance estimation further by borrowing information not only across genes but also across experiments. We propose a lognormal model for error variances that involves random gene effects and random experiment effects. Based on the model, we develop an empirical Bayes estimator of the error variance for each combination of gene and experiment and call this estimator BAGE because information is Borrowed Across Genes and Experiments. A permutation strategy is used to make inferences about the differential expression status of each gene. Simulation studies with data generated from different probability models and real microarray data show that our method outperforms existing approaches.
In Chapter 3, we develop statistical methods to improve the estimation and testing of gene expression heterosis. Heterosis, also known as hybrid vigor, refers to the superior phenotype of the hybrid offspring relative to its two inbred parents. Though the heterosis phenomenon has been extensively utilized in agriculture for over a century, its molecular basis is still unknown. In an effort to understand the basic mechanisms responsible for phenotypic heterosis at the molecular level, researchers have begun to compare expression levels of thousands of genes in the parental inbred lines and their offspring to find genes that exhibit gene expression heterosis. In our study, we focus on three types of gene expression heterosis: high-parent heterosis, low-parent heterosis, and mid-parent heterosis. Currently, the sample average method is the most commonly used method for estimation and testing of gene expression heterosis. However, the sample average estimators underestimate high-parent heterosis and low-parent heterosis, which consequently leads to loss of power in hypothesis testing. Though the sample average estimator for mid-parent heterosis is unbiased, estimation is highly variable with only a few replicates in a typical microarray experiment. To improve the estimation and testing of all three types of gene expression heterosis, we develop a hierarchical model that permits information sharing across genes. Based on the model, we derive empirical Bayes estimators and test gene expression heterosis using posterior probabilities. The effectiveness of our approach is demonstrated through simulations based on two real heterosis microarray experiments as well as hypothetical probability models that violate our model assumptions.
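The bias of the sample average estimator noted above can be illustrated with a small Monte Carlo sketch. This is not the dissertation's method or data: the function name `simulate_heterosis_bias`, the parameter values, and the normal error model are all assumptions made for illustration, and high-parent and mid-parent heterosis are taken here, following the usual convention, to be the hybrid mean minus the larger parent mean and minus the average of the two parent means, respectively. The sketch shows that when the two parents have similar true means, the maximum of two noisy sample averages tends to overshoot the larger true mean, so the high-parent estimate is pulled downward while the mid-parent estimate is not.

```python
import random
import statistics


def simulate_heterosis_bias(mu_p1=10.0, mu_p2=10.0, mu_hybrid=11.0,
                            sigma=1.0, n_reps=4, n_sims=20000, seed=0):
    """Monte Carlo comparison of sample-average heterosis estimates
    with their true values for a single hypothetical gene.

    All defaults are illustrative assumptions, not values from the source.
    Returns (true_hph, mean_hph_estimate, true_mph, mean_mph_estimate).
    """
    rng = random.Random(seed)

    def sample_mean(mu):
        # Average of n_reps independent normal replicates for one genotype.
        return statistics.fmean(rng.gauss(mu, sigma) for _ in range(n_reps))

    # True heterosis values under the assumed definitions.
    true_hph = mu_hybrid - max(mu_p1, mu_p2)
    true_mph = mu_hybrid - (mu_p1 + mu_p2) / 2.0

    hph_estimates = []
    mph_estimates = []
    for _ in range(n_sims):
        p1 = sample_mean(mu_p1)
        p2 = sample_mean(mu_p2)
        hybrid = sample_mean(mu_hybrid)
        # Plug-in (sample average) estimators.
        hph_estimates.append(hybrid - max(p1, p2))
        mph_estimates.append(hybrid - (p1 + p2) / 2.0)

    return (true_hph, statistics.fmean(hph_estimates),
            true_mph, statistics.fmean(mph_estimates))


if __name__ == "__main__":
    true_hph, est_hph, true_mph, est_mph = simulate_heterosis_bias()
    print(f"high-parent: true={true_hph:.3f}, mean estimate={est_hph:.3f}")
    print(f"mid-parent:  true={true_mph:.3f}, mean estimate={est_mph:.3f}")
```

Under these assumptions the average high-parent estimate falls below the true value, while the average mid-parent estimate stays close to it; the gap shrinks as the parent means move apart or the number of replicates grows.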
Chapter 4 presents a statistical analysis of a soil-based carbon sequestration experiment. Motivated by global climate change due to the increasing level of atmospheric carbon dioxide, researchers have proposed a soil-based carbon sequestration approach, which reduces carbon dioxide emission from crop residues after harvesting and sequesters more carbon into the land as a soil nutrient. Previous research has reported significant differences across species in their rates of residue decomposition and the amount of carbon dioxide emission. Because biomass composition varies across maize genotypes, we hypothesize that there are also differences among genotypes within the maize species in their rates of biomass decomposition and their capacity for carbon sequestration. We designed and performed a longitudinal experiment to measure the amount of carbon dioxide flux from crop stover samples of 14 maize varieties. Flux observations for more than 150 days were collected. We modeled the logarithm of carbon dioxide flux as a linear function of genotype, day, and genotype-by-day interaction effects as well as several other important fixed and random factors. The analysis results show significant differences among maize varieties with respect to the accumulated carbon dioxide flux from crop residues as well as the flux pattern over time. We also investigate relationships between carbon dioxide emission and several potentially influential chemical compounds in the maize residue biomass composition. These results suggest the potential for development of “carbon capturing crops” through bioengineering or hybrid methods.
