Abstract

RNA-sequencing (RNA-seq) technology allows for the identification of differentially expressed genes, that is, genes whose mean transcript abundance levels vary across conditions. In practice, RNA-seq datasets often include covariates of primary interest together with a set of covariates that are subject to selection. Some of these covariates may be relevant to gene expression levels, while others may be irrelevant. Either ignoring relevant covariates or adjusting for irrelevant ones can compromise the identification of differentially expressed genes. To address this issue, we propose a variable selection method that uses pseudo-variables to control the expected proportion of selected covariates that are irrelevant. Our method accurately selects relevant covariates while keeping the false selection rate below a specified level. We demonstrate that our method outperforms existing methods for detecting differentially expressed genes when working with available covariates. Our method is implemented in an R package available at www.github.com/ntyet/csrnaseq. The analysis and simulation are available at www.github.com/ntyet/csrnaseq/tree/main/analysis.