Abstract
Background: We recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the AdaBoost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.
Methods: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.
Results: Using the improved scalability of Robust GeneClass, we present larger-scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from .
Highlights
We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass.
We apply the "robust" implementation of the GeneClass algorithm to study the regulatory response of the yeast S. cerevisiae under environmental stress and DNA damage conditions.
The output is a prediction function in the form of an alternating decision tree (ADT). This function predicts up/down regulation of a gene-experiment example using a tree based on questions of the form, "Is motif X present in the upstream region of the gene and is the state of regulator Y up/down in that experiment?" Unlike regular decision trees, which make yes/no predictions, ADTs generate real-valued prediction scores whose sign gives the up/down prediction and whose magnitude gives a measure of confidence in that prediction.
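To make the scoring rule concrete, the following is a minimal sketch of how an ADT could turn motif/regulator questions into a real-valued score. It is not the Robust GeneClass implementation: the `Splitter` class, the `adt_score` and `predict` functions, the names `motif_X`, `regulator_Y`, `motif_Z`, `regulator_W`, and all numeric scores are hypothetical and chosen only for illustration. A `no_score` of 0.0 is used here to mimic an abstaining rule, which is an assumption about how abstention might be represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Splitter:
    """One question: is `motif` present AND is `regulator` in `state`?"""
    motif: str
    regulator: str
    state: int                     # +1 = regulator up, -1 = regulator down
    yes_score: float               # added to the total when the question holds
    no_score: float = 0.0          # 0.0 models an abstaining rule (assumption)
    yes_children: List["Splitter"] = field(default_factory=list)
    no_children: List["Splitter"] = field(default_factory=list)


def adt_score(root_score: float, splitters: List[Splitter],
              motifs: Set[str], regulator_states: Dict[str, int]) -> float:
    """Sum the prediction values on every path consistent with the example."""
    total = root_score
    stack = list(splitters)
    while stack:
        node = stack.pop()
        holds = (node.motif in motifs
                 and regulator_states.get(node.regulator, 0) == node.state)
        if holds:
            total += node.yes_score
            stack.extend(node.yes_children)
        else:
            total += node.no_score
            stack.extend(node.no_children)
    return total


def predict(score: float) -> int:
    """Sign of the score is the up/down call; |score| is the confidence."""
    return 1 if score >= 0 else -1


# Hypothetical two-question tree: the second question is only asked
# when the first one holds.
child = Splitter("motif_Z", "regulator_W", -1, yes_score=-0.5)
tree = [Splitter("motif_X", "regulator_Y", +1, yes_score=0.8,
                 no_score=-0.2, yes_children=[child])]
ROOT_SCORE = 0.1

score = adt_score(ROOT_SCORE, tree, {"motif_X"},
                  {"regulator_Y": 1, "regulator_W": -1})
print(score, predict(score))
```

In this sketch a gene-experiment example can reach several questions at once, and every reachable prediction value contributes to the sum; that additive structure is what distinguishes an ADT from a regular decision tree, where a single leaf decides the outcome.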
Summary
Among recent studies that try to learn a global model of gene regulation in an organism, rather than extracting statistically significant regulatory patterns, most attempt to discover structure in the dataset as formalized by probabilistic models [5,6,7,8,9,10] (often graphical models or Bayesian networks). Most of these structure learning approaches build a model from training data and provide useful exploratory tools that allow the user to generate biological hypotheses about transcriptional regulation from the model; these models are rarely used to make accurate predictions about which genes will be up- or down-regulated in new or held-out experiments (test data). While these probabilistic approaches give a rich description of biological data and a way to generate hypotheses, the frequent lack of validation on an independent test set makes it difficult to directly compare the performance of the different algorithms or to decide whether a model has overfit the training data.