A classification-based framework for predicting and analyzing gene regulatory response.

Anshul Kundaje, Mihir Shah, Chris H Wiggins, Christina Leslie, Manuel Middendorf, Yoav Freund

doi:10.1186/1471-2105-7-s1-s5

Abstract

Background: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the AdaBoost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.

Methods: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.

Results: Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download.
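The "fast matrix computation of the loss function for all features" mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each feature is a (motif, parent-state) pair that fires when the motif is present and the regulator is in the queried state, and it uses the standard boosting loss for abstaining weak rules, Z = W_0 + 2·sqrt(W_+ · W_-). The function name, array layouts, and the use of label 0 for examples inside the noise band are all assumptions made for illustration.

```python
import numpy as np

def pair_feature_losses(motifs, parents, labels, weights):
    """Boosting loss Z for every (motif, parent-state) feature at once.

    motifs  : (n_genes, n_motifs) 0/1, motif present in the gene's regulatory region
    parents : (n_parents, n_expts) 0/1, regulator is in the queried state in the experiment
    labels  : (n_genes, n_expts) in {+1, -1, 0}; 0 marks examples inside the noise band
    weights : (n_genes, n_expts) non-negative boosting weights
    """
    # Split the weights by label and normalize over the labeled examples only.
    w_pos = weights * (labels == 1)
    w_neg = weights * (labels == -1)
    total = w_pos.sum() + w_neg.sum()
    w_pos, w_neg = w_pos / total, w_neg / total

    # fires_pos[m, p] = sum over (g, e) of motifs[g, m] * w_pos[g, e] * parents[p, e],
    # i.e. the weight of up-regulated examples on which feature (m, p) fires.
    fires_pos = motifs.T @ w_pos @ parents.T
    fires_neg = motifs.T @ w_neg @ parents.T

    # Weight of examples on which the rule abstains (the feature does not fire).
    abstain = 1.0 - fires_pos - fires_neg

    # Loss for an abstaining weak rule: Z = W_0 + 2 * sqrt(W_+ * W_-).
    return abstain + 2.0 * np.sqrt(fires_pos * fires_neg)
```

Two matrix products replace a loop over every motif, regulator, gene, and experiment; the feature with the smallest loss is then `np.unravel_index(np.argmin(z), z.shape)` for the returned array `z`.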

Highlights

We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass.
We apply the "robust" implementation of the GeneClass algorithm to study the regulatory response of the yeast S. cerevisiae under environmental stress and DNA damage conditions.
The output is a prediction function in the form of an alternating decision tree (ADT). This function predicts up/down regulation of a gene-experiment example using a tree based on questions of the form, "Is motif X present in the upstream region of the gene and is the state of regulator Y up/down in that experiment?" Unlike regular decision trees, which make yes/no predictions, ADTs generate real-valued prediction scores whose sign gives the up/down prediction and whose magnitude gives a measure of confidence in that prediction.
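How an ADT turns such questions into a real-valued score can be sketched as follows. This is an illustrative toy, not the GeneClass code: the `Node` class, the `adt_score` function, the motif and regulator names, and all score values are hypothetical. In line with the abstaining weak rules described in the abstract, the sketch assumes a node contributes its score only when its question is answered "yes", and that its child questions are asked only in that case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Node:
    """One question: is `motif` present and is `regulator` in `state` (+1 up, -1 down)?"""
    motif: str
    regulator: str
    state: int
    score: float                                   # added to the total when the answer is yes
    children: List["Node"] = field(default_factory=list)

def adt_score(root_score: float, nodes: List[Node],
              gene_motifs: Set[str], regulator_states: Dict[str, int]) -> float:
    """Sum the scores of every node reached for one gene-experiment example."""
    total = root_score
    stack = list(nodes)
    while stack:
        node = stack.pop()
        fires = (node.motif in gene_motifs
                 and regulator_states.get(node.regulator, 0) == node.state)
        if fires:
            total += node.score
            stack.extend(node.children)            # child questions are asked only after a yes
        # Otherwise the rule abstains: no contribution, children are skipped.
    return total

# Hypothetical tree and example; names and values are made up for illustration.
tree = [
    Node("MOTIF_A", "REG_1", +1, 0.8, [Node("MOTIF_B", "REG_2", -1, -0.3)]),
    Node("MOTIF_C", "REG_1", -1, -0.5),
]
score = adt_score(0.1, tree, {"MOTIF_A", "MOTIF_B"}, {"REG_1": +1, "REG_2": -1})
prediction = "up" if score > 0 else "down"         # sign gives the label, |score| the confidence
```

Because several branches can fire for the same example, the score is a sum over all reached nodes rather than the value at a single leaf, which is what distinguishes an ADT from a regular decision tree.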

Summary

Introduction

Among recent studies that try to learn a global model of gene regulation in an organism, rather than extracting statistically significant regulatory patterns, most attempt to discover structure in the dataset as formalized by probabilistic models [5,6,7,8,9,10] (often graphical models or Bayesian networks). Most of these structure learning approaches build a model from training data and provide useful exploratory tools that allow the user to generate biological hypotheses about transcriptional regulation from the model; these models are rarely used to make accurate predictions about which genes will be up- or down-regulated in new or held-out experiments (test data). While these probabilistic approaches give a rich description of biological data and a way to generate hypotheses, validation on an independent test set is often missing, which makes it difficult to compare the performance of different algorithms directly or to decide whether a model has overfit the training data.

Journal: BMC Bioinformatics	Publication Date: Mar 1, 2006
Citations: 43	License type: CC BY 2.0
