Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets

George W Bassel, Julietta Marquez, Michael J Holdsworth, Enrico Glaab, Jaume Bacardit

doi:10.1105/tpc.111.088153

Open Access

https://doi.org/10.1105/tpc.111.088153

Abstract

The meta-analysis of large-scale postgenomics data sets within public databases promises to provide important novel biological knowledge. Statistical approaches including correlation analyses in coexpression studies of gene expression have emerged as tools to elucidate gene function using these data sets. Here, we present a powerful and novel alternative methodology to computationally identify functional relationships between genes from microarray data sets using rule-based machine learning. This approach, termed "coprediction," is based on the collective ability of groups of genes co-occurring within rules to accurately predict the developmental outcome of a biological system. We demonstrate the utility of coprediction as a powerful analytical tool using publicly available microarray data generated exclusively from Arabidopsis thaliana seeds to compute a functional gene interaction network, termed Seed Co-Prediction Network (SCoPNet). SCoPNet predicts functional associations between genes acting in the same developmental and signal transduction pathways irrespective of the similarity in their respective gene expression patterns. Using SCoPNet, we identified four novel regulators of seed germination (ALTERED SEED GERMINATION5, 6, 7, and 8), and predicted interactions at the level of transcript abundance between these novel and previously described factors influencing Arabidopsis seed germination. An online Web tool to query SCoPNet has been developed as a community resource to dissect seed biology and is available at http://www.vseed.nottingham.ac.uk/.

Highlights

Advances in postgenomic technologies and their use by the scientific community are generating increasing quantities of high quality genome-wide transcriptomic data sets
Common approaches to uncovering gene function from transcriptome data include both differential expression across conditions and the calculation of correlations between gene expression levels across a large number of samples (Zimmermann et al, 2004; Toufighi et al, 2005; Brady and Provart, 2009; Usadel et al, 2009)
The input data, or training set, for this study comprised 122 publicly available microarray data sets generated from imbibed Arabidopsis seeds (Ogawa et al, 2003; Yamauchi et al, 2004; Nakabayashi et al, 2005; Cadman et al, 2006; Penfield et al, 2006; Carrera et al, 2007, 2008; Finch-Savage et al, 2007; Bassel et al, 2008)

Introduction

Advances in postgenomic technologies and their use by the scientific community are generating increasing quantities of high quality genome-wide transcriptomic data sets. Common approaches to uncovering gene function from transcriptome data include both differential expression across conditions and the calculation of correlations between gene expression levels across a large number of samples (Zimmermann et al, 2004; Toufighi et al, 2005; Brady and Provart, 2009; Usadel et al, 2009). This latter approach, termed coexpression analysis, is based on the guilt-by-association paradigm: genes under the control of a common transcriptional regulatory mechanism are more likely to be involved in the same biochemical or developmental pathway, and their corresponding proteins are more likely to interact (Hughes et al, 2000; Usadel et al, 2009; Lee et al, 2010; Bassel et al, 2011; Mutwil et al, 2011). This correlative approach has successfully elucidated gene function in Arabidopsis thaliana through the use of genome-wide coexpression networks, leading to the identification of genes essential in the life cycle of Arabidopsis (Mutwil et al, 2010) and in the regulation of seed germination (Bassel et al, 2011).
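
To make the coexpression idea concrete, the following minimal Python sketch computes pairwise Pearson correlations between gene expression profiles across samples and keeps the pairs above a cutoff as network edges. It is an illustration only, not the pipeline used by any of the cited studies: the gene names, the expression values, the function names, the 0.9 cutoff, and the choice to threshold the absolute value of the correlation are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for gene pairs with |r| >= threshold.

    `expression` maps a gene name to its expression values, one value
    per sample, with samples in the same order for every gene.
    The threshold is an arbitrary illustrative cutoff (assumption).
    """
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical expression values for three made-up genes across five samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises and falls with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],   # no consistent relation to geneA
}

edges = coexpression_edges(toy_expression)
print(edges)
```

In this toy input only geneA and geneB are linked, because their profiles rise and fall together across samples. Under guilt by association, such an edge is read as a hint that the two genes may share a regulatory mechanism or pathway; it is not evidence of a direct interaction.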
