Abstract

Cis-regulatory modules (CRMs) and motifs play a central role in tissue- and condition-specific gene expression. Here we present Imogene, an ensemble of statistical tools that we have developed to facilitate their identification and implemented in publicly available software. Starting from a small training set of mammalian or fly CRMs that drive similar gene expression profiles, Imogene determines de novo the cis-regulatory motifs that underlie this co-expression. It can then predict, on a genome-wide scale, other CRMs with a regulatory potential similar to that of the training set. Imogene bypasses the need for large datasets for statistical analyses by making central use of the information provided by the sequenced genomes of multiple species, relying on dedicated statistical tools and explicit models for transcription factor binding site evolution. We test Imogene on characterized tissue-specific mouse developmental CRMs. Its ability to identify CRMs with the same specificity based on its de novo created motifs is comparable to that of previously evaluated ‘motif-blind’ methods. We further show, both in flies and in mammals, that the motifs generated de novo by Imogene are sufficient to discriminate CRMs related to different developmental programs. Notably, relying purely on sequence data, Imogene performs as well in this discrimination task as a previously reported learning algorithm based on Chromatin Immunoprecipitation (ChIP) data for multiple transcription factors at multiple developmental stages.

Highlights

  • The identification and functional characterization of the non-coding sequences that direct the spatio-temporal specificity of gene expression in eukaryotes is of fundamental importance in developmental biology [1] and can find crucial applications in medicine [2]. These regulatory sequences are generally located distally from gene promoters and termed enhancers or, more generically, cis-regulatory modules (CRMs), since they can either enhance or repress gene expression [3].

  • Starting from a chosen training set, Genmot performs its task in two steps (I and II in Figure 1): (I) Genmot first enlarges the training set with aligned orthologous sequences from other related sequenced genomes, as shown in Supplementary Figure S1 (for the mouse, the 11 other sequenced mammalian genomes with high coverage currently aligned in the Ensembl project [15]; for the fly, the 11 other sequenced Drosophila genomes [16]).

  • This probability weight matrix (PWM) is refined by scanning the training set to find all of its binding sites, i.e. all fixed-length nucleotide sequences in the training set that have a binding score above a generation threshold score Sg chosen at the onset of the procedure (Sg = 13 bits is the default value).
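
The thresholded scan described in the last highlight can be illustrated with a minimal Python sketch. It is not Imogene's implementation: the log-odds score against a uniform background, the toy 8-nucleotide matrix and the names `toy_pwm`, `score_site` and `scan` are all assumptions made for illustration. Only the default threshold Sg = 13 bits comes from the text.

```python
import math

# Assumption: uniform background nucleotide frequencies (not taken from the paper).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def toy_pwm(consensus, p_match=0.97):
    """Build a hypothetical PWM: one probability column per consensus position."""
    p_other = (1.0 - p_match) / 3.0
    return [
        {nt: (p_match if nt == c else p_other) for nt in "ACGT"}
        for c in consensus
    ]


def score_site(site, pwm, background=BACKGROUND):
    """Log-odds score of a site in bits: sum over positions of log2(p / background)."""
    return sum(math.log2(col[nt] / background[nt]) for nt, col in zip(site, pwm))


def scan(sequence, pwm, threshold_bits=13.0):
    """Return (position, site, score) for every window scoring above the threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        site = sequence[i:i + width]
        if any(nt not in BACKGROUND for nt in site):
            continue  # skip windows containing ambiguous symbols such as N
        score = score_site(site, pwm)
        if score > threshold_bits:
            hits.append((i, site, score))
    return hits


pwm = toy_pwm("TGACGTCA")
hits = scan("CCCCTGACGTCACCCC", pwm)          # contains one exact consensus match
no_hits = scan("CCCCTGACGTCTCCCC", pwm)       # same site with a single mismatch
```

With this toy matrix an exact match scores well above 13 bits while a single mismatch falls below it, which shows how a high generation threshold restricts the refinement step to strong candidate sites.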


Summary

Introduction

The identification and functional characterization of the non-coding sequences that direct the spatio-temporal specificity of gene expression in eukaryotes is of fundamental importance in developmental biology [1] and can find crucial applications in medicine [2]. These regulatory sequences are generally located distally from gene promoters and termed enhancers or, more generically, cis-regulatory modules (CRMs), since they can either enhance or repress gene expression [3]. They are usually on the order of 500 nucleotides (nts) long and can be located as far as several megabase pairs away from the transcription start sites (TSSs) of the genes that they regulate.

If putative orthologous sequences were found in enough species to satisfy our conservation requirements (see below), the site was declared a putative conserved site for a regulatory motif.

This filtering step resulted in final sets of 39 limb CRMs (minimal length 789 bp, maximal length 9052 bp and average length 3045 bp) and 29 neural tube CRMs (minimal length 585 bp, maximal length 3045 bp and average length 2419 bp).
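
The conservation rule quoted in the Introduction (a site is kept when orthologous sequences are found in enough species) can be pictured with a deliberately simplified Python sketch. The excerpt does not state the actual requirements, so the plain species count, the `min_species` parameter and the function name `is_conserved` are hypothetical stand-ins, not the criterion used by the authors.

```python
def is_conserved(orthologous_hits, min_species=3):
    """Return True if a motif instance is found in at least `min_species` species.

    `orthologous_hits` maps a species name to True/False, indicating whether a
    putative orthologous site was found in that species' aligned sequence.
    Both the counting rule and the default of 3 are illustrative assumptions.
    """
    n_found = sum(1 for found in orthologous_hits.values() if found)
    return n_found >= min_species


# Hypothetical alignment results for one candidate site in a mouse CRM.
site_a = {"rat": True, "human": True, "dog": True, "cow": False}
site_b = {"rat": True, "human": False, "dog": False, "cow": False}

conserved_a = is_conserved(site_a)   # found in 3 species
conserved_b = is_conserved(site_b)   # found in 1 species
```

A real criterion could weight species by evolutionary distance rather than count them equally; the sketch only shows where such a filter sits in the pipeline.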
