Abstract
Background: The detection of regulatory regions in candidate sequences is essential for understanding the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information theoretic metrics for finding regulatory sequences in promoter regions.
Results: This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with several publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (a Bioconductor package) and MotifRegressor, and with previous work such as Qresiduals projections and information theoretic based detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show that, in 70% of the studied Transcription Factor Binding Sites, the SIGMA detector performs better and behaves more robustly than the methods compared, with a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in modelling the non-linear co-variability in the binding motif positions.
Conclusions: Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model for the detection of cis-regulatory sequences based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance regardless of the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.
Highlights
The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved
We propose a generalisation of a non-linear model based on Information Theory, which allows modelling of both the DNA contact by the protein and the biological interaction among binding sites using a small training set (5–50 sequences)
A new methodology based on a discriminant analysis of two information theoretic measures has been proposed for binding site detection
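As a rough illustration of the kind of information theoretic quantities that can be computed from a small set of aligned binding sites, the following minimal Python sketch estimates the per-position information content and the mutual information between two positions (a simple measure of co-variability). It is not the SIGMA detector and does not reproduce the two measures used by the authors. The function names and the toy sequences are hypothetical, and the sketch assumes equal-length sites over the alphabet ACGT with no small-sample correction.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: unambiguous DNA bases only


def _check(sites):
    """Validate that the sites are non-empty, aligned and use only ACGT."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    for s in sites:
        if len(s) != length:
            raise ValueError("all sites must have the same length")
        if any(base not in ALPHABET for base in s):
            raise ValueError("sites may only contain A, C, G, T")
    return length


def column_entropy(sites, i):
    """Shannon entropy (bits) of the base frequencies at position i."""
    counts = Counter(s[i] for s in sites)
    n = len(sites)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def position_information(sites):
    """Information content per position: log2(4) minus the column entropy."""
    length = _check(sites)
    max_bits = math.log2(len(ALPHABET))
    return [max_bits - column_entropy(sites, i) for i in range(length)]


def mutual_information(sites, i, j):
    """Mutual information (bits) between positions i and j of the sites."""
    _check(sites)
    n = len(sites)
    joint = Counter((s[i], s[j]) for s in sites)
    left = Counter(s[i] for s in sites)
    right = Counter(s[j] for s in sites)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi


# Hypothetical toy training set, not taken from the paper.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
info = position_information(sites)
mi_conserved = mutual_information(sites, 0, 1)  # two fully conserved columns
mi_variable = mutual_information(sites, 2, 3)   # two variable columns
```

Fully conserved positions reach the maximum of 2 bits, while variable positions score lower. Mutual information is zero between conserved columns and positive where the bases at two positions vary together. With training sets as small as those mentioned above, such plug-in estimates are biased, so they should be read as a sketch only.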
Summary
The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved. The binding between specific proteins and their target sites in DNA is a key step in the control of the transcription process. These proteins – transcription factors (TF) – recognise specific motifs in DNA known as Transcription Factor Binding Sites (TFBS) or cis-regulatory sequences. TFBS are usually very short (5 to 20 base pairs long) and highly degenerate, which gives rise to an extremely difficult identification problem due to low statistical power, as short sequences are expected to occur at random every few hundred base pairs. Due to their high variability, a consensus