Abstract

BackgroundStatistics in ranked lists is useful in analysing molecular biology measurement data, such as differential expression, resulting in ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences. State of the art methods study fixed motifs in ranked lists of sequences. More flexible models such as position weight matrix (PWM) motifs are more challenging in this context, partially because it is not clear how to avoid the use of arbitrary thresholds.ResultsTo assess the enrichment of a PWM motif in a ranked list we use a second ranking on the same set of elements induced by the PWM. Possible orders of one ranked list relative to another can be modelled as permutations. Due to sample space complexity, it is difficult to accurately characterize tail distributions in the group of permutations. In this paper we develop tight upper bounds on tail distributions of the size of the intersection of the top parts of two uniformly and independently drawn permutations. We further demonstrate advantages of this approach using our software implementation, mmHG-Finder, which is publicly available, to study PWM motifs in several datasets. In addition to validating known motifs, we found GC-rich strings to be enriched amongst the promoter sequences of long non-coding RNAs that are specifically expressed in thyroid and prostate tissue samples and observed a statistical association with tissue specific CpG hypo-methylation.ConclusionsWe develop tight bounds that can be calculated in polynomial time. We demonstrate utility of mutual enrichment in motif search and assess performance for synthetic and biological datasets. We suggest that thyroid and prostate-specific long non-coding RNAs are regulated by transcription factors that bind GC-rich sequences, such as EGR1, SP1 and E2F3. We further suggest that this regulation is associated with DNA hypo-methylation.

Highlights

  • Statistics in ranked lists is useful in analysing molecular biology measurement data, such as differential expression, resulting in ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences

  • For example, an experiment may lead to measurement results pertaining to genes and questions are asked about the properties of genes for which these were high or low

  • Another example consists of one ranking according to differential expression and one according to prediction scores for miRNA targets. miTEA [11,12] addresses this latter case by statistically assessing the enrichment of miRNA targets in a ranked list of genes

Read more

Summary

Introduction

Statistics in ranked lists is useful in analysing molecular biology measurement data, such as differential expression, resulting in ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences. For example, an experiment may lead to measurement results pertaining to genes and questions are asked about the properties of genes for which these were high or low. This is an example, and the set of differential expression data or of correlations computed between genomic DNA copy number and expression [4,5,6]. A slightly different variant of mmHG is described later

Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.