Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research because binding sites in the genome are degenerate and potentially variable. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs, which are available in research papers, with a subset making it into databases such as JASPAR, UniPROBE and TRANSFAC. The presence of many versions of motifs for a single TF across these databases, together with the lack of a standardized assessment technique, makes it difficult for biologists to choose an appropriate binding model and for algorithm developers to benchmark, test and improve their models. In this study, we review and evaluate the approaches in use, highlight their differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence the ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary with the source of genomic binding data. Finally, we demonstrate that the information content of a motif is not, in isolation, a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
Highlights
Understanding gene regulation remains a long-standing problem in biological research
We focus on transcription factor (TF) binding models represented as position weight matrices (PWMs) and aim to determine how the choice and length of benchmark sequences, the scoring functions, and the statistics used influence motif assessment
We describe a comparative analysis of the effect of scoring functions, chromatin immunoprecipitation (ChIP)-seq test data processing and statistics on motif assessment
Summary
Understanding gene regulation remains a long-standing problem in biological research. The main players, transcription factors (TFs), are proteins that bind to short and potentially degenerate sequence patterns (motifs) at gene regulatory sites to promote or repress expression of target genes. The search for a code to predict binding sites and model the binding affinity of TFs has led to the development of several experimental techniques and motif discovery algorithms (Figure 1). In addition to providing high-resolution data for motif discovery, these experimental techniques are a useful resource for testing the quality of the available motifs, since their data are TF specific. A position weight matrix (PWM) is the most common way of representing TF binding specificity. Motifs can be found using a variety of methods, including algorithms that perform de novo motif discovery from sequences containing binding sites and in vitro methods such as protein binding microarrays (PBM) and high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX).
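To make the PWM representation concrete, the following is a minimal Python sketch, not the method used in this study. It assumes a motif given as per-position base counts, a uniform background of 0.25 per base, and a pseudocount of 0.5. The function names (`counts_to_pwm`, `information_content`, `best_window_score`) and the toy counts are hypothetical and chosen only for illustration. Taking the maximum window score is one possible scoring function among several, and the information content shown here is the standard sum of per-column values in bits.

```python
import math

BASES = "ACGT"


def column_probs(column, pseudocount=0.5):
    """Turn one column of base counts into smoothed probabilities."""
    total = sum(column[b] for b in BASES) + 4 * pseudocount
    return {b: (column[b] + pseudocount) / total for b in BASES}


def counts_to_pwm(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into a log2-odds PWM.

    Assumes a uniform background frequency for every base.
    """
    pwm = []
    for column in counts:
        probs = column_probs(column, pseudocount)
        pwm.append({b: math.log2(probs[b] / background) for b in BASES})
    return pwm


def information_content(counts, pseudocount=0.5):
    """Total information content in bits (2 minus entropy, per column)."""
    total_ic = 0.0
    for column in counts:
        probs = column_probs(column, pseudocount)
        entropy = -sum(p * math.log2(p) for p in probs.values())
        total_ic += 2.0 - entropy
    return total_ic


def best_window_score(pwm, sequence):
    """Highest log-odds score over all forward-strand windows.

    Assumes the sequence contains only upper-case A, C, G and T.
    """
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


# Toy 3-bp motif that strongly prefers "ACG" (illustrative data only).
toy_counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
toy_pwm = counts_to_pwm(toy_counts)
print(best_window_score(toy_pwm, "TTACGTT"))
print(information_content(toy_counts))
```

Under these assumptions, a motif with high information content is not automatically a better predictor; the score only says how strongly each position departs from the background, which is why the abstract treats information content as insufficient on its own.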