Improved linking of motifs to their TFs using domain information.

Nina Baumgarten, Florian Schmidt, Marcel H Schulz, Bonnie Berger

doi:10.1093/bioinformatics/btz855

Abstract

Motivation: A central aim of molecular biology is to identify mechanisms of transcriptional regulation. Transcription factors (TFs), which are DNA-binding proteins, are highly involved in these processes, so it is crucial to know where TFs interact with DNA and what their DNA-binding motifs are. For that reason, computational tools exist that link DNA-binding motifs to TFs either without sequence information or based on TF-associated sequences, e.g. identified via a chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiment.

In this paper, we present MASSIF, a novel method to improve the performance of existing tools that link motifs to TFs relying on TF-associated sequences. MASSIF is based on the idea that a DNA-binding motif that is correctly linked to a TF should be assigned to a DNA-binding domain (DBD) similar to that of the mapped TF. Because DNA-binding motifs are in general not linked to DBDs, it is not possible to compare the DBD of a TF and the motif directly. Instead, we created a DBD collection, which consists of TFs with a known DBD and an associated motif. This collection enables us to evaluate how likely it is that a linked motif and a TF of interest are associated with the same DBD. We named this similarity measure the domain score and represent it as a P-value. We developed two different ways to improve the performance of existing tools that link motifs to TFs based on TF-associated sequences: (i) using meta-analysis to combine P-values from one or several of these tools with the P-value of the domain score and (ii) filtering unlikely motifs based on the domain score.

Results: We demonstrate the functionality of MASSIF on several human ChIP-seq datasets, using either motifs from the HOCOMOCO database or de novo identified ones as input motifs. In addition, we show that both variants of our method significantly improve the performance of tools that link motifs to TFs based on TF-associated sequences, independent of the considered DBD type.

Availability and implementation: MASSIF is freely available online at https://github.com/SchulzLab/MASSIF.

Supplementary information: Supplementary data are available at Bioinformatics online.
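The two variants described in the abstract can be illustrated with a minimal Python sketch. It is not the MASSIF implementation. The abstract does not name the meta-analysis technique, so Fisher's method is assumed here as one common way to combine P-values, and it additionally assumes the P-values are independent. The function names, the (motif, tool P-value, domain-score P-value) tuple layout, and the 0.05 cutoff are all hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent P-values with Fisher's method (assumed technique)."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher statistic X = -2 * sum(ln p) follows a chi-square
    # distribution with 2k degrees of freedom under the null.
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

def combine_with_domain_score(candidates):
    """Variant (i): rank motifs by the combined tool and domain-score P-value."""
    combined = [(motif, fisher_combine([tool_p, domain_p]))
                for motif, tool_p, domain_p in candidates]
    return sorted(combined, key=lambda pair: pair[1])

def filter_by_domain_score(candidates, cutoff=0.05):
    """Variant (ii): drop motifs whose domain-score P-value exceeds the
    (hypothetical) cutoff, then rank the rest by the tool's P-value."""
    kept = [(motif, tool_p)
            for motif, tool_p, domain_p in candidates
            if domain_p <= cutoff]
    return sorted(kept, key=lambda pair: pair[1])

# Hypothetical input: (motif id, tool P-value, domain-score P-value)
candidates = [
    ("motif_A", 0.01, 0.9),    # strong tool signal, DBD unlikely to match
    ("motif_B", 0.02, 0.001),  # slightly weaker tool signal, DBD matches
]
ranked = combine_with_domain_score(candidates)
filtered = filter_by_domain_score(candidates)
```

In this toy input the tool alone would rank `motif_A` first. Combining with the domain score moves `motif_B` to the top, and the filter removes `motif_A` entirely.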

Highlights

Transcription factors (TFs) are proteins that bind to DNA by recognizing specific DNA sequences with tertiary protein structures, so-called DNA-binding domains (DBDs) (Luscombe et al., 2000)
We developed a tool, called MASSIF, which improves the performance of existing tools that link motifs to TFs depending on TF-associated sequences by using the DBD of a TF to calculate a domain score
We compute the domain score between the linked motif and the set of motifs associated with the DBD of the TF, which we looked up in the DBD collection

Summary

Introduction

Transcription factors (TFs) are proteins that bind to DNA by recognizing specific DNA sequences with tertiary protein structures, so-called DNA-binding domains (DBDs) (Luscombe et al., 2000). Thereby, TFs can regulate transcription by building complexes with other proteins, e.g. RNA polymerases (Reiter et al., 2017). Recent studies suggest that TFs directly influence chromatin state (Swinstead et al., 2016). TFs are involved in many functional processes, e.g. maintaining the cell cycle, preserving and establishing specific cell types as well as inducing cell death (Vaquerizas et al., 2009). Deregulation of TFs, mutations in TFs, or mutations in TF-recognized sequences are the genetic trigger for many diseases (Deplancke et al., 2016). Further details are elaborated in Lambert et al. (2018).

Methods

Results

Conclusion