Abstract
We are now in an era where protein–DNA interactions have been experimentally assayed for thousands of DNA-binding proteins. In order to infer DNA-binding specificities from these data, numerous sophisticated computational methods have been developed. These approaches typically infer DNA-binding specificities by considering interactions for each protein independently, ignoring related and potentially valuable interaction information across other proteins that bind DNA via the same structural domain. Here we introduce a framework for inferring DNA-binding specificities by considering protein–DNA interactions for entire groups of structurally similar proteins simultaneously. We devise both constrained optimization and label propagation algorithms for this task, each balancing observations at the individual protein level against dataset-wide consistency of interaction preferences. We test our approaches on two large, independent Cys2His2 zinc finger protein–DNA interaction datasets. We demonstrate that jointly inferring specificities within each dataset individually dramatically improves accuracy, leading to increased agreement both between these two datasets and with a fixed external standard. Overall, our results suggest that sharing protein–DNA interaction information across structurally similar proteins is a powerful means to enable accurate inference of DNA-binding specificities.
Highlights
Proteins that bind DNA in a sequence-specific manner are involved in a wide range of functions in the cell, from transcriptional regulation to recombination.
If k is the length of the binding site for the protein, Sa is a 4 × k matrix where Sa[b, j] is the normalized frequency with which nucleotide b is observed in the j-th position of the aligned binding sites for protein a. Sa or Ca is usually determined by specialized computational approaches designed to analyze the data that specific types of experiments produce for protein a.
Since these corresponding position-specific weight matrix (PWM) columns reflect biologically repeated experiments, we expect high agreement; however, we observe that initial specificities agree for only 60% of columns, with a median per-column Pearson correlation coefficient (PCC) of 0.76.
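The two quantities named in the highlights above, the 4 × k frequency matrix Sa and the per-column PCC between corresponding PWM columns, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `frequency_matrix` and `column_pcc`, the nucleotide ordering A, C, G, T, and the toy binding sites are all assumptions made for illustration, and no pseudocounts are applied. The threshold at which two columns are said to "agree" is not stated here, so the sketch reports only the median PCC.

```python
import numpy as np

# Assumed row ordering of nucleotides; the source does not specify one.
NUCLEOTIDES = "ACGT"


def frequency_matrix(sites):
    """Build a 4 x k matrix S where S[b, j] is the normalized frequency
    of nucleotide b at position j of the aligned binding sites."""
    k = len(sites[0])
    if any(len(site) != k for site in sites):
        raise ValueError("all aligned binding sites must have the same length")
    counts = np.zeros((4, k))
    for site in sites:
        for j, base in enumerate(site):
            counts[NUCLEOTIDES.index(base), j] += 1
    # Each column is divided by the number of sites, so it sums to 1.
    return counts / counts.sum(axis=0, keepdims=True)


def column_pcc(s1, s2):
    """Pearson correlation between corresponding columns of two 4 x k
    matrices. A column with identical entries (uniform frequencies)
    has zero variance and yields NaN."""
    return np.array(
        [np.corrcoef(s1[:, j], s2[:, j])[0, 1] for j in range(s1.shape[1])]
    )


# Hypothetical aligned sites from two repeated experiments on one protein.
replicate_1 = ["ACG", "ACT", "AAG", "ACG"]
replicate_2 = ["ACG", "ACG", "ACT", "CCG"]

s_rep1 = frequency_matrix(replicate_1)
s_rep2 = frequency_matrix(replicate_2)
pccs = column_pcc(s_rep1, s_rep2)
print("per-column PCC:", np.round(pccs, 3))
print("median PCC:", round(float(np.median(pccs)), 3))
```

In practice, experimental counts are usually smoothed with pseudocounts before normalization, which also avoids zero-variance columns; that step is omitted here to keep the definition of Sa[b, j] visible.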
Summary
Proteins that bind DNA in a sequence-specific manner are involved in a wide range of functions in the cell, from transcriptional regulation to recombination. Since high-throughput measurements may be less accurate for some proteins than for others, we reasoned that simultaneously considering all observed interactions for large groups of proteins, while accounting for the similarity of their interfaces, would lead to more accurate estimation of DNA-binding specificities. Such an approach is of increasing value as DNA-binding interactions continue to be determined rapidly and systematic screens of large numbers of variants for a given DBD family become more common [3,4,30,31].