Alignment-free clustering of transcription factor binding motifs using a genetic-k-medoids approach.

Pilib Ó Broin,Terry J Smith,Aaron Aj Golden

doi:10.1186/s12859-015-0450-2

Abstract

Background: Familial binding profiles (FBPs) represent the average binding specificity for a group of structurally related DNA-binding proteins. The construction of such profiles allows the classification of novel motifs based on similarity to known families, can help to reduce redundancy in motif databases and de novo prediction algorithms, and can provide valuable insights into the evolution of binding sites. Many current approaches to automated motif clustering rely on progressive tree-based techniques, and can suffer from so-called frozen sub-alignments, where motifs which are clustered early on in the process remain ‘locked’ in place despite the potential for better placement at a later stage. In order to avoid this scenario, we have developed a genetic-k-medoids approach which allows motifs to move freely between clusters at any point in the clustering process.

Results: We demonstrate the performance of our algorithm, GMACS, on multiple benchmark motif datasets, comparing results obtained with current leading approaches. The first dataset includes 355 position weight matrices from the TRANSFAC database and indicates that the k-mer frequency vector approach used in GMACS outperforms other motif comparison techniques. We then cluster a set of 79 motifs from the JASPAR database previously used in several motif clustering studies and demonstrate that GMACS can produce a higher number of structurally homogeneous clusters than other methods without the need for a large number of singletons. Finally, we show the robustness of our algorithm to noise on multiple synthetic datasets consisting of known motifs convolved with varying degrees of noise.

Conclusions: Our proposed algorithm is generally applicable to any DNA or protein motifs, can produce highly stable and biologically meaningful clusters, and, by avoiding the problem of frozen sub-alignments, can provide improved results when compared with existing techniques on benchmark datasets.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0450-2) contains supplementary material, which is available to authorized users.
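
The abstract names a k-mer frequency vector approach but does not spell out how such a vector is built from a position weight matrix. As a minimal sketch of one plausible alignment-free construction (an assumption, not the GMACS implementation), the expected frequency of every k-mer can be averaged over all length-k windows of a column-normalized matrix. The function name `kmer_frequency_vector` and the A, C, G, T column order are illustrative choices.

```python
from itertools import product

BASES = "ACGT"  # assumed column order within each PWM position


def kmer_frequency_vector(pwm, k):
    """Expected k-mer frequencies for a PWM.

    pwm: list of positions, each a list of four probabilities (A, C, G, T)
         that sum to 1. Returns a list of 4**k values in lexicographic
         k-mer order (AA.., AC.., ..., TT..).
    """
    if k < 1 or k > len(pwm):
        raise ValueError("k must be between 1 and the motif length")
    kmers = ["".join(t) for t in product(BASES, repeat=k)]
    n_windows = len(pwm) - k + 1
    vector = []
    for kmer in kmers:
        total = 0.0
        for start in range(n_windows):
            # Probability that this window emits exactly this k-mer.
            prob = 1.0
            for offset, base in enumerate(kmer):
                prob *= pwm[start + offset][BASES.index(base)]
            total += prob
        # Average over windows so the vector sums to 1.
        vector.append(total / n_windows)
    return vector


# Hypothetical 3-position motif used only for illustration.
example_pwm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
example_vector = kmer_frequency_vector(example_pwm, 2)
```

Because each vector has a fixed length of 4^k regardless of motif width, two motifs can be compared with any vector distance without first aligning them, which is the property the abstract's title refers to as alignment-free.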

Highlights

Familial binding profiles (FBPs) represent the average binding specificity for a group of structurally related DNA-binding proteins
Biasing the search to transcription factors (TFs) from a particular structural family, or providing a way to filter out spurious patterns and thereby increasing sensitivity [3,5], ii) they can be used to classify novel binding proteins based on their similarity to the binding affinities of known structural families [6,7], iii) they can be used to reduce redundancy in motif databases where minor variations or submotifs from the same binding site are incorrectly labelled as separate motifs; this redundancy reduction can be applied to motif finding algorithms, either to merge similar motif predictions from a single algorithm or to combine results from multiple algorithms [8,9], and iv) they can be used to analyze binding site turnover and provide insights into how DNA-binding mechanisms have evolved over time [10]
Column scoring for alignment-based techniques can be based on metrics such as sum of squared distances (SSD), Pearson’s correlation coefficient (PCC), and average Kullback-Leibler (AKL) distance, many of which have previously been examined in detail [9,10]
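
To make the three column scores concrete, the following is a minimal sketch that compares two motif columns, each given as four base probabilities. It assumes AKL means the mean of the two directed Kullback-Leibler divergences, adds a small pseudocount to avoid taking the log of zero, and returns 0 for PCC when a column has no variance. These conventions and the function names are illustrative assumptions, not taken from the cited studies.

```python
import math


def ssd(p, q):
    """Sum of squared differences between two columns (0 means identical)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pcc(p, q):
    """Pearson correlation between two columns; 0.0 if either is constant."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0  # assumed convention for uniform columns
    return cov / math.sqrt(var_p * var_q)


def akl(p, q, pseudocount=1e-6):
    """Mean of KL(p||q) and KL(q||p) after pseudocount smoothing."""
    p = [a + pseudocount for a in p]
    q = [b + pseudocount for b in q]
    sum_p, sum_q = sum(p), sum(q)
    p = [a / sum_p for a in p]
    q = [b / sum_q for b in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Two hypothetical columns: one favouring A, one favouring C.
col_a = [0.7, 0.1, 0.1, 0.1]
col_c = [0.1, 0.7, 0.1, 0.1]
scores = {"SSD": ssd(col_a, col_c), "PCC": pcc(col_a, col_c), "AKL": akl(col_a, col_c)}
```

SSD and AKL are distances (lower means more similar), whereas PCC is a similarity (higher means more similar), so the sign convention has to be handled consistently when these scores are summed along an alignment.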

Summary

Results

Motif comparison: Our first dataset consists of 355 motifs from the six largest structural families in the TRANSFAC [31] database and has previously been used by [7,11,12] and [10] to benchmark retrieval accuracy.

The second TRP subfamily comprises the IRF1 and IRF2 motifs. While both STAMP and MoSta group these two motifs with the four DOF zinc-finger motifs as a single heterogeneous cluster, GMACS instead creates two homogeneous clusters. GMACS incorrectly clusters a single forkhead motif, FOXL1, with the five members of the MADS family, whereas STAMP and MoSta maintain the MADS group as a homogeneous cluster.

Once these increasingly noisy datasets had been generated, the clustering process was repeated ten times for each set and the resulting range of cluster homogeneity at each level of random signal incorporation was examined.
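
The text compares methods by how many structurally homogeneous clusters they produce but does not define the measure formally. One simple reading, sketched here as an assumption, treats a cluster as homogeneous when all of its members share the same structural family label, and ignores singletons. The names `DOF_a` to `DOF_d` are placeholders for the four DOF motifs, and the function names are hypothetical.

```python
def is_homogeneous(cluster, family_of):
    """True if every motif in the cluster belongs to one structural family."""
    return len({family_of[motif] for motif in cluster}) == 1


def count_homogeneous(clusters, family_of, min_size=2):
    """Count homogeneous clusters, skipping those smaller than min_size."""
    return sum(
        1
        for cluster in clusters
        if len(cluster) >= min_size and is_homogeneous(cluster, family_of)
    )


# Family labels; DOF motif names are placeholders, not real identifiers.
family_of = {
    "IRF1": "TRP", "IRF2": "TRP",
    "DOF_a": "DOF", "DOF_b": "DOF", "DOF_c": "DOF", "DOF_d": "DOF",
}

# One merged, heterogeneous cluster versus two family-specific clusters.
merged = [["IRF1", "IRF2", "DOF_a", "DOF_b", "DOF_c", "DOF_d"]]
split = [["IRF1", "IRF2"], ["DOF_a", "DOF_b", "DOF_c", "DOF_d"]]

merged_count = count_homogeneous(merged, family_of)
split_count = count_homogeneous(split, family_of)
```

Excluding singletons via `min_size` reflects the point made in the abstract that a method should not reach a high homogeneous-cluster count simply by leaving many motifs unclustered.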

Conclusions

Background

Discussion

Conclusion