Abstract

In transcription factor binding site (TFBS) motif discovery, the true width of the motif to be discovered is generally not known a priori. The ability to automatically determine the most likely width of a motif is therefore essential in motif discovery. However, this is a challenging problem because the model dimensionality changes with motif width. Existing general model selection criteria that incorporate adjustments for dimensionality have not performed well in this context. We validate a novel heuristic for automatically determining the width of an unknown motif, based on motif containment and information content (MCOIN). Tests on synthetic data and known E. coli TFBS motifs show that MCOIN outperforms the E-value of the resulting multiple alignment as a predictor of unknown motif width at higher levels of motif conservation, based on mean absolute error.

Introduction

Early attempts at a heuristic function to automatically determine motif width in a deterministic algorithm were based on the likelihood ratio test (LRT). Having run the motif discovery algorithm over a range of candidate widths, the heuristic function computes a score for the discovered model at each width based on the log likelihood and the number of free parameters, before choosing the model with the highest score. However, criticisms of the LRT heuristic include the naive assumption that the EM algorithm has converged to a global maximum likelihood and the ad hoc manner in which the p-value of the LRT statistic is modified in order to account for the number of model parameters [1]. In practice, estimators based on the E-value of the resulting multiple alignment are often used. The E-value is an approximate p-value that estimates the expected number of multiple alignments with statistical significance as great as or greater than that of the observed alignment [2]; it is calculated for the model at each candidate width and minimised to estimate motif width.
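As a rough illustration of scoring candidate widths from the log likelihood and the number of free parameters, the sketch below uses the standard BIC formula rather than the LRT heuristic discussed above. The function names, the example log likelihoods, and the assumption that a width-w PWM over DNA has 3w free parameters are all illustrative and are not taken from the source.

```python
import math


def bic_score(log_likelihood, n_free_params, n_observations):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values are preferred."""
    return n_free_params * math.log(n_observations) - 2.0 * log_likelihood


def select_width(log_likelihoods, n_observations):
    """Return (best_width, scores) for the candidate with the lowest BIC.

    log_likelihoods maps each candidate width to the maximised log
    likelihood of the model found at that width.
    Assumption: a width-w PWM has 3 * w free parameters, because each
    column holds four probabilities that must sum to one.
    """
    scores = {
        w: bic_score(ll, 3 * w, n_observations)
        for w, ll in log_likelihoods.items()
    }
    best_width = min(scores, key=scores.get)
    return best_width, scores


# Hypothetical log likelihoods for three candidate widths.
example = {6: -1000.0, 7: -980.0, 8: -978.0}
best, all_scores = select_width(example, n_observations=500)
```

With these made-up numbers, the gain in log likelihood from width 7 to width 8 is too small to offset the three extra parameters, so the width-7 model is preferred.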
Approach

If it is assumed that the motif discovery algorithm discovers the motif within the dataset 'perfectly' at each candidate width, the algorithm discovers the true motif exactly at the true width w*. At smaller widths, only a portion of the true motif is discovered; at longer widths, the full motif is discovered along with a number of background positions (see Figure 1).

Figure 1: The true motif (red) is discovered exactly when the tested width (w) is equal to the true width (w*) and is either partially discovered (w < w*) or discovered along with background positions (w > w*) at other tested widths.

If we know that the models for widths w - 1 and w describe the same motif, and also assume that a model selection criterion (e.g. BIC) would choose the shorter model because it has fewer free parameters, then the model with width w - 1 can be removed from the set of candidate models, since the width-w model describes the same underlying motif. MCOIN implements this by using the mean root Jensen-Shannon divergence [3] per column (JSD/col) of the position weight matrix (PWM) as a measure of model similarity. If the JSD/col for two models falls below a given threshold, the shorter model is discarded in favour of the longer. The mean information content per column of the PWM is used to ensure that shorter models are not discarded in favour of longer models containing background positions. The remaining model with the lowest BIC score is chosen as our best estimate of motif width.

Analysis and Results

MCOIN was evaluated on 20 datasets containing known E. coli TFBS motif occurrences. For each dataset, the motif discovery phase of the algorithm is run at all widths within ±4 of the true motif width. Performance of a heuristic on a dataset is assessed through site-level sensitivity (sSn) and positive predictive value (sPPV). Mean absolute error (MAE) indicates how well a heuristic estimates true motif width.
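The two PWM measures named under Approach, mean root Jensen-Shannon divergence per column and mean information content per column, can be sketched as follows. This is a minimal illustration under stated assumptions, not the MCOIN implementation: the source does not say how models of different widths are aligned, so the sketch slides the shorter PWM along the longer one and keeps the best offset, and it assumes a uniform background when computing information content. All function names are hypothetical.

```python
import math


def entropy_bits(p):
    """Shannon entropy of a probability vector, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def root_jsd(p, q):
    """Square root of the Jensen-Shannon divergence (base 2) of two columns."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    jsd = entropy_bits(m) - (entropy_bits(p) + entropy_bits(q)) / 2
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


def mean_root_jsd_per_col(short_pwm, long_pwm):
    """Mean root JSD per column at the best-matching offset.

    Assumption: the shorter PWM is compared against every ungapped
    placement within the longer PWM and the minimum is returned.
    """
    w = len(short_pwm)
    best = float("inf")
    for offset in range(len(long_pwm) - w + 1):
        total = sum(
            root_jsd(short_pwm[i], long_pwm[offset + i]) for i in range(w)
        )
        best = min(best, total / w)
    return best


def mean_information_content(pwm):
    """Mean information content per column in bits.

    Assumption: uniform background over A, C, G, T, so each column
    contributes 2 - H(column).
    """
    return sum(2.0 - entropy_bits(col) for col in pwm) / len(pwm)


# Toy PWMs: rows are columns of the motif, entries are P(A), P(C), P(G), P(T).
short = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
long_ = [[0.25, 0.25, 0.25, 0.25]] + short  # same motif plus one background column
```

In this toy case the shorter model is fully contained in the longer one, so the JSD/col at the best offset is zero, while the extra background column lowers the longer model's mean information content. That drop is the kind of signal the information content check is described as using to avoid preferring longer models padded with background positions.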
           MAE    sSn    sPPV
MCOIN      1.95   0.71   0.38
E-value    2.95   0.68   0.23

Table 1: Mean absolute error (MAE), mean site-level sensitivity (sSn) and positive predictive value (sPPV) results for 20 datasets containing known E. coli TFBS motifs (mean motif conservation: 1.13 bits/col).

Results of tests on datasets containing realistic synthetic motifs (not shown) and known E. coli motifs (see Table 1) show that MCOIN outperforms the E-value of the resulting multiple alignment as a predictor of unknown motif width, based on mean absolute error. MCOIN also has clear advantages over methods based on cross-validation with limited numbers of folds, as proposed in [1]. Results of experiments which removed the motif discovery phase of the algorithm show that the performance of MCOIN will improve as the performance of the core motif discovery algorithm improves (see Table 2).

Conservation (bits/col)   2.00   1.49   1.08   0.76   0.51
MAE                       0.00   0.00   0.00   0.01   0.07

Table 2: Mean absolute error (MAE) results for collections of 1,000 synthetic datasets where the candidate models at each width are perfect (equivalent to removing the motif discovery phase of the algorithm), at varying conservation levels. As the core motif discovery algorithm is improved, the error in the width estimated by MCOIN will decrease.

Future work

We have implemented MCOIN and are incorporating it within an improved TFBS motif discovery algorithm. Future work will test MCOIN on additional real data and investigate methods for optimising the required matrix manipulation.
