Abstract
Background: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered.
Results: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and the regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies.
To ensure higher confidence in our approach, we applied jackknife and bootstrap resampling tests for the comparison; optimized PWMs and SiteGA showed similar recognition performance. We then applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for binding sites (BSs) using the web tool, SiteGA.
Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions. When SiteGA and optimized PWM models were applied together, false positives were substantially reduced, at least at higher stringencies.
Conclusion: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
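
The jackknife test mentioned in the abstract can be illustrated with a minimal leave-one-out sketch for a mononucleotide PWM. This is not the authors' implementation: the toy sites, the random background set, the uniform base background, the pseudocount, and the function names (`build_pwm`, `jackknife_miss_rate`) are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Log-odds mononucleotide PWM against a uniform background (an assumption)."""
    length = len(sites[0])
    weights = []
    for i in range(length):
        column = [s[i] for s in sites]
        total = len(column) + 4 * pseudocount
        weights.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / 0.25)
            for b in BASES
        })
    return weights

def score(weights, seq):
    """Sum of per-position log-odds weights for a sequence of PWM length."""
    return sum(w[b] for w, b in zip(weights, seq))

def jackknife_miss_rate(sites, background, fp_rate=0.05):
    """Leave-one-out test: train on n-1 sites, test on the held-out site.

    In each fold the threshold is the k-th highest background score, with
    k = max(1, int(fp_rate * len(background))). A held-out site counts as
    recognized only if it scores strictly above that threshold. Returns the
    fraction of held-out sites that were missed.
    """
    missed = 0
    for i, held_out in enumerate(sites):
        training = sites[:i] + sites[i + 1:]
        weights = build_pwm(training)
        bg_scores = sorted((score(weights, b) for b in background), reverse=True)
        k = max(1, int(fp_rate * len(bg_scores)))
        threshold = bg_scores[k - 1]
        if score(weights, held_out) <= threshold:
            missed += 1
    return missed / len(sites)

# Toy, made-up aligned sites and a seeded random background (illustrative only).
SITES = ["GGGACTTTCC", "GGGAATTTCC", "GGGGCTTTCC", "GGGACTTCCC", "GGAACTTTCC"]
rng = random.Random(0)
BACKGROUND = ["".join(rng.choice(BASES) for _ in range(10)) for _ in range(200)]

print(jackknife_miss_rate(SITES, BACKGROUND))
```

Holding each site out of training keeps the test site from inflating its own score, which is the point of the resampling check. A bootstrap variant would instead draw training sets with replacement.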
Highlights
Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data.
Based on this analysis, SiteGA adds substantial specificity even to optimized Position Weight Matrix (PWM) models and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and Eukaryotic Promoter Database (EPD) analysis has led to a list of genes which appear to be regulated by the above transcription factors (TFs).
We found that motifs should be longer than 20 bases for lower false-positive (FP) rates, and that dinucleotide PWMs generally slightly outperformed mononucleotide PWMs for all TFs except SREBP.
Summary
Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. Current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered. Transcription factors (TFs) function by binding to recognition sites in gene regulatory regions. TFs are often members of multimolecular complexes to which the DNA binds through further sequence and structural features. Where TFBS sequences are known, one can try to search for similar sequences computationally. These binding sites (BSs) are often represented by a consensus, which is simply a pattern of the bases that occur at specific positions in a site. A consensus representation has limited use for even moderately variable BSs, because it preserves little or no information about nucleotide variability.
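
The limitation of a consensus can be made concrete with a small sketch that builds both a consensus and a log-odds PWM from the same aligned sites. The sites below are made up for illustration, and the uniform background, the pseudocount, and the function names are assumptions rather than details taken from the source.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Per-position base counts for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]

def consensus(counts):
    """Most frequent base at each position (ties go to the earlier base in BASES)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in counts)

def pwm(counts, pseudocount=1.0, background=0.25):
    """Log-odds weights per position; the pseudocount avoids log(0)."""
    weights = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        weights.append({
            b: math.log2(((col[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return weights

def score(weights, seq):
    """Sum of per-position log-odds weights for a sequence of PWM length."""
    return sum(w[b] for w, b in zip(weights, seq))

# Toy, made-up aligned sites (illustrative only).
SITES = ["GGGACTTTCC", "GGGAATTTCC", "GGGGCTTTCC", "GGGACTTCCC", "GGAACTTTCC"]

counts = count_matrix(SITES)
weights = pwm(counts)
print(consensus(counts))
for candidate in ("GGGACTTTCC", "GGGAATTTCC", "GGGAGTTTCC"):
    print(candidate, round(score(weights, candidate), 2))
```

Exact matching against the consensus treats the second and third candidates identically, since each differs from it at one position. The PWM ranks the second higher because an A at that position was observed in the training sites while a G was not.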