Abstract

We describe a sequence-based computational model to predict DNA G-quadruplex (G4) formation. The model was developed using large-scale machine learning from an extensive experimental G4-formation dataset, recently obtained for the human genome via G4-seq methodology. Our model differentiates many widely accepted putative quadruplex sequences that do not actually form stable genomic G4 structures, correctly assessing the G4 folding potential of over 700,000 such sequences in the human genome. Moreover, our approach reveals the relative importance of sequence-based features coming from both within the G4 motifs and their flanking regions. The developed model can be applied to any DNA sequence or genome to characterise sequence-driven intramolecular G4 formation propensities.

Highlights

  • G-quadruplex structures (G4s) are alternative DNA conformations with an increasing body of evidence for their functional role and influence in living cells[1,2,3,4,5]

  • The overlap between G4 structures experimentally observed by G4-seq and putative quadruplex sequences (PQSs), which are identified by a bioinformatics motif search in the human genome (Fig. 2A), indicates that simple computational methods yield many sequences that do not form stable G4s, despite possessing the canonical set of four G-tracts (Fig. 1)

  • A large fraction of PQSs (Fig. 2) were still not detected as stable G4s under either motif definition (46.37% and 50.96% for the stringent Quadparser and extended PQS definitions, respectively), while the extended PQS definition covered a greater fraction of experimentally observed G4s (65.56% for the extended vs. 36.86% for the stringent PQS definition)
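
To make the notion of a motif-based PQS search concrete, the following is a minimal sketch of a stringent, Quadparser-style scan. It assumes the commonly cited pattern of four G-tracts of at least three guanines separated by loops of 1 to 7 nucleotides; the exact stringent and extended definitions used in this work are not given in this excerpt. The names `PQS_PATTERN` and `find_pqs` are hypothetical, and only the G-rich strand of the input is scanned.

```python
import re

# Assumption: stringent Quadparser-style motif, G{3+} N{1-7} G{3+} N{1-7} G{3+} N{1-7} G{3+}.
# This is an illustrative pattern, not necessarily the definition used in the paper.
PQS_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_pqs(seq):
    """Return (start, end, motif) for each non-overlapping PQS match.

    Coordinates are 0-based and end-exclusive. Only the given strand
    is scanned; C-rich motifs on the opposite strand are not reported.
    """
    seq = seq.upper()
    return [(m.start(), m.end(), m.group()) for m in PQS_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Four G-tracts with single-nucleotide loops: matches the motif.
    print(find_pqs("TTGGGAGGGAGGGAGGGTT"))
    # Last tract has only two guanines: no match.
    print(find_pqs("TTGGGAGGGAGGGAGGTT"))
```

A match from such a scan only shows that a sequence has the canonical four G-tracts; as the highlights note, many such sequences are not observed as stable G4s experimentally.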

Summary

Introduction

G-quadruplex structures (G4s) are alternative DNA conformations with an increasing body of evidence for their functional role and influence in living cells[1,2,3,4,5]. While attempts have been made to address stability scoring in such motifs, the current models rely either on simple characteristics (lengths of the G-tracts, the loop sequences, G-skewness) or on biophysical measurements for short sequences that lack their wider genomic context[1,12,13,14,15,16]. The absence of large biophysical datasets for G4-forming sequences has hitherto precluded a more complete sequence-based model for G4 stability. Given the scale of the available G4-seq dataset and the recent success of large-scale machine learning approaches in deciphering complex genomic dependencies[17,18,19], we sought to develop a machine learning procedure to build a G4-formation model based on a multitude of sequence-only features (see Methods, Supporting Information Figures S1–S9). The major challenge here is combining high sensitivity with high specificity, which we solve here for the clearly defined and major part of the universe of G4-forming sequences.
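
As an illustration of the simple characteristics mentioned above (G-tract lengths, loop lengths and G-skewness), the sketch below computes them for a single sequence. It is not the feature set of the model described here, which is detailed in the Methods. The function name `simple_g4_features`, the minimum tract length of three guanines, and the skew formula (G − C)/(G + C) are assumptions made for the example.

```python
import re


def simple_g4_features(seq, min_tract=3):
    """Compute a few simple sequence-only descriptors (illustrative only).

    Assumptions: a G-tract is a run of at least `min_tract` guanines,
    a loop is the gap between consecutive G-tracts, and G-skewness is
    (G - C) / (G + C) over the whole sequence.
    """
    seq = seq.upper()
    # Locate every maximal run of G with the required minimum length.
    tracts = [(m.start(), m.end()) for m in re.finditer(r"G{%d,}" % min_tract, seq)]
    tract_lengths = [end - start for start, end in tracts]
    # Loop length = distance from the end of one tract to the start of the next.
    loop_lengths = [tracts[i + 1][0] - tracts[i][1] for i in range(len(tracts) - 1)]
    g, c = seq.count("G"), seq.count("C")
    g_skew = (g - c) / (g + c) if (g + c) else 0.0
    return {
        "tract_lengths": tract_lengths,
        "loop_lengths": loop_lengths,
        "g_skew": g_skew,
    }


if __name__ == "__main__":
    print(simple_g4_features("GGGACGGGGTTAGGGCGGG"))
```

Descriptors of this kind summarise only the motif itself; they carry no information about flanking regions, which the abstract identifies as also contributing to G4 formation.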
