Abstract

We introduce and study a set of training-free methods, based on information theory and algorithmic complexity, and apply them to DNA sequences to assess their potential to identify nucleosomal binding sites. We test the measures on well-studied genomic sequences of different sizes drawn from different sources. The measures reveal the known in vivo versus in vitro predictive discrepancies and show potential to pinpoint high and low nucleosome occupancy. We explore different possible signals within and beyond the nucleosome length and find that the complexity indices are informative of nucleosome occupancy. We find that, while the gold-standard Kaplan model is clearly driven by GC content (by design) and by k-mer training, entropy and complexity-based scores are also informative for high occupancy and can complement the Kaplan model.

Highlights

  • DNA in the cell is organised into a compact form, called chromatin [1]

  • To study the extent to which some signals contribute to the determination of nucleosome occupancy, we applied some basic transformations to the original genomic DNA sequence

  • The SW transformation captures GC content, which clearly drives most of the nucleosome occupancy, but the correlation with the RY transformation, which discards all GC content, is notable
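
The two transformations named in the highlights can be sketched as follows. This is a minimal illustration, assuming the standard IUPAC groupings (S = strong bases G/C, W = weak bases A/T, R = purines A/G, Y = pyrimidines C/T). The function names `sw_transform`, `ry_transform` and `gc_content` are hypothetical and are not taken from the paper.

```python
# Assumed IUPAC two-letter reductions of the DNA alphabet.
SW_MAP = str.maketrans("ACGT", "WSSW")  # S = strong (G, C), W = weak (A, T)
RY_MAP = str.maketrans("ACGT", "RYRY")  # R = purine (A, G), Y = pyrimidine (C, T)


def sw_transform(seq: str) -> str:
    """Map each base to S or W; the fraction of S equals the GC content."""
    return seq.upper().translate(SW_MAP)


def ry_transform(seq: str) -> str:
    """Map each base to R or Y; G and A (and C and T) become indistinguishable."""
    return seq.upper().translate(RY_MAP)


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Under these mappings, a GC-only sequence such as `GGCC` and an AT-only sequence such as `AATT` produce the same RY string, which is why a signal that survives the RY transformation cannot be attributed to GC content.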


Summary

Introduction

DNA in the cell is organised into a compact form, called chromatin [1]. One level of chromatin organisation consists of DNA wrapped around histone proteins, forming nucleosomes [2]. A nucleosome is a basic unit of DNA packaging. The location of low nucleosomal occupancy is key to understanding active regulatory elements and genetic regulation that is not directly encoded in the genome but rather in a structural layer of information. The structural organisation of DNA in the chromosomes is widely known to be heavily driven by GC content [3], although k-mer approaches have been found to increase predictive power [4,5,6].

Local and Information-theoretic approaches to genomic profiling
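
As one illustration of a local, information-theoretic profile of a sequence, the sketch below computes Shannon entropy over the k-mer distribution within a sliding window. It is not the paper's implementation: the helper names `shannon_entropy` and `sliding_entropy`, the default window of 147 bp (a commonly quoted nucleosome length) and the step size are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy, in bits, of the overlapping k-mer distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(seq: str, window: int = 147, step: int = 1, k: int = 1) -> list[float]:
    """Entropy of each window of the given length, moved along seq by `step`."""
    # window=147 is an assumed default, not a value stated in the text above.
    return [
        shannon_entropy(seq[i:i + window], k)
        for i in range(0, len(seq) - window + 1, step)
    ]
```

A profile produced this way needs no training data, and it can be compared position by position against an occupancy track. Low values mark repetitive or compositionally biased windows; high values mark windows in which the k-mers are close to uniformly distributed.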

Methods
Results
Conclusion
