Abstract

CpG islands (CGIs) are regions with high GC and CpG content, whereas mammalian genomes are generally CpG-depleted. CGIs are often located in gene promoter regions, mostly of housekeeping genes but also of tissue-specific ones. It is widely believed that CpG dinucleotides within promoter CGIs are unmethylated and are targets for binding by specific regulatory proteins. Accordingly, CGIs contain sequence motifs for high-affinity protein binding (transcription factor binding sites, TFBS). Methylation of cytosine in a CpG context within such motifs can decrease the affinity of TF binding, increase the attraction of methyl-binding proteins, and affect histone modification, and can therefore lead to repression of gene transcription. Local and global transcriptional repression via CpG methylation is used in many normal processes (development, differentiation, aging, X-chromosome inactivation, imprinting) and pathological processes (cancer and other diseases). However, a class of normally methylated but active promoters has recently been reported. Evidence is also emerging for the biological relevance of methylated CGIs and of CGIs located far from gene promoters. Such CGIs could act as regulators of pervasive transcription, which appears to be a genuine genome feature rather than a side effect of errors in high-throughput techniques. Replication origins are also reported to be associated with CGIs regardless of their location. As a consequence of their specific nucleotide content, CGIs could affect DNA or RNA secondary structure. For example, the G2-3C2-3 motif, common within CGIs, induces significant local curvature of DNA. Another motif, the G-rich sequence (GRS) in the 3’ and 5’ regions of RNA, is known to form specific structures, G-quadruplexes, at both ends of the RNA, which play an important role in its stability. This motif corresponds to a C-rich sequence in DNA and is likely to appear in CGIs.
Classical algorithms for CpG island search use a sliding window (SWM) or a running sum (RSM) together with several distinct but not independent criteria (GC content, Obs/Exp CpG ratio, and length). The thresholds for these criteria are rather arbitrary, are not consistent across species, and lack a biological interpretation. SWM algorithms are rather slow; RSM algorithms are faster but tend to split large CGIs into several smaller ones and to miss CGIs with a nonuniform distribution of CpG dinucleotides along the sequence. Recently, several algorithms based on clustering of CpG dinucleotides have been implemented. These algorithms have fewer parameters and a sound mathematical basis. Comparing the algorithms is difficult. Hypermutability of CpG dinucleotides leads to loss of
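
The sliding-window criteria mentioned above (GC content, Obs/Exp CpG, and length) can be illustrated with a minimal Python sketch. It is not the implementation of any particular published tool. The function names are hypothetical, and the default thresholds (200 bp window, GC content of at least 0.5, Obs/Exp CpG of at least 0.6) are assumed here as commonly cited values, not taken from this text. Obs/Exp CpG is computed as (number of CpG × window length) / (number of C × number of G).

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def obs_exp_cpg(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def sliding_window_hits(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Start positions of windows passing both thresholds (assumed defaults)."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_content(w) >= min_gc and obs_exp_cpg(w) >= min_oe:
            hits.append(start)
    return hits


def merge_windows(starts, window):
    """Merge overlapping or adjacent passing windows into (start, end) intervals."""
    islands = []
    for s in starts:
        if islands and s <= islands[-1][1]:
            islands[-1] = (islands[-1][0], max(islands[-1][1], s + window))
        else:
            islands.append((s, s + window))
    return islands


if __name__ == "__main__":
    # Toy sequence: a CpG-rich stretch followed by an AT-only stretch.
    toy = "CG" * 100 + "AT" * 100
    print(merge_windows(sliding_window_hits(toy), 200))
```

Every window is rescanned from scratch here, which keeps the sketch short and also shows why sliding-window methods are slow on whole genomes; the result depends directly on the chosen thresholds.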
