Abstract

Background

Regions with abundant GC nucleotides, a high CpG number, and a length greater than 200 bp in a genome are often referred to as CpG islands. These islands are usually located at the 5′ end of genes. Recently, several algorithms for the prediction of CpG islands have been proposed.

Methodology/Principal Findings

We propose a new method called CPSORL, which combines a complement particle swarm optimization algorithm with reinforcement learning to predict CpG islands more reliably. Several CpG island prediction tools based on the sliding window technique have been developed previously. However, the quality of their results depends heavily on the window sizes chosen, so these methods leave room for improvement.

Conclusions/Significance

Experimental results indicate that CPSORL achieves a higher sensitivity and a higher correlation coefficient in all selected experimental contigs than the methods it was compared to (CpGIS, CpGcluster, CpGProD and CpGPlot). CPSORL identified more CpG islands in chromosomes 21 and 22 of the human genome than the other methods from the literature, and it also achieved the highest coverage rate (3.4%). CPSORL can also be used to identify promoter and transcription start site (TSS) regions associated with CpG islands in the entire human genome. Compared to CpGcluster, the islands predicted by CPSORL covered a larger portion of the TSS (12.2%) and promoter (26.1%) regions. If Alu sequences are considered, the islands predicted by CPSORL (Alu) covered a larger portion of the TSS (40.5%) and promoter (67.8%) regions than those predicted by CpGIS. Furthermore, CPSORL was used to verify that the average methylation density was 5.33% for CpG islands in the entire human genome.

Highlights

  • CpG islands are short sequences with a high concentration of the two nucleotides cytosine (C) and guanine (G)

  • Various algorithms have been adopted in the literature to predict CpG islands, e.g., CpGIS [3], CpGPlot [4], CpGProD [5] and CpGcluster [6]. Most of these tools use the sliding window technique with GC content, O/E ratio and length thresholds as the main parameters, whereas CpGcluster uses the distance between CpG dinucleotides

  • In this study we propose a new prediction method called CPSORL, which combines complementary particle swarm optimization (CPSO) with the reinforcement learning (RL) method to predict CpG islands in the human genome

Summary

Introduction

CpG islands are short sequences with a high concentration of the two nucleotides cytosine (C) and guanine (G). Since biological experiments have shown that a region classified as a CpG island can contain two Alu sequences, Takai and Jones revised the Gardiner-Garden and Frommer (GGF) criteria for CpG islands in 2002 [3]. Their modified definition requires a minimum length of 500 bp for the candidate region, a GC content of at least 55%, and an observed/expected (O/E) CpG ratio of at least 0.65. Alu sequences are highly repetitive short interspersed elements with a consensus sequence of about 280 bp. Some of these sequences have a relatively high GC content and O/E ratio [2,3]. Several algorithms for the prediction of CpG islands have been proposed.
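To make the thresholds above concrete, the following minimal Python sketch checks a single candidate region against the Takai and Jones criteria quoted in the text (length of at least 500 bp, GC content of at least 55%, O/E ratio of at least 0.65). The function names are hypothetical, and the O/E ratio is assumed to follow the commonly used form (number of CpG × N) / (number of C × number of G), which this summary does not spell out. This is not the CPSORL method itself, only an illustration of the criteria that threshold-based tools evaluate.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_oe_ratio(seq: str) -> float:
    """Observed/expected CpG ratio.

    Assumed formula: (#CpG * N) / (#C * #G), where N is the region length.
    Returns 0.0 when the region has no C or no G (ratio undefined).
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself
    return cpg_count * len(seq) / (c_count * g_count)


def meets_takai_jones(seq: str,
                      min_len: int = 500,
                      min_gc: float = 0.55,
                      min_oe: float = 0.65) -> bool:
    """Check one candidate region against the thresholds quoted in the text."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and cpg_oe_ratio(seq) >= min_oe)


if __name__ == "__main__":
    candidate = "CG" * 250   # artificial 500 bp region, for illustration only
    print(meets_takai_jones(candidate))
```

Sliding-window tools apply a check of this kind to successive windows of the sequence, which is why their output depends on the window size chosen.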
