Abstract

Background: DNA methylation plays an important role in multiple biological processes that are closely related to human health. The study of DNA methylation can provide insight into the mechanisms behind human health and can also support the assessment of human health status. However, the available sequencing technology is limited by incomplete CpG coverage. Therefore, it is crucial to find an efficient and convenient method capable of distinguishing between the states of CpG sites. Previous studies focused on identifying methylation states of the CpG sites in single cells and evaluated only sequence information or structural information.

Results: In this paper, we propose a novel model, LightCpG, which combines positional features with sequence and structural features to provide information on the CpG sites at two stages. We used the LightGBM model to train the CpG site classifier, and further used sample extraction and merged features to reduce the training time. Our results indicate that our method achieves outstanding performance in recognition of DNA methylation. The average AUC values of our method on the 25 human hepatocellular carcinoma (HCC) cell datasets and six human hepatoblastoma-derived (HepG2) cell datasets were 0.9616 and 0.9213, respectively. Moreover, the average training times for our method on the HCC and HepG2 datasets were 8.3 and 5.06 s, respectively. Furthermore, the computational complexity of our model was much lower than that of other available methods that detect methylation states of the CpG sites.

Conclusions: In summary, LightCpG is an accurate model for identifying the DNA methylation status of CpG sites in single cells. Furthermore, the three types of feature extraction methods and two strategies used in LightCpG may be helpful for other prediction problems.
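The average AUC values reported above are means over per-cell datasets. As a rough illustration only, the sketch below computes AUC for one cell as the probability that a randomly chosen methylated site receives a higher score than a randomly chosen unmethylated one, then averages across cells. The function names `auc` and `mean_auc` and the toy labels and scores are assumptions made for this example; they are not taken from the paper, and the quadratic pairwise loop is meant for small inputs, not for genome-scale data.

```python
def auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC: fraction of positive/negative pairs
    ranked correctly, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one site of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(per_cell):
    """Average AUC over a list of (labels, scores) pairs, one per cell."""
    values = [auc(labels, scores) for labels, scores in per_cell]
    return sum(values) / len(values)


# Hypothetical toy data: two "cells", four CpG sites each.
cells = [
    ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]),
    ([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]),
]
print(mean_auc(cells))
```

Averaging per-cell AUCs, rather than pooling all sites, keeps a cell with many covered sites from dominating the summary figure.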

Highlights

  • DNA methylation plays an important role in multiple biological processes that are closely related to human health

  • Many previous studies [13,14,15,16] have demonstrated that the sequence of neighboring nucleotides of one methylation site is specific and that the methylation state is closely related to the sequence information, which allows for the prediction of the methylation state only based on the sequence composition

  • We applied the LightGBM model to train the classifier

  • We downloaded two benchmark datasets, Homo sapiens GM12878 (ENCFF001TLS) and heart left ventricle (ENCFF001TKC), which were extracted by reduced representation bisulfite sequencing (RRBS) from ENCODE [37, 38]


Summary

Introduction

Jiang et al. BMC Genomics (2019) 20:306

DNA methylation plays an important role in multiple biological processes that are closely related to human health. Previous studies focused on identifying methylation states of the CpG sites in single cells and evaluated only sequence information or structural information. DNA methylation can affect the functional state of regulatory regions, DNA replication, and gene transcription. These functions are closely related to many human diseases, including malignant tumors [...] efficient computational methods to identify DNA methylation is very important and is critical to making methylation predictions more reliable [12]. Many previous studies [13,14,15,16] have demonstrated that the sequence of nucleotides neighboring a methylation site is specific and that the methylation state is closely related to the sequence information, which allows the methylation state to be predicted from the sequence composition alone. Pan et al. [18] employed n-grams, multivariate mutual information [19], the Discrete Wavelet Transform [20], and Pseudo Amino Acid Composition [21] to extract DNA sequence features with a window size of 100 bp.
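To make the idea of sequence-based features concrete, the following is a minimal sketch of n-gram counting in a window centred on a CpG site. It is not the feature extractor used by Pan et al. or by LightCpG: the function name `ngram_features`, the choice of n = 2, the frequency normalisation, and the handling of ambiguous bases are all assumptions made for illustration. Only the 100 bp default window size comes from the text.

```python
from itertools import product


def ngram_features(sequence, center, window=100, n=2):
    """Return normalised counts of overlapping n-grams in a window
    centred on position `center` (0-based) of `sequence`.

    The output is a fixed-length vector ordered alphabetically by n-gram
    (AA, AC, AG, ... for n=2), so vectors from different sites line up.
    """
    half = window // 2
    start = max(0, center - half)              # clip at the sequence start
    end = min(len(sequence), center + half)    # clip at the sequence end
    region = sequence[start:end].upper()

    counts = {"".join(p): 0 for p in product("ACGT", repeat=n)}
    for i in range(len(region) - n + 1):
        gram = region[i:i + n]
        if gram in counts:                     # skip n-grams containing N etc.
            counts[gram] += 1

    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in sorted(counts)]


# Hypothetical toy sequence; a real window would span about 100 bp.
feats = ngram_features("ACGTACGT", center=4, window=8, n=2)
print(len(feats))
```

A vector like this could be concatenated with other feature types and passed to a gradient-boosting classifier; that step is left out here to keep the sketch dependency-free.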
