Abstract

Background: The origin is the starting site of DNA replication, a vital part of the inheritance of genetic information between parents and offspring. More importantly, accurately identifying the origin of replication has great application value in the diagnosis and treatment of diseases related to genetic information errors, whereas traditional biological experimental methods are time-consuming and laborious.

Results: We carried out research on the origin of replication in a variety of eukaryotes and proposed a dedicated prediction method for each species. We collected data from 7 species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Kluyveromyces lactis, Pichia pastoris and Schizosaccharomyces pombe. In addition to the commonly used sequence feature extraction methods PseKNC-II and Base-content, we designed a feature extraction method based on TF-IDF. A two-step method was then used for feature selection. After comparing a variety of traditional machine learning classification models, the multi-layer perceptron was chosen as the classification algorithm. The data and code used in the experiments are available at https://github.com/Sarahyouzi/EukOriginPredict.

Conclusions: After 100 rounds of five-fold cross-validation, the prediction accuracies on the training sets of the seven species listed above reached 92.60%, 90.80%, 91.22%, 96.15%, 96.72%, 99.86% and 96.72%, respectively. This indicates that, compared with other methods, the methods we designed could achieve superior performance. In addition, our experiments reveal that the models of multiple species could predict each other with high accuracy, and the STREME results show that the species share a certain common motif.
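The TF-IDF-based feature extraction mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each sequence is treated as a document and its overlapping k-mers as terms, with tf taken as the relative k-mer frequency and idf = log(N / df). The exact weighting used by the authors may differ, and the function names `kmers` and `tfidf_features` are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the overlapping k-mers of a DNA sequence (upper-cased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_features(sequences, k=2):
    """Map each sequence to a TF-IDF vector over all 4**k possible k-mers.

    Assumed weighting (not confirmed by the source):
      tf  = count of the k-mer / number of k-mers in the sequence
      idf = log(N / df), where df is the number of sequences containing it
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = [Counter(kmers(s, k)) for s in sequences]
    n = len(sequences)

    # Document frequency: in how many sequences does each k-mer occur?
    df = {w: sum(1 for c in counts if c[w] > 0) for w in vocab}
    # k-mers that never occur get idf 0 to avoid division by zero.
    idf = {w: math.log(n / df[w]) if df[w] else 0.0 for w in vocab}

    vectors = []
    for c in counts:
        # Only k-mers made of A, C, G, T are counted towards the total.
        total = sum(c[w] for w in vocab) or 1
        vectors.append([c[w] / total * idf[w] for w in vocab])
    return vocab, vectors
```

For example, `tfidf_features(["AAAA", "ACGT"], k=2)` returns one 16-dimensional vector per sequence. Under this idf definition, a k-mer that occurs in every sequence receives weight zero, so only k-mers that discriminate between sequences contribute.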

Highlights

  • The origin is the starting site of DNA replication, an extremely vital part of the informational inheritance between parents and children

  • When the number of features is small, F-score-based feature selection performs better; as the number of features increases, TF-IDF-based feature selection performs better

  • For species such as H. sapiens, M. musculus and D. melanogaster, utilizing TF-IDF can achieve the best feature selection effect, while A. thaliana, P. pastoris, S. pombe and K. lactis are more suitable for F-score
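The F-score ranking referred to in the highlights can be sketched as follows. This assumes the F-score definition commonly used for feature selection (between-class separation of the feature means divided by the sum of the within-class sample variances) and binary labels with 1 for origin and 0 for non-origin sequences. Whether the authors use exactly this variant is not stated here, and the names `f_score` and `rank_features` are hypothetical.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in positive and negative samples.

    Each class needs at least two samples so the sample variance is defined.
    """
    all_vals = pos + neg
    mean_all = sum(all_vals) / len(all_vals)
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)

    # Between-class separation of the class means from the overall mean.
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Within-class sample variances (n - 1 in the denominator).
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1)
    denominator = var_pos + var_neg

    if denominator == 0:
        # No within-class spread: perfectly separating or constant feature.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F-score."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

A ranking like this could feed a second step in which features are added in rank order and the classifier is re-evaluated at each size, which is one plausible reading of how the best-performing criterion can change with the number of features.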

Summary

Introduction

The origin is the starting site of DNA replication, a vital part of the inheritance of genetic information between parents and offspring. In 2004, Cozzarelli's group [3] predicted replication origins in Saccharomyces cerevisiae by exploiting the fact that replication initiation regions are rich in AT bases. In 2012, Chen et al. [4] studied the replication initiation sites of Saccharomyces cerevisiae by calculating the bending degree and cleavage intensity of the DNA sequence, an approach that is highly effective for identifying positive samples. In 2019, Dao et al. [8] collected data from a variety of eukaryotes; using features such as k-mers and an SVM classifier, they conducted a complete study of each organism and made some progress. Further work is still needed to improve classification accuracy.
