Abstract

With the rapid development of high-speed sequencing technologies and the implementation of many whole-genome sequencing projects, research in genomics is advancing from genome sequencing to genome synthesis. Synthetic biology technologies such as DNA-based molecular assembly, genome editing, directed evolution, and DNA storage are emerging in succession. In particular, the rapid development of DNA assembly technology may greatly advance efforts toward artificial life. DNA assembly, however, requires a large number of target sequences with known information as supporting data. Non-coding DNA (ncDNA) sequences make up most of an organism's genome, so recognizing them accurately is necessary. Although experimental methods have been proposed to detect ncDNA sequences, they are expensive for genome-wide detection. It is therefore necessary to develop machine-learning methods for predicting ncDNA sequences. In this study, we collected an ncDNA benchmark dataset of Saccharomyces cerevisiae and developed a support vector machine-based predictor, called Sc-ncDNAPred, for predicting ncDNA sequences. The optimal feature extraction strategy was selected from a group that included mononucleotide, dimer, trimer, tetramer, pentamer, and hexamer composition, using a support vector machine. Sc-ncDNAPred achieved an overall accuracy of 0.98. For the convenience of users, an online web server has been built at: http://server.malab.cn/Sc_ncDNAPred/index.jsp.
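The k-mer composition features named in the abstract (mononucleotide through hexamer) can be illustrated with a minimal sketch. The function name `kmer_composition`, the normalization by the number of valid windows, and the handling of ambiguous bases are assumptions made for illustration; this excerpt does not state how the authors computed the features.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_composition(sequence, k):
    """Return the normalized frequency of every possible k-mer.

    The vector has 4**k entries in lexicographic order (AA..., AC..., ...).
    Normalizing by the number of valid windows is an assumption of this sketch.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical toy sequence; k=4 corresponds to tetramer composition.
tetramer_features = kmer_composition("ACGTACGTTTGACA", 4)
print(len(tetramer_features))
```

A vector like this, computed for each sequence, could then be passed to any support vector machine implementation; the dimensionality grows as 4^k, from 4 features for mononucleotides to 4096 for hexamers.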

Highlights

  • After the implementation of many whole-genome sequencing projects, a growing body of research has shown that non-coding DNA is a major component of the genome

  • The second-best prediction performance was obtained with trimer nucleotide composition (TNC), with an accuracy of 96.93%, a sensitivity of 96.62%, a specificity of 97.22%, and a Matthews correlation coefficient (MCC) of 0.939

  • Tetramer nucleotide composition (TrNC) was adopted as the feature set for the final Sc-ncDNAPred model
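The four measures quoted in the highlights (accuracy, sensitivity, specificity, and MCC) all follow from the counts of a binary confusion matrix. The sketch below uses their standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's results.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard binary classification measures from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
scores = binary_metrics(tp=90, tn=95, fp=5, fn=10)
print(scores)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside sensitivity and specificity.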

Summary

Introduction

After the implementation of many whole-genome sequencing projects, a growing body of research has shown that non-coding DNA (ncDNA) is a major component of the genome. The function of most ncDNAs is still unknown (Khurana et al., 2016), but some studies (Horn et al., 2013; Huang et al., 2013; Vinagre et al., 2013; Puente et al., 2015; Hu et al., 2017, 2018; Rheinbay et al., 2017; Liao et al., 2018; Zhang W. et al., 2018) have shown that most cancer-related gene mutations are located in ncDNA regions. All the above studies require a large amount of DNA data
