Abstract

Background: The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as a growing body of evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs.

Results: We developed an lncRNA prediction method that integrates information-entropy-based features with machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. Using these 6 features together with other features such as the open reading frame, we apply support vector machine, XGBoost and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more k-mer features, and the results show that our method achieves a higher area under the curve, up to 99.7905%.

Conclusions: We developed an accurate and efficient method that uses novel information entropy features to analyze and classify lncRNAs. Our method can also be extended to research on other functional elements in DNA sequences.
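As a rough illustration of the conventional sequence features the abstract mentions alongside the entropy features (k-mer content and the open reading frame), the sketch below computes both for a DNA sequence. It is a minimal example under stated assumptions, not the authors' implementation: the function names `kmer_frequencies` and `longest_orf_length` are hypothetical, only the forward strand is scanned, and the generalized topological entropy features are not reproduced here because this excerpt does not define them.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=3):
    """Relative frequency of every A/C/G/T k-mer, counted over overlapping windows.

    Windows containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF, stop codon included.

    Assumption: only the three forward-strand reading frames are scanned.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


if __name__ == "__main__":
    example = "CCATGAAACCCTAGGG"  # made-up sequence, for illustration only
    print(longest_orf_length(example))
    print(kmer_frequencies(example, k=2)["CC"])
```

Feature vectors built this way could then be passed to any of the classifiers named above; how the authors combine them with the entropy features is not specified in this excerpt.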

Highlights

  • The prediction of long non-coding RNA has attracted great attention from researchers, as a growing body of evidence indicates that various complex human diseases are closely related to lncRNAs

  • Feature selection by XGBoost: random forest and XGBoost are both built on decision trees

  • Before each split, the decision tree algorithm calculates the information entropy gain obtainable by splitting on each candidate feature, and automatically selects the feature that maximizes that gain
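The entropy-based split selection described in the last highlight can be sketched as follows. This is a minimal, self-contained illustration with hypothetical function names (`entropy`, `information_gain`, `best_split`) and toy data. It follows the entropy criterion as stated in the text; library implementations may use other split criteria (for example Gini impurity, or the gradient-based gain used in gradient boosting), so it should not be read as how XGBoost or any particular random forest implementation works internally.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_split(features, labels):
    """Return (feature name, threshold, gain) for the split with the largest gain.

    `features` maps a feature name to a list of numeric values, one per sample.
    Candidate thresholds are midpoints between consecutive distinct values.
    """
    best = (None, None, 0.0)
    for name, values in features.items():
        distinct = sorted(set(values))
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            gain = information_gain(values, labels, threshold)
            if gain > best[2]:
                best = (name, threshold, gain)
    return best


if __name__ == "__main__":
    # Toy data: labels 1 = lncRNA, 0 = other; feature values are invented.
    toy_features = {"orf_length": [1, 2, 3, 4], "noise": [1, 2, 1, 2]}
    toy_labels = [0, 0, 1, 1]
    print(best_split(toy_features, toy_labels))
```

In this toy example the `orf_length` feature separates the two classes perfectly and `noise` does not, so the split with the largest gain falls on `orf_length`.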


Summary

Introduction

The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as a growing body of evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs. According to the central dogma of molecular biology, genetic information is stored in protein-coding genes [1]. Increasing evidence shows that non-coding RNAs play a key role in a variety of basic and important biological processes [3]. The proportion of non-protein-coding sequences increases with the complexity of the organism [4]. Non-coding RNAs can be further divided into short non-coding RNAs and long non-coding
