Abstract

Long noncoding RNAs (lncRNAs) are a class of RNAs that are longer than 200 nt and do not encode proteins. Studies have shown that lncRNAs can regulate gene expression at the epigenetic, transcriptional, and posttranscriptional levels. They are not only closely related to the occurrence, development, and prevention of human diseases, but also regulate plant flowering and participate in plant responses to abiotic stresses such as drought and salt. Accurate and efficient identification of lncRNAs therefore remains an essential task in this field. Many identification tools based on machine learning and deep learning algorithms already exist, but most use human and mouse gene sequences as training sets, seldom plant sequences, and apply only one feature selection method, or one class of methods, after feature extraction. We developed an identification model covering dicots, monocots, algae, mosses, and ferns. We compared 20 feature selection methods (seven filter and thirteen wrapper methods), each combined with seven classifiers, while considering both the correlation between features and model redundancy. The WOA-XGBoost-based model performed better, with an accuracy of 91.55%, an AUC of 96.78%, and an F1 score of 91.68%. Meanwhile, the feature subset was reduced to 23 features, which effectively improved prediction accuracy and modeling efficiency.
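The three metrics reported above can be illustrated with a minimal, dependency-free sketch. It assumes binary labels (1 for lncRNA, 0 for mRNA) and a hypothetical list of classifier scores; the toy data and the 0.5 decision threshold are illustrative assumptions, not values from the study.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive sample scores higher than a random negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 3 lncRNAs and 3 mRNAs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))
```

Accuracy and F1 depend on the chosen decision threshold, whereas AUC is threshold-independent, which is one reason the three are usually reported together.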

Highlights

  • Noncoding RNA (ncRNA) refers to a functional RNA molecule that cannot be translated into a protein; long noncoding RNAs (lncRNAs) are a class of ncRNAs longer than 200 nt that were previously considered “noise” and ignored

  • To establish a plant lncRNA identification model with strong generalization ability, we used five representative plant species: Arabidopsis thaliana, Brachypodium distachyon, Chlamydomonas reinhardtii, Physcomitrella patens, and Selaginella moellendorffii, hereinafter referred to as AT, BD, CR, PP, and SM. The positive sample data were obtained from CANTATAdb 2.0, an online database covering 39 plant species, such as Arabidopsis thaliana, Zea mays, and Oryza, and three algae [25]. The negative data were downloaded from RefSeq, which contains nonredundant, biologically significant gene and protein sequences provided by the National Center for Biotechnology Information (NCBI) and allows gene sequences to be screened by species, molecule type, source database, sequence length range, and so on


Summary

Introduction

Noncoding RNA (ncRNA) refers to a functional RNA molecule that cannot be translated into a protein. LncRNAs are a class of ncRNAs longer than 200 nt that were previously considered “noise” and ignored. It was not until 1984, when Pachnis and his colleagues found the H19 gene in mice, the first eukaryotic lncRNA and one that is highly expressed during embryonic development [1], that the study of lncRNAs began to attract increasing attention. Current research on lncRNAs generally focuses on lncRNA screening, identification, expression, and localization, so it is necessary to screen lncRNAs out from mRNAs accurately and efficiently. Several tools already exist that can be used to analyze the coding potential of transcript sequences. Because lncRNAs participate in biological regulatory processes at the transcriptional, epigenetic, and posttranscriptional levels and are associated with diseases [13,14,15], researchers have mainly paid attention to the lncRNAs of humans, mice, and other vertebrates, while studies of plant lncRNAs have been relatively few. Urminder Singh et al. [19] developed consensus models for dicots and monocots with ten plant species.
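To make the idea of analyzing coding potential concrete, the sketch below combines the 200 nt length criterion with one widely used sequence feature, the length of the longest open reading frame (ORF). It is only an illustration and not the feature set or method of this study: the function names and the 300 nt ORF cutoff are hypothetical assumptions, and only the three forward reading frames are scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length in nt (stop codon included) of the longest ATG...stop ORF
    found in the three forward reading frames of the sequence."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def is_lncrna_candidate(seq: str, min_length: int = 200, max_orf: int = 300) -> bool:
    """Crude pre-filter: longer than min_length nt and no long ORF.
    max_orf = 300 nt is an assumed, illustrative cutoff."""
    return len(seq) > min_length and longest_orf_length(seq) < max_orf


# Hypothetical sequences for illustration only.
short_orf = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
coding_like = "ATG" + "GCT" * 120 + "TAA"

print(longest_orf_length(short_orf))      # ORF spans ATG through TAA
print(is_lncrna_candidate("A" * 250))     # long enough, no ORF
print(is_lncrna_candidate(coding_like))   # long ORF, so rejected
```

A rule of this kind is far too coarse on its own; machine-learning tools typically treat ORF length as one feature among many and let the classifier weigh them.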

