Abstract

Noncoding RNA (ncRNA) is a class of RNA that is not translated into protein but plays an important role in many biological processes and in diseases, including cancers. With the development of next-generation sequencing technology, thousands of novel RNAs with long open reading frames (ORFs, longest ORF length > 303 nt) and short ORFs (longest ORF length ≤ 303 nt) have been discovered in a short time. Identifying ncRNAs more precisely among novel unannotated RNAs is an important step for RNA functional analysis, the study of RNA regulation, and related tasks. However, most previous methods use only sequence features. Moreover, most of them focus on long-ORF RNA sequences and are not adapted to short-ORF RNA sequences. In this paper, we propose a new, reliable method called NCResNet. NCResNet takes as input 57 hybrid features from four categories (sequence, protein, RNA structure, and RNA physicochemical properties) and introduces feature enhancement and deep feature learning strategies in a neural network model to address this problem. Experiments on benchmark datasets of 8 species show that NCResNet achieves higher accuracy and a higher Matthews correlation coefficient (MCC) than other state-of-the-art methods. In particular, on four short-ORF RNA sequence datasets (mouse, Saccharomyces cerevisiae, zebrafish, and cow), NCResNet improves on other state-of-the-art methods by more than 10% in accuracy and more than 15% in MCC. On long-ORF RNA sequence datasets, NCResNet also achieves better accuracy and MCC than other state-of-the-art methods on most test datasets. Code and data are available at https://github.com/abcair/NCResNet.
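To make the long-ORF / short-ORF split concrete, the following is a minimal Python sketch of how a sequence could be assigned to one group using the 303 nt threshold quoted above. The ORF definition used here (ATG start to the first in-frame stop codon, forward strand only, stop codon counted in the length) is an assumption for illustration; the abstract does not state how the longest ORF is computed, and the function names `longest_orf_length` and `orf_class` are hypothetical.

```python
# Illustrative sketch only: the ORF definition below is an assumption,
# not necessarily the one used by NCResNet.
STOP_CODONS = {"TAA", "TAG", "TGA"}
SHORT_ORF_MAX_NT = 303  # threshold quoted in the abstract


def longest_orf_length(seq: str) -> int:
    """Return the length (nt) of the longest ATG..stop ORF on the forward strand."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def orf_class(seq: str) -> str:
    """Label a sequence as short-ORF (<= 303 nt) or long-ORF (> 303 nt)."""
    if longest_orf_length(seq) <= SHORT_ORF_MAX_NT:
        return "short-ORF"
    return "long-ORF"
```

For example, under these assumptions `orf_class("AUGAAAUAG")` places a 9 nt ORF in the short-ORF group.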

Highlights

  • Non-coding RNA cannot be translated into protein, but it is involved in many crucial biological processes, such as gene expression (Wang et al, 2019), gene regulation (Deaton and Bird, 2011; Dykes and Emanueli, 2017), and gene silencing (Singh et al, 2018)

  • We test NCResNet on an independent dataset downloaded from the RefLnc study (Jiang et al, 2019), which contains 20,364 novel long open reading frame (ORF) ncRNAs and 7,142 novel short-ORF ncRNAs assembled from real clinical samples, with no overlap with the previous training and test datasets

  • The results show that integrating the four feature categories is effective for distinguishing ncRNAs from protein-coding RNAs (pcRNAs)

Summary

Introduction

Non-coding RNA (ncRNA) cannot be translated into protein, but it is involved in many crucial biological processes, such as gene expression (Wang et al, 2019), gene regulation (Deaton and Bird, 2011; Dykes and Emanueli, 2017), and gene silencing (Singh et al, 2018). Differentiating ncRNAs from the large number of unclassified sequences by biological experiments is time-consuming and labor-intensive (Lu et al, 2019). Coding potential calculator version 2 (CPC2) (Kang et al, 2017), an updated version of CPC designed to run faster, uses sequence-intrinsic features and a support vector machine (SVM) to differentiate ncRNAs from protein-coding RNAs (pcRNAs). Many previous methods aim to distinguish long noncoding RNAs (lncRNAs) from pcRNAs, such as iSeeRNA (Sun et al, 2013a), Coding-Non-Coding Index (CNCI) (Sun et al, 2013b), PLEK (Li et al, 2014), FEELnc (Wucher et al, 2017), DeepLNC (Tripathi et al, 2016), COME (Hu et al, 2017), LncRNAnet (Baek et al, 2018), and LncFinder (Han et al, 2018). DeepLNC uses multiple k-mer frequencies as features to train a deep neural network.
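As an illustration of what "k-mer frequencies as features" means, the sketch below builds a fixed-length feature vector by concatenating normalized k-mer frequencies for several values of k. It is a generic Python example, not DeepLNC's implementation: the choice of k values, the normalization, the handling of ambiguous bases, and the function names `kmer_frequencies` and `multi_kmer_features` are all assumptions made here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int) -> list:
    """Relative frequency of every possible k-mer over A/C/G/T, in lexicographic order."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    # Count all overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored (assumption).
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def multi_kmer_features(seq: str, ks=(1, 2, 3)) -> list:
    """Concatenate k-mer frequency vectors for each k in ks (ks is an assumed choice)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features
```

With `ks=(1, 2, 3)` the vector has 4 + 16 + 64 = 84 entries regardless of sequence length, which is what lets sequences of different lengths be fed to the same classifier.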
