Abstract
Accumulating evidence indicates that long non-coding RNAs (lncRNAs) share certain similarities with messenger RNAs (mRNAs) and are associated with numerous important biological processes, which creates a need for methods to distinguish them. A variety of methods based on machine learning algorithms have been developed to identify lncRNAs, providing significant basic data support for subsequent studies. However, many tools lack scalability, versatility and balance, and some rely on genome sequence and annotation. In this paper, we propose a convenient and accurate tool, “PreLnc”, which uses high-confidence lncRNA and mRNA transcripts to build prediction models through feature selection and classifiers. The false discovery rate (FDR) adjusted p-value and Z-value were used to analyze the tri-nucleotide composition of transcripts from different species. The experiments indicate that there are significant differences in RNA transcripts among plants, which may be related to evolutionary conservation and the fact that plants have been under evolutionary pressure for a longer time than animals. In combination with the Pearson correlation coefficient, we used the incremental feature selection (IFS) method and a comparison of multiple classifiers to build the model. Finally, the balanced random forest was used to construct the classifier, and PreLnc obtained 91.09% accuracy for 349,186 transcripts of animals and plants. In addition, on standard performance measurements, PreLnc performed better than other prediction tools.
Highlights
Long non-coding RNAs, defined as transcripts longer than 200 nucleotides with low protein-coding potential, were initially considered transcriptional “noise” because their expression level and sequence conservation are lower than those of messenger RNAs [1]. In recent years, accumulating evidence has indicated that lncRNAs exist widely in eukaryotes and are essential elements of the transcriptome [2]
Considering that classification performance remains a major concern, we proposed adding a subset of features for animals and plants, respectively, taking tri-nucleotides as another candidate combination, which performed well in distinguishing lncRNAs from messenger RNAs (mRNAs) [30,31]
To ensure that the model can better predict lncRNAs and mRNAs, we integrated the results of the feature selection and classifiers on animals and plants to unify the final standards of the model
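The tri-nucleotide features mentioned above can be illustrated with a minimal sketch. It assumes overlapping windows with a step of one nucleotide and relative frequencies over the 64 possible tri-nucleotides; the exact windowing and normalization used by PreLnc are not stated here, and the function name `trinucleotide_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 possible tri-nucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 64 tri-nucleotides.

    Assumption (not taken from the paper): windows overlap and advance one
    nucleotide at a time. Windows containing ambiguous bases such as N are
    skipped, and U is treated as T so RNA and cDNA input behave the same.
    """
    seq = sequence.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    # Only unambiguous windows contribute to the denominator.
    total = sum(counts[k] for k in TRINUCLEOTIDES)
    if total == 0:
        return {k: 0.0 for k in TRINUCLEOTIDES}
    return {k: counts[k] / total for k in TRINUCLEOTIDES}


if __name__ == "__main__":
    freqs = trinucleotide_frequencies("ATGATG")
    # Show only the tri-nucleotides that actually occur in the toy sequence.
    print({k: v for k, v in freqs.items() if v > 0})
```

Each transcript then yields a fixed-length vector of 64 values, which is a convenient input shape for feature selection and classifiers, whatever the specific subset finally retained for animals or plants.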
Summary
Long non-coding RNAs (lncRNAs), defined as transcripts longer than 200 nucleotides with low protein-coding potential, were initially considered transcriptional “noise” because their expression level and sequence conservation are lower than those of messenger RNAs (mRNAs) [1]. In recent years, accumulating evidence has indicated that lncRNAs exist widely in eukaryotes and are essential elements of the transcriptome [2]. In the gene expression network, some lncRNAs act as important regulators, regulating the nuclear structure and transcription of the cell nucleus, mRNA stability, translation and cytoplasmic post-translational modifications [7]. It is precisely because of the various specific expressions of lncRNAs in organisms that the annotation.