Abstract

Long non-coding RNAs (lncRNAs) play essential roles in various biological processes, such as chromatin remodeling, post-transcriptional regulation, and epigenetic modifications. Despite their critical functions in regulating plant growth, root development, and seed dormancy, the identification of plant lncRNAs remains a challenge due to the scarcity of specific and extensively tested identification methods. Most mainstream machine learning-based methods used for plant lncRNA identification were initially developed using human or other animal datasets, and their accuracy and effectiveness in predicting plant lncRNAs have not been fully evaluated or exploited. To overcome this limitation, we retrained several models, including CPAT, PLEK, and LncFinder, using plant datasets and compared their performance with mainstream lncRNA prediction tools such as CPC2, CNCI, RNAplonc, and LncADeep. Retraining these models significantly improved their performance, and two of the retrained models, LncFinder-plant and CPAT-plant, alongside their ensemble, emerged as the most suitable tools for plant lncRNA identification. This underscores the importance of model retraining in tackling the challenges associated with plant lncRNA identification. Finally, we developed a pipeline (Plant-LncPipe) that incorporates an ensemble of the two best-performing models and covers the entire data analysis process, including reads mapping, transcript assembly, lncRNA identification, classification, and origin, for the efficient identification of lncRNAs in plants. The pipeline, Plant-LncPipe, is available at: https://github.com/xuechantian/Plant-LncRNA-pipline.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call