Abstract

Although transcriptomic data are available for an abundance of species, a significant number of species still lack sequenced genomes, making it difficult to study gene function and expression in these organisms. While de novo transcriptome assembly can be used to assemble protein-coding transcripts from RNA-sequencing (RNA-seq) data, the datasets used often comprise samples from arbitrarily selected or similar experimental conditions and may therefore fail to capture condition-specific transcripts. We developed the Large-Scale Transcriptome Assembly Pipeline for de novo assembled transcripts (LSTrAP-denovo) to automatically generate transcriptome atlases of eukaryotic species. Specifically, given an NCBI TaxID, LSTrAP-denovo can (1) filter undesirable RNA-seq accessions based on read data, (2) select RNA-seq accessions via unsupervised machine learning to construct a sample-balanced dataset for download, (3) assemble transcripts via over-assembly, (4) functionally annotate coding sequences (CDS) from assembled transcripts, and (5) generate transcriptome atlases in the form of expression matrices for downstream transcriptomic analyses. LSTrAP-denovo is easy to implement, is written in Python, and is freely available at https://github.com/pengkenlim/LSTrAP-denovo/.
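To make step (2) more concrete, the following is a minimal sketch of how a sample-balanced selection could work, not the pipeline's actual implementation. The abstract does not name the clustering algorithm or the features used, so the use of k-means, the two-dimensional feature vectors, the accession labels, and the function names `kmeans_labels` and `select_balanced` are all assumptions made for illustration. The idea is to cluster accessions by some per-sample profile and then draw from the clusters in turn, so that an over-represented condition does not dominate the download set.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, max_iterations=100):
    """Plain Lloyd's k-means with deterministic farthest-first initialisation.

    Assumption: k-means stands in for the unspecified unsupervised method.
    Returns one cluster label (0..k-1) per input point.
    """
    # Farthest-first seeding: start at the first point, then repeatedly
    # add the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def select_balanced(accessions, labels, n_total):
    """Pick up to n_total accessions, taking one per cluster in rotation."""
    clusters = defaultdict(list)
    for accession, label in zip(accessions, labels):
        clusters[label].append(accession)
    queues = [clusters[label] for label in sorted(clusters)]

    selected = []
    depth = 0
    while len(selected) < n_total and any(depth < len(q) for q in queues):
        for queue in queues:
            if depth < len(queue) and len(selected) < n_total:
                selected.append(queue[depth])
        depth += 1
    return selected


# Hypothetical accessions: six samples from one condition, two from another.
ACCESSIONS = ["ACC_A1", "ACC_A2", "ACC_A3", "ACC_A4", "ACC_A5", "ACC_A6",
              "ACC_B1", "ACC_B2"]
# Hypothetical 2-D feature vectors (for example, a reduced expression profile).
FEATURES = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4), (0.2, 0.2),
            (9.8, 10.0), (10.1, 9.9)]

LABELS = kmeans_labels(FEATURES, k=2)
SELECTED = select_balanced(ACCESSIONS, LABELS, n_total=4)
```

With these toy inputs, a naive "first four accessions" choice would contain only the over-represented condition, whereas the round-robin draw takes two accessions from each cluster. In practice the number of clusters would also have to be chosen, which this sketch leaves to the caller.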
