Abstract

Library preparation protocols for RNA sequencing (RNA-Seq) vary and affect the downstream analysis of RNA-Seq data. Available RNA-Seq datasets often lack library construction protocol information in their metadata, yet this information is necessary to compare RNA-Seq datasets appropriately. For example, non-polyadenylated transcripts are measured when a ribosomal RNA-depletion method is used (riboD library preparation), but not when transcripts are selected by their poly(A) tails (polyA library preparation). Without knowing the library preparation method, it would appear that the non-polyadenylated transcripts were expressed at much higher levels in the samples prepared via riboD. To tackle this issue, we have developed a Random Forest classifier that distinguishes between riboD and polyA RNA-Seq datasets. A grid search over the number of trees, the maximum depth, and the minimum number of datasets in each leaf node determined the best parameters to be 100, 8, and 1, respectively. However, applying the model to the Pediatric Brain Tumor Atlas (PBTA) showed strong overfitting for polyA samples. We therefore examined the performance of the model at each maximum depth from 1 to 8 and determined that the best performance on our validation set was achieved with a maximum depth of 1. We then trained our classifier on our own curated compendiums of pediatric cancer polyA and riboD samples, comprising 188 and 264 samples respectively, after balancing the two datasets in terms of disease prevalence and selecting the 5,000 most variable genes as the default input dimensionality of the model. For samples whose genes do not exactly match the predetermined 5,000 genes, we substitute the expression of each missing gene with the mean expression of that gene observed in the training data.
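The feature selection and missing-gene substitution described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy gene names and expression values, and the use of population variance as the variability measure are all assumptions, since the abstract does not specify them.

```python
from statistics import mean, pvariance


def top_variable_genes(train, k):
    """Return the k genes with the highest variance across training samples.

    `train` maps gene name -> list of expression values (one per sample).
    Population variance is an assumed variability measure; ties are broken
    by gene name so the ordering is deterministic.
    """
    ranked = sorted(train, key=lambda g: (-pvariance(train[g]), g))
    return ranked[:k]


def training_means(train, genes):
    """Mean training expression of each selected gene."""
    return {g: mean(train[g]) for g in genes}


def align_sample(sample, genes, means):
    """Build a feature vector in the fixed gene order.

    Genes absent from `sample` are filled with their training mean.
    """
    return [sample.get(g, means[g]) for g in genes]


# Toy training data (hypothetical values); the abstract uses 5,000 genes.
train = {
    "GENE_A": [1.0, 5.0, 9.0],
    "GENE_B": [2.0, 2.0, 2.0],
    "GENE_C": [0.0, 4.0, 2.0],
}

genes = top_variable_genes(train, k=2)
means = training_means(train, genes)

# A new sample that lacks GENE_C, one of the selected genes.
new_sample = {"GENE_A": 7.5}
vector = align_sample(new_sample, genes, means)
print(genes, vector)
```

Fixing the gene order at training time keeps the input dimensionality constant, so a sample with an incomplete gene set can still be passed to the trained classifier.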
We show that the classifier achieves 100% accuracy in assigning samples to their respective library preparation protocols in GTEx (all polyA), CCLE (all polyA), and 7 of 9 SRA projects. Five of the SRA projects contained only riboD datasets, three contained only polyA datasets, and one was 50% riboD and 50% polyA. Notably, the SRA datasets are not all cancer-related, showing the power of our model to distinguish between library preparation protocols in vastly different settings. The model will be made available on Docker so that it is readily accessible for application to new samples. Our model serves as an important step towards robust library preparation identification. Including samples from diverse contexts in the training procedure would make our model more widely applicable.

Citation Format: Ioannis Anastopoulos, Holly Beale, Geoff Lyle, Allison Cheney, Olena M. Vaske, Joshua M. Stuart. Detection of RNA-Seq library preparation type via random forest [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 2287.
