Abstract

Gene expression analysis of single samples shows increasing promise for clinical applications. However, obtaining high-quality RNA from a human tumor sample can be challenging because medical, surgical, and pathological requirements often lead to sparse or degraded RNA. The variability in RNA quality presents challenges for defining input sample requirements, which are needed to calculate the sensitivity, specificity, and reference ranges required for a Clinical Laboratory Improvement Amendments (CLIA)-approved test. Clinical analysis of a single RNA-Seq dataset for the purpose of gene expression profiling involves not only the patient's sample but also a comparison cohort. We use 12,236 total tumor samples and require at least 20 samples for within-disease comparisons. Many of these samples do not have associated metadata about the quality of the sample, so we have prioritized quality measures that can be derived from the sequence data alone. To characterize the variability present in RNA-Seq datasets, we analyzed paired-end Illumina RNA sequencing (RNA-Seq) data from 1088 tumor samples from 29 data providers. We categorized reads based on where and how well they map to the genome, as well as by their PCR duplicate status. We defined reference ranges for five types of reads found in sequencing data: unmapped (0-13%); multi-mapped (2-15%); mapped duplicate (2-66%); mapped non-exonic (0-26%); and mapped, exonic, non-duplicate (MEND, 27-76%). Only 64% of the 1088 tumor samples had read type fractions within the reference ranges. Of the remainder, most exceeded the reference ranges for more than one type of read. We then measured the relationship of sensitivity and specificity to input MEND read depth. We subsampled five deeply sequenced samples. With each subsample, we identified exceptionally highly expressed genes and samples with similar gene expression profiles.
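As an illustration of how the reference ranges above might be applied to one sample, the following is a minimal sketch and not the authors' implementation. The function name, the dictionary keys, and the example percentages are hypothetical. The sketch also assumes that each read type is expressed as a percentage of total reads and that the range bounds are inclusive, neither of which the abstract states.

```python
# Reference ranges in percent, as quoted in the abstract.
# Assumption: percentages are of total reads and bounds are inclusive.
REFERENCE_RANGES = {
    "unmapped": (0, 13),
    "multi_mapped": (2, 15),
    "mapped_duplicate": (2, 66),
    "mapped_non_exonic": (0, 26),
    "mend": (27, 76),
}


def out_of_range_read_types(percentages):
    """Return the read types whose percentage lies outside its reference range.

    `percentages` maps each read type name to its percentage for one sample.
    (Hypothetical helper; names are illustrative only.)
    """
    missing = set(REFERENCE_RANGES) - set(percentages)
    if missing:
        raise ValueError(f"missing read types: {sorted(missing)}")
    return [
        name
        for name, (low, high) in REFERENCE_RANGES.items()
        if not low <= percentages[name] <= high
    ]


# Made-up sample with heavy PCR duplication and, as a result, few MEND reads.
example = {
    "unmapped": 5.0,
    "multi_mapped": 8.0,
    "mapped_duplicate": 70.0,
    "mapped_non_exonic": 4.0,
    "mend": 13.0,
}
print(out_of_range_read_types(example))  # ['mapped_duplicate', 'mend']
```

In this made-up sample, two read types fall outside their ranges at once, which mirrors the observation that most out-of-range samples were out of range for more than one type of read.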
With subsampling to 20 million MEND reads, we detected over-expressed genes (“up-outlier” genes) with a median sensitivity of 96.1% and specificity of 99.8%; sample similarity had 96.6% sensitivity and 100.0% specificity. We estimate that a sample sequenced to a depth of 70 million total reads will typically have sufficient data for the up-outlier and sample-similarity gene expression analysis assays described here. With this analysis, we have identified a conservative approach to measuring the quality of RNA-Seq read data, which can then be used to define the sensitivity and specificity of single-sample assays to support their ultimate clinical adoption.

Citation Format: Holly C. Beale, Jacquelyn M. Roger, Matthew A. Cattle, Liam T. McKay, Katrina Learned, Geoff Lyle, Ellen T. Kephart, Rob Currie, Du Linh Lam, Lauren Sanders, Jacob Pfeil, John Vivian, Isabel Bjork, Sofie R. Salama, David Haussler, Olena M. Vaske. Determining accuracy of RNA sequencing data for gene expression profiling of single samples [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 5464.
