Abstract

Cancer cell lines (CCLs) are an integral part of modern cancer research but are susceptible to misidentification. The increasing popularity of sequencing technologies motivates the in-silico identification of CCLs based on their mutational fingerprint, but care must be taken when the underlying data are heterogeneous. We recently developed the proof-of-concept method Uniquorn 1, which could reliably identify heterogeneous sequencing data from selected sequencing technologies. Here we present Uniquorn 2, a generic and robust in-silico identification method for CCLs with DNA/RNA-seq and panel-seq information. We benchmarked Uniquorn 2 by cross-identifying 1612 RNA and 3596 panel-sized NGS profiles derived from 1516 CCLs, five repositories, four technologies and three major cancer panel designs. Our method achieves an accuracy of 96% for RNA-seq and 95% for mixed DNA-seq and RNA-seq identification. Even for a panel of only 94 cancer-related genes, accuracy remains at 82%, but it decreases with smaller panels. Uniquorn 2 is freely available as the R/Bioconductor package ‘Uniquorn’.

Highlights

  • Cancer Cell Lines (CCLs) are a critical tool for cancer researchers: they facilitate the reproduction of biological experiments, help investigate cancer etiology and aid in the functional characterization and validation of driver mutations

  • Uniquorn 2 is optimized for the identification of CCLs whose variant profiles were obtained by heterogeneous technologies and diverging computational processing pipelines

  • It complements established methods by addressing some of their key limitations: 1) the physical CCL sample is not required, unlike, for instance, Short-Tandem Repeat (STR)-based identification; 2) Uniquorn 2 is agnostic to sequencing technology and can reuse data provided by the creators of CCL libraries

Introduction

Cancer Cell Lines (CCLs) are a critical tool for cancer researchers: they facilitate the reproduction of biological experiments, help investigate cancer etiology and aid in the functional characterization and validation of driver mutations. An increasingly attractive alternative or complement to laboratory-based identification, such as Short-Tandem Repeat (STR) profiling, is the in-silico identification of CCLs based on features of their DNA or RNA sequence [5,16,17]. In this setting, only the sequence information of the to-be-identified CCL (termed query) and of the CCLs in a reference collection (termed reference library) is used. In practice such an approach can be difficult, because the sequencing scope, method and processing technology used to obtain the features of the reference library often differ from those of the query CCL, leading to notable differences. Uniquorn 1’s statistical model was designed for comparing features derived from whole exome sequences. It cannot be applied if, for instance, the reference CCLs were exome sequenced but only the transcriptome or only a panel of genes of the query CCL is available.
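
The query versus reference-library setting described above can be illustrated with a minimal sketch. This is not Uniquorn's statistical model, which is not described in this excerpt; it is a naive stand-in that ranks reference CCLs by the Jaccard similarity of their variant sets to the query. The function name `rank_references`, the CCL names and the variant strings are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_references(
    query: Set[str], library: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank reference CCLs by Jaccard similarity to the query variant set.

    Assumption: each variant is encoded as a string "chrom:pos:ref>alt".
    This is a naive overlap score, not the Uniquorn method.
    """
    scores = []
    for name, variants in library.items():
        union = query | variants
        # Jaccard similarity: shared variants divided by all observed variants.
        score = len(query & variants) / len(union) if union else 0.0
        scores.append((name, score))
    # Best-matching reference first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy reference library and query.
library = {
    "CCL_A": {"1:100:A>T", "2:200:G>C", "3:300:T>G"},
    "CCL_B": {"1:100:A>T", "4:400:C>A"},
}
query = {"1:100:A>T", "2:200:G>C", "5:500:G>A"}

ranking = rank_references(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A plain overlap score like this would likely suffer from exactly the problem the text raises: if the query covers only a transcriptome or a gene panel while the reference was exome sequenced, many reference variants fall outside the query's scope and depress the score even for the correct CCL.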
