Big data in oncology: Challenges and solutions.

Neil J Shah,Theodora Bakker,Jonathan E Rosenberg,Michael Moran,Adam Watson,Chelsea Nichols,Sene Martin,Peter D Stetson,Katherine Panageas,Ibrahim Shah,John Philip,Andrew Niederhausern,Gregory J Riely,Jasme Lee,Hamza Ahmad,Nadia Bahadur,Gianna West,Alysha Insinga,Samantha Brown

doi:10.1200/jco.2023.41.16_suppl.6513

Abstract

6513 Background: Oncology care is complex and often multimodal. Despite recent technological advances, only a fraction of oncology data is structured in a form that is feasible to use for research. Here we present a step-by-step method for building a novel, comprehensive pan-cancer oncology data model using standard data definitions and industry-standard benchmarks.

Methods: A team of 133 members was assembled, including a project manager, bioinformatics engineers, business analysts, biostatisticians, data stewardship experts, clinical curators, and quality assurance (QA) managers. We first identified data domains that capture a comprehensive patient journey, using existing oncology data models as a starting point, including NAACCR, PRISSMM (Deb Schrag & Eva Lepisto), and mCODE (ASCO). A common data model was developed using standard terms plus 5-10 disease-specific elements (DSEs). REDCap was used as the database platform because it is HIPAA compliant and allows customization. The data were stored in AuroraDB using an architecture and products that provide scalability from both an integration and a consumption perspective.

Results: We identified 10 data domains, including 186 distinct elements: demographics (20), comorbidities (2), cancer diagnosis & staging (27), pathology (45), imaging (18), medications (11), oncology responses (11), radiation treatments (14), cancer surgeries (11), cancer genomics (19), tumor markers (8), and vitals (8). Standard ontologies were used, including ICD-O-3 codes for histology, ICD-10 codes for comorbidities, CPT codes for cancer surgeries, and CTCAE 5.0 for toxicities. We identified a data steward for each tumor type across the medical oncology, surgery, pathology, radiology, and radiation oncology domains; these stewards aided curator training and the identification of DSEs. QA managers and analysts performed 20% source data verification. In addition, we built REDCap rules (applied within a single form) and complex queries (applied across multiple forms). To support QA and clinical engagement, interactive Tableau dashboards were constructed. Timing and quality errors were also monitored via Tableau dashboards at the individual curator level to provide timely feedback, which improved data quality and curation efficiency in real time. The medical oncology and radiology domains were the most time-consuming to curate, whereas cancer diagnosis was the most difficult.

Conclusions: We have collected genomic and phenomic data for 15,579 patients across six tumor types to date. Collecting comprehensive oncology data across tumor types is possible but requires institutional support, collaboration between clinical & informatics teams, and a dedicated QA team. [Table: see text]
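As an illustration of the 20% source data verification step reported in the Results, the following is a minimal sketch of how records might be sampled for review. It is not the authors' implementation: the abstract does not say whether the 20% was drawn per curator, per form, or per field, so the per-curator stratification, the `select_sdv_sample` function, and the `record_id` and `curator` field names are all assumptions made for the example.

```python
import random
from collections import defaultdict


def select_sdv_sample(records, fraction=0.20, seed=0):
    """Pick roughly `fraction` of each curator's records for source data verification.

    Hypothetical helper: stratifying by curator is an assumption, chosen so that
    every curator receives feedback; the abstract only states that 20% was verified.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_curator = defaultdict(list)
    for rec in records:
        by_curator[rec["curator"]].append(rec["record_id"])

    sample = {}
    for curator, ids in by_curator.items():
        # Verify at least one record per curator, even for very small workloads.
        k = max(1, round(len(ids) * fraction))
        sample[curator] = sorted(rng.sample(ids, k))
    return sample


# Toy data: 20 curated records split evenly between two hypothetical curators.
records = [{"record_id": i, "curator": "A" if i % 2 else "B"} for i in range(1, 21)]
sdv = select_sdv_sample(records)
for curator, ids in sorted(sdv.items()):
    print(curator, ids)
```

Sampling within each curator's workload, rather than across the whole pool, is one way to support the curator-level feedback the abstract describes; a simple random sample across all records would be an equally valid reading of "20%".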
