Abstract

Background: Cancer Epidemiology Cohorts (CECs) amass vast amounts of participant health, lifestyle, environmental, genetic, and biologic data. Sharing these well-annotated participant data supports research across diverse cancer outcomes. However, CEC data often contain protected health information (PHI), which limits use of the open-access databases that have become a boon to other fields. Instead, most CECs create custom datasets for each project, which is labor intensive and hinders data sharing.

Objective: The California Teachers Study (CTS), a prospective CEC of 133,477 women that began in 1995, sought to address this issue by building a user-friendly, web-based tool that lets researchers independently and flexibly choose CTS data (i.e., execute cohort selection). We also aimed to maintain existing data privacy and security; support a full range of study designs, exposures, and outcomes; and provide real-time visualizations of query results.

Methods: To support these computational demands, we chose an open-source column-oriented database management system optimized for fast analysis of large volumes of data. We curated and tagged participant self-reported data for improved searching and sorting. We built conditional queries that are modified by user selections, and included free text entry to handle outliers. To address privacy, the web-based tool displays summary data only in aggregate and outputs the final dataset to a secure server. Users select their cancer endpoint definitions by SEER code, site group name, or ICD code; select analysis start and end points by participant-specific event type or hard-coded date; opt in or out of various censoring rules; and select self-reported data by questionnaire, topic, or search terms. After users submit the query, a folder is created for their project in the secure CTS remote desktop environment; the folder contains the dataset, a custom data dictionary, and starter scripts in SAS and in R.
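As an illustration only, the idea of a conditional query that changes with user selections and returns aggregate counts can be sketched as follows. This is not the CTS implementation: the `participants` table, its column names, the `Selection` fields, the single censoring rule, and the code values are all hypothetical, and Python's built-in `sqlite3` stands in for the column-oriented database described above.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass
class Selection:
    """User choices from a hypothetical query form."""
    seer_codes: List[int]        # endpoint definition, given as site codes
    start_date: str              # analysis start (ISO date)
    end_date: str                # analysis end (ISO date)
    censor_at_move: bool = True  # one opt-in censoring rule (assumed)


def build_summary_query(sel):
    """Return (sql, params) for an aggregate-only summary of the cohort."""
    if not sel.seer_codes:
        raise ValueError("select at least one endpoint code")
    placeholders = ", ".join("?" for _ in sel.seer_codes)
    # Conditions that define an event; values are bound, never interpolated.
    event = [f"seer_code IN ({placeholders})", "dx_date BETWEEN ? AND ?"]
    params = list(sel.seer_codes) + [sel.start_date, sel.end_date]
    # The censoring rule adds a condition only when the user opts in.
    if sel.censor_at_move:
        event.append("(moved_out_date IS NULL OR dx_date <= moved_out_date)")
    event_sql = " AND ".join(event)
    sql = (
        "SELECT COUNT(*) AS n_participants, "
        f"SUM(CASE WHEN {event_sql} THEN 1 ELSE 0 END) AS n_cases "
        "FROM participants"
    )
    return sql, params


# Made-up rows: (id, seer_code, dx_date, moved_out_date)
ROWS = [
    (1, 26000, "2001-05-01", None),          # event within the window
    (2, 26000, "2005-03-01", "2003-01-01"),  # diagnosed after moving away
    (3, 21041, "2002-02-02", None),          # a different site code
    (4, None, None, None),                   # no diagnosis
    (5, 26000, "1994-01-01", None),          # diagnosed before analysis start
]


def demo(censor_at_move):
    """Run the summary query against an in-memory toy table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE participants "
        "(id INTEGER, seer_code INTEGER, dx_date TEXT, moved_out_date TEXT)"
    )
    con.executemany("INSERT INTO participants VALUES (?, ?, ?, ?)", ROWS)
    sel = Selection([26000], "1995-01-01", "2015-12-31", censor_at_move)
    sql, params = build_summary_query(sel)
    result = con.execute(sql, params).fetchone()
    con.close()
    return result


print(demo(True), demo(False))
```

In this sketch, toggling `censor_at_move` changes only the event definition, so the participant count stays fixed while the case count changes. Only counts leave the query, which mirrors the aggregate-only display described in the Methods.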
Results: Our first fully self-service data query module supports cancer cohort analyses and requires no CTS staff intervention to provision data. Two-thirds of recent CTS data requests have been for cancer cohort analyses; if this continues, CTS could see up to a 60% reduction in staff effort spent on creating datasets. This improvement gives researchers instant access and improves data sharing not only for cancer cohort analyses but across the full range of research projects.

Discussion: A common pitfall when automating complex processes like cohort selection is focusing on edge cases. By instead automating the single most common request type (cancer cohort analyses), we immediately added value for our staff and users. Including data request modules for study designs and data domains that cannot currently be automated lets the tool function as a single channel through which all data requests flow. Other CECs may also consider automating the most common aspects of their data request processes and sharing their results.

Citation Format: Jennifer L. Benbow, Emma Spielfogel, Kai Lin, Sandeep Chandra, Paul Hughes, James V. Lacey. Self-serve data and cohort selection in the California Teachers Study: A web-based tool [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 894.
