Abstract

Clinical and population health researchers have the opportunity to create combined datasets for clinical cancer research that are vastly larger and, therefore, offer the potential to be more accurate and useful than those in traditional clinical research environments. The difficulty in creating these datasets lies in cleaning and validating data from disparate sources and bringing them into an easily analyzable format. In addition, given the emphasis on cancer catchment, linking these data with other publicly available data resources (e.g., census and environmental exposure data) is critical for forming the most accurate picture of the cancer patient population. Through the Cleveland Institute for Computational Biology, we have developed two data science techniques that (1) hash and match patients across systems/registries and (2) geocode patients at the individual address level. Identifying potential duplicate records is an essential step in data cleansing, as it allows for a more accurate count of unique patients, a more accurate picture of disease burden, and a better view of clinical outcomes of care over time. This identification of duplicate records is complicated by privacy rules adopted by many countries, such as the U.S. Health Insurance Portability and Accountability Act (HIPAA). Individually identifiable health information must be masked from parties that do not have viewing permission. To simplify the process of cleaning duplicate medical records and generating a unique patient identifier while retaining HIPAA protections, we have created an easy-to-use software tool called “The De Duplicate and De Identify Research Engine” or “DeDeRE,” which includes four services: hashing, matching, reporting, and optimization.
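As an illustration only, the hashing and matching steps described above can be sketched in a few lines of Python. The abstract does not describe DeDeRE's actual algorithm, so this sketch assumes a keyed hash (HMAC-SHA256) over normalized name and birth-date fields, followed by exact matching on the resulting digests. The field names, the record identifiers, the secret key, and the helpers `normalize`, `hash_patient`, and `find_duplicates` are all hypothetical.

```python
import hashlib
import hmac
import unicodedata
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop every non-alphanumeric character."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def hash_patient(record: dict, secret_key: bytes) -> str:
    """Return a keyed digest of the identifying fields (hypothetical field names)."""
    fields = (record["first_name"], record["last_name"], record["birth_date"])
    message = "|".join(normalize(field) for field in fields).encode("utf-8")
    # A keyed hash means the digest cannot be recomputed without the secret key.
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def find_duplicates(records: list, secret_key: bytes) -> list:
    """Group record IDs whose digests are identical; return groups of two or more."""
    groups = defaultdict(list)
    for record in records:
        groups[hash_patient(record, secret_key)].append(record["record_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    # Made-up records from two hypothetical registries.
    demo_key = b"demo-key-not-for-production"
    demo_records = [
        {"record_id": "A1", "first_name": "José", "last_name": "O'Neil",
         "birth_date": "1950-01-02"},
        {"record_id": "B7", "first_name": "jose", "last_name": "ONeil",
         "birth_date": "1950/01/02"},
        {"record_id": "B8", "first_name": "Maria", "last_name": "Lopez",
         "birth_date": "1962-11-30"},
    ]
    print(find_duplicates(demo_records, demo_key))
```

Exact matching on digests only links records that agree after normalization; records that differ by a typo or a changed surname would need a more tolerant matching strategy, which this sketch does not attempt.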
DeDeRE was developed to integrate and de-duplicate patient records across disparate datasets so that the resulting datasets can be combined into larger sets while still protecting patient privacy. The software was developed in 2018 and used as a test platform for identifying duplicate patients in multiple U.S. state cancer registries participating in the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, hashing and matching millions of records from multiple state cancer registries allowed duplicate cases, both intrastate and interstate, to be identified and adjudicated.
Cancer health disparities are associated with many factors, including one’s location of residence. Location of residence can be associated with access to healthy food, transportation, access to health services, etc., some of which are known risk factors for specific types of cancer. In Ohio, we have obtained individual-level identified information from the Ohio Cancer Incidence Surveillance System (OCISS), allowing us to geocode all cancer patients in Ohio by individual address. We focus our analysis on the Cleveland Metro area and layer in other available data for the area in order to better understand differences in cancer incidence by location. Both of the data science approaches presented here are being applied in the Cleveland Metro area, allowing us to investigate important questions related to our Cleveland area cancer catchment population.
Citation Format: Jill S. Barnholtz-Sloan. Data science techniques for understanding your cancer catchment area [abstract]. In: Proceedings of the AACR Special Conference on Modernizing Population Sciences in the Digital Age; 2019 Feb 19-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(9 Suppl):Abstract nr IA09.
