Abstract

The National Cancer Institute's Surveillance, Epidemiology, and End Results Program releases research files of cancer registry data. These files include geographic information at the county level, but no finer. Access to finer geography, such as census tract identifiers, would enable richer analyses, for example, examination of health disparities across neighborhoods. To date, tract identifiers have been left off the research files because they could compromise the confidentiality of patients' identities. We present an approach to including tract identifiers based on multiply imputed, synthetic data. The idea is to build a predictive model of tract locations, given patient and tumor characteristics, and to simulate the tract of each patient by sampling at random from this model. For the predictive model, we use multivariate regression trees fitted to the latitude and longitude of the population centroid of each tract. We apply the approach to the registry data from California. The method yields synthetic data that reproduce many, but not all, analyses of census tract socioeconomic cancer disparities and that have relatively low disclosure risks, which we assess by comparing individual patients' actual and synthetic tract locations. We conclude with a discussion of how researchers working with cancer registry data can use synthetic data sets.
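The following is a minimal sketch of the general idea, not the implementation used with the California registry. It grows a small regression tree whose splits minimize the summed squared error of latitude and longitude jointly, then draws each patient's synthetic tract from the tracts observed in that patient's leaf. The toy records, the feature names `age` and `stage`, the helpers `grow`, `synthesize`, and `match_rate`, the `min_leaf` setting, and the uniform draw within a leaf are all assumptions made for illustration.

```python
import random

# Hypothetical toy records: two predictors, the true tract, and the
# latitude/longitude of that tract's population centroid.
patients = [
    {"age": 45, "stage": 1, "tract": "A", "lat": 34.0, "lon": -118.2},
    {"age": 47, "stage": 1, "tract": "A", "lat": 34.0, "lon": -118.2},
    {"age": 50, "stage": 2, "tract": "B", "lat": 34.1, "lon": -118.3},
    {"age": 70, "stage": 3, "tract": "C", "lat": 37.8, "lon": -122.4},
    {"age": 72, "stage": 3, "tract": "D", "lat": 37.7, "lon": -122.5},
    {"age": 75, "stage": 4, "tract": "C", "lat": 37.8, "lon": -122.4},
]

def sse(rows):
    """Squared deviations of (lat, lon) from their means, summed over rows."""
    if not rows:
        return 0.0
    n = len(rows)
    mlat = sum(r["lat"] for r in rows) / n
    mlon = sum(r["lon"] for r in rows) / n
    return sum((r["lat"] - mlat) ** 2 + (r["lon"] - mlon) ** 2 for r in rows)

def best_split(rows, features, min_leaf):
    """Find the (feature, threshold) split with the lowest bivariate SSE."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [r for r in rows if r[f] <= t]
            right = [r for r in rows if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    return best

def grow(rows, features, min_leaf=2):
    """Recursively grow the tree; each leaf keeps the tracts observed in it."""
    split = best_split(rows, features, min_leaf)
    if split is None or split[0] >= sse(rows):
        return {"tracts": [r["tract"] for r in rows]}
    _, f, t, left, right = split
    return {"feature": f, "threshold": t,
            "left": grow(left, features, min_leaf),
            "right": grow(right, features, min_leaf)}

def leaf_tracts(tree, patient):
    """Drop a patient down the tree and return the tracts in the leaf reached."""
    while "tracts" not in tree:
        go_left = patient[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["tracts"]

def synthesize(tree, rows, m, seed=0):
    """Create m synthetic data sets, one synthetic tract per patient in each."""
    rng = random.Random(seed)
    return [[rng.choice(leaf_tracts(tree, p)) for p in rows] for _ in range(m)]

def match_rate(actual, synthetic_draw):
    """Share of patients whose synthetic tract equals the actual tract."""
    return sum(a == s for a, s in zip(actual, synthetic_draw)) / len(actual)

tree = grow(patients, ["age", "stage"], min_leaf=2)
synthetic = synthesize(tree, patients, m=5, seed=1)
actual = [p["tract"] for p in patients]
rates = [match_rate(actual, draw) for draw in synthetic]
```

In this sketch, a larger `min_leaf` mixes more tracts within each leaf, which tends to lower the match rate between actual and synthetic tracts at the cost of weakening the relationship between tract and patient characteristics. The match rate is only a crude stand-in for the comparison of actual and synthetic locations described above.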
