Privacy-Preserving Deep Learning NLP Models for Cancer Registries.

Lynne Penberthy, Hong-Jun Yoon, Linda Coyle, Isaac Hands, Brent Mumphrey, David Rust, Georgia Tourassi, Xiao-Cheng Wu, Jong Cheol Jeong, Mohammed Alawad, Eric B. Durbin, Shang Gao

doi:10.1109/tetc.2020.2983404

Abstract

Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL in tackling this and other real-world problems depends on the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles to data sharing across cancer registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer, which may contain patient identifiers. Thus, even distributing trained models across cancer registries can violate privacy. In this paper, we propose DL NLP model distribution via privacy-preserving transfer learning approaches that do not share sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics (tumor site, subsite, laterality, behavior, histology, and grade) from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare our proposed approach to conventional transfer learning without privacy preservation, single-registry models, and a model trained on centrally hosted data. The results show that transfer learning approaches, including data sharing and model distribution, significantly outperform the single-registry model. In addition, the best-performing privacy-preserving model distribution approach achieves average micro- and macro-F1 scores across all extraction tasks (0.823, 0.580) that are statistically indistinguishable from those of the centralized model (0.827, 0.585).

Highlights

Accurate, timely, and comprehensive cancer monitoring is critical for assessing the population-level impact of cancer and for informing population-based cancer control policies
We explored five transfer learning approaches: (i) transfer learning with a drop-embeddings model, (ii) acyclic transfer learning without a privacy-preserving model, (iii) cyclic transfer learning without a privacy-preserving model, (iv) acyclic transfer learning with a privacy-preserving model, and (v) cyclic transfer learning with a privacy-preserving model
Our experiments show that data and model sharing approaches among cancer registries consistently improve the performance of a multitask convolutional neural network (MT-CNN) natural language processing (NLP) model for information extraction from cancer pathology reports as compared to the single-registry model

Summary

Introduction

Accurate, timely, and comprehensive cancer monitoring is critical for assessing the population-level impact of cancer and for informing population-based cancer control policies. Population cancer registries annually process large volumes of unstructured pathology reports to extract cancer characteristics such as tu-

Mumphrey and X-C Wu are with Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, LA, 70112. Rust are with Kentucky Cancer Registry, University of Kentucky, Lexington, KY, 40506.

Methods

Results
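
The abstract reports average micro- and macro-F1 scores, and the gap between the two (for example 0.823 versus 0.580) is easier to read with the definitions in hand. The following is a minimal sketch of the standard definitions for a single-label classification task, not the authors' evaluation code. The function name `f1_scores`, the toy labels, and the choice to average macro-F1 over every class seen in either the true or predicted labels are assumptions made for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions.

    Hypothetical helper for illustration; macro-F1 is averaged over the
    union of classes appearing in y_true or y_pred (an assumption).
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-class F1, so rare classes weigh as much as common ones.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example with an imbalanced label distribution (made-up data).
y_true = ["breast"] * 8 + ["lung"] * 2
y_pred = ["breast"] * 8 + ["breast", "lung"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

In the toy data one of the two rare-class documents is misclassified, which barely moves micro-F1 but pulls macro-F1 down noticeably. The same effect is a plausible reading of why macro-F1 sits well below micro-F1 when tasks have many rare classes.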

Discussion

Conclusion