Abstract

Background: Privacy protection is an important issue in medical informatics, and differential privacy is a state-of-the-art framework for data privacy research. Differential privacy offers provable privacy against attackers who have auxiliary information, and it can be applied to data-mining models such as logistic regression. However, differentially private methods sometimes introduce too much noise and make outputs less useful. Given the public data available in medical research (e.g., from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to reduce the amount of noise that must be introduced.

Methodology: In this paper, we modify the update step of the Newton-Raphson method to propose a differentially private distributed logistic regression model based on both public and private data.

Experiments and results: We evaluate our algorithm on three different data sets and show its advantage over (1) a logistic regression model based solely on public data, and (2) a differentially private distributed logistic regression model based on private data, under various scenarios.

Conclusion: Logistic regression models built with our new algorithm on both private and public data sets demonstrate better utility than models trained on private or public data sets alone, without sacrificing the rigorous privacy guarantee.
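The modified Newton-Raphson update described in the Methodology can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's exact algorithm: the function names, the unit-L1-norm clipping assumption, and the Laplace noise calibration are our own, and the split of roles (noisy gradient from private data, noiseless Hessian from public data) is one plausible way to combine the two sources.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dp_newton_step(beta, X_pub, y_pub, X_priv, y_priv, epsilon, rng):
    """One Newton-Raphson update for logistic regression mixing public
    and private data: the private-data gradient is perturbed with
    Laplace noise, while the Hessian is estimated from public data only,
    so no extra privacy budget is spent on second-order information.
    """
    # Gradient of the log-likelihood from public data (no noise needed).
    g_pub = X_pub.T @ (y_pub - sigmoid(X_pub @ beta))
    # Noisy gradient from private data. Assuming each private row is
    # clipped to unit L1 norm, changing one record shifts the gradient
    # by at most 2 in L1 norm, so Laplace noise with scale 2 / epsilon
    # suffices for epsilon-differential privacy of this step.
    g_priv = X_priv.T @ (y_priv - sigmoid(X_priv @ beta))
    g = g_pub + g_priv + rng.laplace(scale=2.0 / epsilon, size=beta.shape)
    # Hessian of the log-likelihood from public data, plus a small ridge
    # term for numerical stability.
    p = sigmoid(X_pub @ beta)
    H = -(X_pub * (p * (1 - p))[:, None]).T @ X_pub - 1e-3 * np.eye(beta.size)
    # Newton ascent step on the (noisy) log-likelihood.
    return beta - np.linalg.solve(H, g)
```

In a distributed setting, each private site would release only its noisy gradient contribution, and the coordinating party holding public data would assemble the update.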

Highlights

  • Data about individuals are being collected at an unprecedented speed, which brings new opportunities for scientific discovery and healthcare quality improvement

  • Logistic regression models built with our new algorithm based on both private and public datasets demonstrate better utility than models trained on private or public datasets alone, without sacrificing the rigorous privacy guarantee

  • We introduce a new distributed logistic regression model that runs on many data sets, e.g., both public and private ones

Introduction

Data about individuals are being collected at an unprecedented speed, which brings new opportunities for scientific discovery and healthcare quality improvement. At the same time, there is increasing concern about people’s privacy and the inappropriate disclosure of sensitive information [1]. This problem is especially challenging in biomedicine [2], where information sharing is one of the main pillars supporting meaningful analysis of complex medical data. Differential privacy offers provable privacy against attackers who have auxiliary information, and it can be applied to data-mining models such as logistic regression. Given the public data available in medical research (e.g., from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to reduce the amount of noise that must be introduced.

Definition 3 (Sensitivity). A query function f's sensitivity under a norm ||·|| is defined by

Δf = max_{D, D'} ||f(D) − f(D')||,

where the maximum is taken over all pairs of data sets D and D' that differ in a single record.
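The sensitivity calibrates how much noise a mechanism must add to a query answer to achieve ε-differential privacy. A minimal sketch of the standard Laplace mechanism (the function name and the counting-query example are illustrative, not from the paper):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release a query answer with epsilon-differential privacy by
    adding Laplace noise with scale sensitivity / epsilon,
    calibrated to the query's L1 sensitivity."""
    return true_value + rng.laplace(scale=sensitivity / epsilon)

# A counting query ("how many patients have diagnosis X?") changes by
# at most 1 when a single record changes, so its sensitivity is 1.
rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(42.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller ε (stronger privacy) or larger sensitivity means a larger noise scale, which is exactly the utility loss that combining public and private data aims to reduce.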
