High performance logistic regression for privacy-preserving genome analysis

Martine De Cock, Rafael Dowsley, Anderson C A Nascimento, Davis Railsback, Jianwei Shen, Ariel Todoki

doi:10.1186/s12920-020-00869-9

Open Access

https://doi.org/10.1186/s12920-020-00869-9

Abstract

Background: In biomedical applications, valuable data is often split between owners who cannot openly share it because of privacy regulations and concerns. Training machine learning models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from machine learning and cryptography. When machine learning models are trained collaboratively with the cryptographic technique known as secure multi-party computation, the price paid for keeping the owners' data private is an increase in computational cost and runtime. A careful choice of machine learning techniques, together with algorithmic and implementation optimizations, is necessary to make secure machine learning over distributed data sets practical. Such optimizations can be tailored to the kind of data and machine learning problem at hand.

Methods: Our setup involves secure two-party computation protocols, along with a trusted initializer that distributes correlated randomness to the two computing parties. We use a gradient-descent-based algorithm to train a logistic-regression-like model with a clipped ReLU activation function, and we break the algorithm down into corresponding cryptographic protocols. Our main contributions are a new protocol for computing the activation function that requires neither secure comparison protocols nor Yao’s garbled circuits, and a series of cryptographic engineering optimizations that improve performance.

Results: For our largest gene expression data set, we train a model that requires over 7 billion secure multiplications; the training completes in about 26.90 s in a local area network. The implementation in this work is a further optimized version of the implementation with which we won first place in Track 4 of the iDASH 2019 secure genome analysis competition.

Conclusions: In this paper, we present a secure logistic regression training protocol and its implementation, with a new subprotocol to securely compute the activation function. To the best of our knowledge, we present the fastest existing secure multi-party computation implementation for training logistic regression models on high-dimensional genome data distributed across a local area network.
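
As a rough illustration of the training algorithm described under Methods, the following plaintext sketch runs batch gradient descent on a logistic-regression-like model whose sigmoid is replaced by a clipped ReLU. It involves no cryptography and is not the authors' implementation. The exact form of the activation (0 below -0.5, z + 0.5 in between, 1 above 0.5), the learning rate, the epoch count, and the toy data are all assumptions made for this example.

```python
def clipped_relu(z):
    """Assumed activation: 0 for z < -0.5, z + 0.5 in between, 1 for z > 0.5."""
    return min(1.0, max(0.0, z + 0.5))


def train(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent using the logistic regression update rule,
    with clipped_relu standing in for the sigmoid (hypothetical settings)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = clipped_relu(z) - yi  # prediction error for this sample
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict(w, b, x):
    """Classify as 1 when the activation reaches at least 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if clipped_relu(z) >= 0.5 else 0


# Toy one-feature data set (made up for illustration only).
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
w, b = train(X, y)
predictions = [predict(w, b, x) for x in X]
```

In a secure version, each multiplication and each activation evaluation in this loop would be carried out on secret-shared values, which is why the cost of the activation protocol matters so much for overall runtime.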

Highlights

In biomedical applications, valuable data is often split between owners who cannot openly share the data because of privacy regulations and concerns.
For our largest data set, we train a model that requires over 7 billion secure multiplications, and the training completes in about 26.9 s in a local area network (LAN).
The genes in the GSE2034 data set are not labeled in a way that lets us map them to the 76-gene signature, so we cannot test accuracy with a reduced number of features. We do test the runtime of training on 76 attributes and obtain an average of 6.71 s, about 20 s less than training on all 12,634 features.

Summary

Introduction

Valuable data is often split between owners who cannot openly share it because of privacy regulations and concerns. Training machine learning (ML) models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from machine learning and cryptography. When ML models are trained collaboratively with the cryptographic technique known as secure multi-party computation (MPC), the price paid for keeping the owners' data private is an increase in computational cost and runtime. A careful choice of machine learning techniques, together with algorithmic and implementation optimizations, is necessary to make secure machine learning over distributed data sets practical. Such optimizations can be tailored to the kind of data and machine learning problem at hand. The iDASH competition is a yearly international competition in which participants create and implement privacy-preserving protocols for applications with genomic data. The 2019 edition had four tracks; Track 4 invited participants to design MPC solutions for collaboratively training ML models on data originating from multiple data owners.
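
To make the MPC setting more concrete, the sketch below shows one standard way two parties can multiply secret-shared values using correlated randomness from a trusted initializer. It assumes additive secret sharing over the ring of integers modulo 2^64 and a multiplication (Beaver) triple as the form of correlated randomness. Both are assumptions for illustration, as are all function names. The two parties are simulated in one process, so this shows the arithmetic only, not the networking or the authors' optimized protocol.

```python
import secrets

MOD = 2 ** 64  # assumed ring size for this sketch


def share(x):
    """Split x into two additive shares that sum to x modulo MOD."""
    s0 = secrets.randbelow(MOD)
    return s0, (x - s0) % MOD


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % MOD


def trusted_initializer_triple():
    """Trusted initializer: sample a, b, set c = a*b, and hand out shares."""
    a = secrets.randbelow(MOD)
    b = secrets.randbelow(MOD)
    c = (a * b) % MOD
    return share(a), share(b), share(c)


def secure_multiply(x_sh, y_sh):
    """Return shares of x*y given shares of x and y (two simulated parties)."""
    (a0, a1), (b0, b1), (c0, c1) = trusted_initializer_triple()
    # Each party masks its shares locally; the masked values d = x - a and
    # e = y - b are opened. They reveal nothing because a and b are random.
    d = reconstruct((x_sh[0] - a0) % MOD, (x_sh[1] - a1) % MOD)
    e = reconstruct((y_sh[0] - b0) % MOD, (y_sh[1] - b1) % MOD)
    # x*y = c + d*b + e*a + d*e; only party 0 adds the public term d*e.
    z0 = (c0 + d * b0 + e * a0 + d * e) % MOD
    z1 = (c1 + d * b1 + e * a1) % MOD
    return z0, z1


product = reconstruct(*secure_multiply(share(6), share(7)))
negative_product = reconstruct(*secure_multiply(share(-3), share(5)))
```

Negative numbers are represented by their residues modulo 2^64, so `negative_product` equals `(-15) % MOD`. Real-valued data such as gene expression levels would additionally need a fixed-point encoding, which this sketch leaves out.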

Journal: BMC Medical Genomics
Publication Date: Jan 20, 2021
Citations: 31
License type: open-access
