Extremely Randomized Trees With Privacy Preservation for Distributed Structured Health Data

Amin Aminifar, Fazle Rabbi, Yngve Lamo, Matin Shokri, Violet Ka I. Pun

doi:10.1109/access.2022.3141709

Abstract

Artificial intelligence and machine learning have recently attracted considerable attention in the healthcare domain. The data used by machine learning algorithms in healthcare applications is often distributed over multiple sources, for instance, hospitals or patients’ personal devices. One main difficulty lies in analyzing such data without compromising patients’ privacy and personal data, which is a primary concern in healthcare applications. Therefore, in these applications, we are interested in running machine learning algorithms over distributed data without disclosing sensitive information about the data subjects. In this paper, we propose a distributed extremely randomized trees algorithm for learning from distributed data with privacy preservation. We present the implementation of our technique (which we refer to as $k$-PPD-ERT) on a cloud platform and demonstrate its performance based on medical data, including Heart Disease, Breast Cancer, and mental health datasets (Depresjon and Psykose datasets) associated with the Norwegian INTROducing Mental health through Adaptive Technology (INTROMAT) project.

Highlights

Artificial intelligence (AI) and automated decision-making have the potential to improve accuracy and efficiency in healthcare applications
In our preliminary study [76], we have considered the problem of privacy-preserving machine learning using the extremely randomized trees algorithm, which is only robust to two colluding parties
In the BACKGROUND section, we present a brief overview of the extremely randomized trees (ERT) algorithm and secure multi-party computation (SMC), which provide the basis for our privacy-preserving distributed machine learning framework

Summary

INTRODUCTION

Artificial intelligence (AI) and automated decision-making have the potential to improve accuracy and efficiency in healthcare applications. Previous studies consider cryptographic techniques and secure multi-party computation methods for conducting privacy-preserving data mining [23]–[25]. We build upon our previous work [28] and propose a scalable privacy-preserving framework for distributed machine learning based on the extremely randomized trees algorithm, which has a linear overhead in the number of parties and can handle missing values. We use two popular publicly available healthcare datasets for performance evaluation, i.e., the Heart Disease [29] and the Breast Cancer Wisconsin (Diagnostic) [30] datasets. This data represents medical applications where missing values are present, and our algorithm is designed to handle such scenarios.
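To make the idea of secure multi-party computation more concrete, the following is a minimal sketch of a generic secure sum based on additive secret sharing, in which several data holders learn the total of their private counts without revealing any individual count. It is an illustration only and is not taken from the paper: the names `make_shares`, `secure_sum`, and `MODULUS`, as well as the example counts, are hypothetical. In this simple scheme every party sends one share to every other party, so it does not reproduce the linear overhead reported for the proposed framework.

```python
import secrets

# Hypothetical modulus; it only needs to exceed any total that can occur.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it mod `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares add up to the value.
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate a secure sum over the private values of several parties."""
    n = len(private_values)
    # Party i splits its value and sends share j to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds the shares it received and publishes only this partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums reveal the total, not the individual inputs.
    return sum(partial_sums) % modulus


# Example: three data holders with local class counts (made-up numbers).
local_counts = [12, 7, 30]
total = secure_sum(local_counts)
```

Each individual share is uniformly random, so a party that sees only some of the shares of another party's value learns nothing about that value. In a tree-learning setting, a sum of this kind could be used to aggregate per-class counts from the data holders, under the assumption that the parties follow the protocol honestly.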

STATE OF THE ART

BACKGROUND

Result

SECURE AGGREGATION OF RESULTS FROM DATA-HOLDER PARTIES

ILLUSTRATIVE EXAMPLE

EVALUATION AND DISCUSSION

CONCLUSION