Blockchain for Privacy Preserving and Trustworthy Distributed Machine Learning in Multicentric Medical Imaging (C-DistriM)

Fadila Zerka, Hanif Gabrani-Juma, Philippe Lambin, Akshayaa Vaidyanathan, Benjamin Miraglio, Visara Urovi, Michel Dumontier, Henry C Woodruff, Sean Walsh, Ralph T H Leijenaar, Samir Barakat

doi:10.1109/access.2020.3029445

Abstract

The utility of Artificial Intelligence (AI) in healthcare strongly depends upon the quality of the data used to build models and the confidence in the predictions they generate. Access to sufficient amounts of high-quality data to build accurate and reliable models remains problematic owing to substantive legal and ethical constraints on making clinically relevant research data available offsite. New technologies such as distributed learning offer a pathway forward, but unfortunately tend to suffer from a lack of transparency, which undermines trust in what data are used for the analysis. To address such issues, we hypothesized that a novel distributed learning approach combining sequential distributed learning with a blockchain-based platform, namely Chained Distributed Machine learning (C-DistriM), would be feasible and would give results similar to those of a standard centralized approach. C-DistriM enables health centers to dynamically participate in training distributed learning models. We demonstrate C-DistriM using the NSCLC-Radiomics open data to predict two-year lung-cancer survival. A comparison of the performance of this distributed solution, evaluated in six different scenarios, with that of the centralized approach showed no statistically significant difference in AUC between the central and distributed models: all DeLong tests yielded $p > 0.05$. This methodology removes the need to blindly trust the computation in one specific server on a distributed learning network. This fusion of blockchain and distributed learning serves as a proof of concept for increasing transparency and trust and, ultimately, accelerating the adoption of AI in multicentric studies. We conclude that our blockchain-based model for sequential training on distributed datasets is a feasible approach that provides performance equivalent to the centralized approach.
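
To make the idea of sequential training with a tamper-evident record concrete, the following is a minimal sketch and not the C-DistriM implementation. It assumes a plain logistic-regression model, toy data, and a simple SHA-256 hash-linked log standing in for a real blockchain platform (no consensus, smart contracts, or networking). All function names (`local_update`, `make_block`, `verify_chain`, `train_sequentially`) and the center names are hypothetical.

```python
import hashlib
import json
import math

GENESIS = "0" * 64  # assumed placeholder for the "previous hash" of the first entry


def local_update(weights, data, lr=0.1, epochs=20):
    """One center's pass: logistic-regression SGD using only its local data."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w


def _digest(obj):
    """SHA-256 of a canonical JSON encoding."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def make_block(prev_hash, center, weights):
    """Log entry: who trained, a digest of the resulting weights, link to the previous entry."""
    body = {"prev": prev_hash, "center": center, "weights_digest": _digest(weights)}
    body["hash"] = _digest(body)  # computed before the "hash" key is added
    return body


def verify_chain(chain):
    """Recompute every hash and check each entry links to its predecessor."""
    prev = GENESIS
    for block in chain:
        body = {k: v for k, v in block.items() if k != "hash"}
        if block["prev"] != prev or block["hash"] != _digest(body):
            return False
        prev = block["hash"]
    return True


def train_sequentially(centers, n_features):
    """Pass the weights from center to center; only weights leave a center, never data."""
    weights, chain, prev = [0.0] * n_features, [], GENESIS
    for name, data in centers:
        weights = local_update(weights, data)
        block = make_block(prev, name, weights)
        chain.append(block)
        prev = block["hash"]
    return weights, chain


# Toy data (hypothetical): each sample is ([bias, feature], label).
centers = [
    ("center_A", [([1.0, 0.2], 0), ([1.0, 0.9], 1)]),
    ("center_B", [([1.0, 0.1], 0), ([1.0, 0.8], 1)]),
]
weights, chain = train_sequentially(centers, n_features=2)
```

In this sketch, any participant can call `verify_chain(chain)` to confirm that no recorded training step was altered after the fact, which is one simple way to avoid having to blindly trust a single server.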

Full Text