Analyzing the Impact of Personalization on Fairness in Federated Learning for Healthcare.

Yanmin Gong, Kim-Kwang Raymond Choo, Kai Zhang, Yuanxiong Guo, Tongnian Wang, Jiannan Cai

doi:10.1007/s41666-024-00164-7

Abstract

As machine learning (ML) becomes more widely used in the healthcare sector, concerns are also growing about potential biases and privacy risks. One countermeasure is to use federated learning (FL), which supports collaborative learning without requiring patient data to be shared across organizations. However, the inherent heterogeneity of data distributions among participating FL parties poses challenges for exploring group fairness in FL. While personalization within FL can mitigate the performance degradation caused by data heterogeneity, its influence on group fairness has not been fully investigated. Therefore, the primary focus of this study is to rigorously assess the impact of personalized FL on group fairness in the healthcare domain, offering a comprehensive understanding of how personalized FL affects group fairness in clinical outcomes. We conduct an empirical analysis using two prominent real-world Electronic Health Records (EHR) datasets, namely eICU and MIMIC-IV. Our methodology involves a thorough comparison between personalized FL and two baselines: standalone training, where models are developed independently without FL collaboration, and standard FL, which aims to learn a global model via the FedAvg algorithm. We adopt Ditto as our personalized FL approach, which enables each client in FL to develop its own personalized model through multi-task learning. We evaluate these methods by comparing their predictive performance (i.e., AUROC and AUPRC) and fairness gaps (i.e., EOPP, EOD, and DP). Personalized FL demonstrates superior predictive accuracy and fairness over standalone training across both datasets. Nevertheless, in comparison with standard FL, personalized FL shows improved predictive accuracy but does not consistently offer better fairness outcomes.
For instance, in the 24-h in-hospital mortality prediction task, personalized FL achieves an average EOD of 27.4% across racial groups in the eICU dataset and 47.8% in MIMIC-IV. In comparison, standard FL records a better EOD of 26.2% for eICU and 42.0% for MIMIC-IV, while standalone training yields significantly worse EOD of 69.4% and 54.7% on these datasets, respectively. Our analysis reveals that personalized FL has the potential to enhance fairness in comparison to standalone training, yet it does not consistently ensure fairness improvements compared to standard FL. Our findings also show that while personalization can improve fairness for more biased hospitals (i.e., hospitals having larger fairness gaps in standalone training), it can exacerbate fairness issues for less biased ones. These insights suggest that the integration of personalized FL with additional strategic designs could be key to simultaneously boosting prediction accuracy and reducing fairness disparities. The findings and opportunities outlined in this paper can inform the research agenda for future studies, to overcome the limitations and further advance health equity research.
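To make the fairness gaps named in the abstract concrete, the following is a minimal Python sketch of how EOPP, EOD, and DP could be computed from binary predictions. The abstract does not define these metrics, so the definitions here are assumptions based on common usage: EOPP as the gap in true positive rate between groups, EOD as the larger of the true positive rate and false positive rate gaps, and DP as the gap in positive prediction rate, each taken as the maximum minus the minimum across groups. The paper reports averages across racial groups, so its exact aggregation may differ. The function names `group_rates` and `fairness_gaps` are hypothetical.

```python
import numpy as np


def group_rates(y_true, y_pred, mask):
    """Return (TPR, FPR, positive prediction rate) for the samples selected by mask."""
    yt, yp = y_true[mask], y_pred[mask]
    pos, neg = yt == 1, yt == 0
    tpr = yp[pos].mean() if pos.any() else np.nan  # true positive rate
    fpr = yp[neg].mean() if neg.any() else np.nan  # false positive rate
    ppr = yp.mean() if mask.any() else np.nan      # share predicted positive
    return tpr, fpr, ppr


def fairness_gaps(y_true, y_pred, groups):
    """Compute assumed EOPP, EOD, and DP gaps across demographic groups.

    Assumption: each gap is max minus min of the per-group rate, and EOD is
    the larger of the TPR gap and the FPR gap. Other conventions exist.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)

    rates = np.array(
        [group_rates(y_true, y_pred, groups == g) for g in np.unique(groups)]
    )
    tpr, fpr, ppr = rates[:, 0], rates[:, 1], rates[:, 2]

    def gap(r):
        # Ignore groups where the rate is undefined (no samples of that kind).
        return float(np.nanmax(r) - np.nanmin(r))

    return {
        "EOPP": gap(tpr),
        "EOD": max(gap(tpr), gap(fpr)),
        "DP": gap(ppr),
    }


if __name__ == "__main__":
    # Toy labels and thresholded predictions for two hypothetical groups.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(fairness_gaps(y_true, y_pred, groups))
```

Under these assumed definitions, a gap of 0 means all groups receive the same rate, and larger values indicate a wider disparity. This is consistent with the abstract treating a lower EOD as better.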

Full Text