PPPCT: Privacy-Preserving framework for Parallel Clustering Transcriptomics data

Ali Abbasi Tadi,Dima Alhadidi,Luis Rueda

doi:10.1016/j.compbiomed.2024.108351

Abstract

Single-cell transcriptomics data provides crucial insights into patients’ health, yet poses significant privacy concerns. Genomic data privacy attacks can have deep implications, encompassing not only the patients’ health information but also extending widely to compromise their families’. Moreover, the permanence of leaked data exacerbates the challenges, making retraction an impossibility. While extensive efforts have been directed towards clustering single-cell transcriptomics data, addressing critical challenges, especially in the realm of privacy, remains pivotal. This paper introduces an efficient, fast, privacy-preserving approach for clustering single-cell RNA-sequencing (scRNA-seq) datasets. The key contributions include ensuring data privacy, achieving high-quality clustering, accommodating the high dimensionality inherent in the datasets, and maintaining reasonable computation time for big-scale datasets. Our proposed approach utilizes the map-reduce scheme to parallelize clustering, addressing intensive calculation challenges. Intel Software Guard eXtension (SGX) processors are used to ensure the security of sensitive code and data during processing. Additionally, the approach incorporates a logarithm transformation as a preprocessing step, employs non-negative matrix factorization for dimensionality reduction, and utilizes parallel k-means for clustering. The approach fully leverages the computing capabilities of all processing resources within a secure private cloud environment. Experimental results demonstrate the efficacy of our approach in preserving patient privacy while surpassing state-of-the-art methods in both clustering quality and computation time. Our method consistently achieves a minimum of 7% higher Adjusted Rand Index (ARI) than existing approaches, contingent on dataset size. Additionally, due to parallel computations and dimensionality reduction, our approach exhibits efficiency, converging to very good results in less than 10 seconds for a scRNA-seq dataset with 5000 genes and 6000 cells when prioritizing privacy and under two seconds without privacy considerations.Availability and implementationCode and datasets availability: https://github.com/University-of-Windsor/PPPCT.

Full Text