Applying traditional clustering techniques to big data in the cloud while preserving data privacy is challenging because each iteration requires division and exponentiation operations, which are difficult to implement on encrypted data. Several existing approaches approximate the formulas for centers, weights, and memberships as three polynomial functions using the multivariate Taylor formula. However, they usually suffer from increased complexity and a slight drop in accuracy. In this paper, a novel Privacy-Preserving semi-fuzzy clustering algorithm based on the possibilistic paradigm, termed PPS-FPCM, is presented. Its main feature is that it avoids exponentiation and division operations in each iteration without losing accuracy. The first key idea is to restrict the typicality to an ordered set of discrete values between zero and one chosen by the data owner (DO), which simplifies the computation. The second key idea is to use this soft typicality to detect outliers and compute the corresponding semi-fuzzy memberships, which are used to increase the inter-cluster distance. However, computing the initial typicality requires a magnitude comparison, which is still difficult to perform over encrypted data. In this study, we show how the existing incomplete re-encryption method can be used to tackle this problem. In each iteration, centers and distances to the new centers are computed on a calculator cloud server (CaCS), which is responsible for storing and processing the ciphertexts of the DO's data. Then, CaCS sends to the comparator cloud server (CoCS) the incompletely re-encrypted differences between these distances and the iteratively updated bin values, which correspond to the discrete possibilistic memberships initially decided by the DO. CoCS decrypts the differences and returns the results of the comparisons.
When CaCS receives a comparison result from CoCS, it either decides on the appropriate soft typicality or sends the difference between the same distance and another bin value. The required number of comparisons is O(log B), where B is the number of bins. CaCS iteratively computes the corresponding semi-fuzzy memberships, computes the refined memberships, and updates the centers. In the end, CaCS sends the final soft memberships and centers to the DO. The proposed algorithm is applicable to both normal data and homomorphically encrypted data and is more effective than several related algorithms. It can produce accurate results using a sufficiently large number (16 or more) of discrete values, with a large reduction in runtime, because the comparisons are much less costly than exponentiation and division operations, at the expense of added communication between CaCS and CoCS.
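The logarithmic comparison count can be illustrated with a minimal Python sketch of a binary search over ordered bin values, in which one party sees only the sign of a difference. This is an illustration under stated assumptions, not the paper's implementation: no encryption or re-encryption is performed, the bin values are fixed rather than iteratively updated, and the names `make_comparator`, `find_bin`, `typicality_for`, the bin edges, and the linearly spaced typicality levels are all hypothetical.

```python
from typing import Callable, List, Sequence


def make_comparator(counter: List[int]) -> Callable[[float], bool]:
    """Stand-in for CoCS (assumption: plaintext, no decryption step).

    It receives a difference and reveals only whether it is positive.
    counter[0] records how many comparisons were requested.
    """
    def is_positive(diff: float) -> bool:
        counter[0] += 1
        return diff > 0
    return is_positive


def find_bin(distance: float,
             bin_edges: Sequence[float],
             is_positive: Callable[[float], bool]) -> int:
    """Stand-in for CaCS: binary search over ascending bin edges.

    Returns how many edges lie strictly below `distance`, using only
    the sign of (distance - edge) reported by the comparator.
    """
    lo, hi = 0, len(bin_edges)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_positive(distance - bin_edges[mid]):
            lo = mid + 1   # distance is above this edge: search right
        else:
            hi = mid       # distance is at or below this edge: search left
    return lo


def typicality_for(distance: float,
                   bin_edges: Sequence[float],
                   levels: Sequence[float],
                   is_positive: Callable[[float], bool]) -> float:
    """Map a distance to one of len(bin_edges) + 1 discrete levels."""
    if len(levels) != len(bin_edges) + 1:
        raise ValueError("need exactly one more level than bin edges")
    return levels[find_bin(distance, bin_edges, is_positive)]


# Hypothetical setup: 15 edges give 16 discrete typicality values,
# decreasing from 1 (closest bin) to 0 (farthest bin).
edges = [float(k) for k in range(1, 16)]
levels = [1 - k / 15 for k in range(16)]

calls = [0]
oracle = make_comparator(calls)
t = typicality_for(4.5, edges, levels, oracle)
print(f"typicality={t:.3f}, comparisons={calls[0]}")
```

With 15 edges the search interval shrinks from 15 to 7 to 3 to 1 candidates, so each distance in this sketch costs four comparator calls instead of one call per bin.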