Abstract

Cloud-based enterprise search services (e.g., Amazon Kendra) appeal to big data owners by offering convenient search solutions over their enterprise datasets. However, individuals and businesses dealing with confidential big data (e.g., criminal reports) are reluctant to fully embrace such cloud services because of valid data privacy concerns. Solutions based on client-side encryption have been developed to mitigate these concerns. Nonetheless, such solutions hinder data processing, especially data clustering, which is pivotal in applications such as real-time search over large corpora. To cluster encrypted big data, we propose privacy-preserving clustering schemes, called ClusPr, for three forms of unstructured datasets, namely static, semi-dynamic, and dynamic. ClusPr uses statistical characteristics of the datasets to: (A) determine the suitable number of clusters; (B) populate the clusters with topically relevant tokens; and (C) adapt the cluster set based on the dynamism of the underlying dataset. Experimental results, obtained by evaluating ClusPr against other schemes in the literature on three different test datasets, demonstrate between 30% and 60% improvement in cluster coherency. Moreover, we observe that employing ClusPr within a privacy-preserving enterprise search system can reduce the search time by up to 78%, while improving the search accuracy by up to 35%.

