Abstract

Since the 1990s, keyword-based search engines have been the only way for people to locate relevant web content, using a simple query of one to a few keywords. These free or paid services operate by storing users' search queries and preferences for personal profiling and targeted ad delivery, while articles that users upload for plagiarism detection can additionally be stored in service providers' expanding databases for profit. In short, users have never had the option of searching the web without revealing their queries, some of which can be sensitive, to search engine providers. Here we demonstrate that an internet search that takes an entire article as the query can be carried out correctly without revealing the query content, by combining an irreversible encoding scheme with an efficient FM-index search routine of the kind generally used in next-generation sequencing (NGS) of human genomes. In our solution, Sapiens Aperio Veritas Engine (S.A.V.E.), every word in the query is encoded on the user's local machine into one of 12 "amino acids", and the encoded words together constitute a pseudo-biological sequence (PBS). For PBS-mediated plagiarism detection, users submit locally encoded PBSs to our cloud service, which locates identical duplicates in the collected web content; this collection currently includes all English and Chinese Wikipedia pages and Open Access journal articles as of April 2021, encoded in the same way as the query. We find that PBSs with a length longer than 12, comprising a combination of more than 12 "amino acids", return correct results with a false positive rate <0.8%. S.A.V.E. runs at a genome-mapping speed similar to that of Bowtie and is >5 orders of magnitude faster than BLAST. Functioning in both regular and in-private search modes, S.A.V.E. provides a new option for efficient internet search and plagiarism detection in a compressed search space where users' confidential content is never revealed.
We hope the reported algorithm and implementation can introduce a new paradigm for future privacy-aware search engines. S.A.V.E. is currently running at https://dyn.life.nthu.edu.tw/SAVE/.
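
The two ideas described above, a lossy word-to-symbol encoding and an FM-index lookup over the encoded corpus, can be illustrated with a minimal Python sketch. It is not the S.A.V.E. implementation: the abstract does not specify how a word is mapped to one of the 12 "amino acids", so the SHA-256-modulo-12 mapping, the symbol set `ALPHABET`, whitespace tokenization, and the names `encode_word`, `encode_text`, and `FMIndex` are all assumptions made for illustration. The index is built with a naive suffix sort, which is only suitable for small inputs.

```python
import hashlib

# Hypothetical 12-symbol alphabet; the real symbol set is not given in the abstract.
ALPHABET = "ACDEFGHIKLMN"


def encode_word(word: str) -> str:
    """Map a word to one of 12 symbols (assumed scheme: SHA-256 modulo 12).

    Many words share each symbol, so the word cannot be recovered from it.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return ALPHABET[int.from_bytes(digest, "big") % len(ALPHABET)]


def encode_text(text: str) -> str:
    """Encode whitespace-separated words into a pseudo-biological sequence."""
    return "".join(encode_word(w) for w in text.split())


class FMIndex:
    """Tiny FM-index supporting exact-match counting via backward search."""

    def __init__(self, text: str):
        s = text + "$"  # sentinel; '$' sorts before every letter in ALPHABET
        self.n = len(s)
        # Naive suffix array construction (fine for a demo, slow for real corpora).
        sa = sorted(range(self.n), key=lambda i: s[i:])
        bwt = "".join(s[i - 1] for i in sa)  # Burrows-Wheeler transform
        symbols = sorted(set(s))
        # first[c]: number of characters in s that sort strictly before c.
        self.first = {}
        total = 0
        for c in symbols:
            self.first[c] = total
            total += s.count(c)
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {c: [0] for c in symbols}
        for ch in bwt:
            for c in symbols:
                self.occ[c].append(self.occ[c][-1] + (ch == c))

    def count(self, pattern: str) -> int:
        """Return how many times pattern occurs in the indexed text."""
        lo, hi = 0, self.n  # half-open range of matching suffix-array rows
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


# Usage: the corpus is encoded once on the server side; the query is encoded
# locally, so only the encoded sequence would ever be sent for matching.
document = "the quick brown fox jumps over the lazy dog near the river bank"
index = FMIndex(encode_text(document))
query = encode_text("jumps over the lazy dog")
hits = index.count(query)
```

Because each word collapses to one of only 12 symbols, short encoded queries will match unrelated text by chance, which is consistent with the abstract's observation that longer PBSs are needed for reliable results.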
