Sentence embeddings represent the meaning of a given sentence using a fixed-dimensional vector. Various approaches have been proposed in the Natural Language Processing (NLP) community for learning encoders that produce accurate sentence embeddings for diverse downstream tasks requiring sentence representations. While this prior work has focused mainly on creating accurate sentence embeddings, how to protect the privacy of the sensitive information contained in the sentences remains an unexplored research problem. In this paper, we propose Covering Metric Analytic Gaussian (CMAG), a covering-metric Differential Privacy (DP) mechanism for sentence embeddings, such that minimal random noise is added to a set of sentence embeddings produced by an encoder to protect the private information expressed in those sentences. Given a sentence embedding s, CMAG considers the Mahalanobis distance between s and the other sentence embeddings s′ in the local neighbourhood of s to determine the minimal amount of random noise that must be added to s to obtain provable metric DP guarantees. Experimental results show that the proposed DP mechanism protects private information better than previously proposed DP mechanisms, while achieving good performance on a broad range of downstream NLP tasks.
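To illustrate the general idea of calibrating noise to the local neighbourhood of an embedding, the following Python sketch perturbs a single sentence embedding with Gaussian noise shaped by the covariance of its k nearest neighbours, in the spirit of a Mahalanobis-based calibration. This is only an assumed illustration of the approach described above, not the paper's actual mechanism; the neighbourhood size k, the sensitivity estimate, and the (epsilon, delta) noise scale are placeholder choices.

```python
import numpy as np

def mahalanobis_noise_sketch(embeddings, target_idx, k=20, epsilon=1.0, delta=1e-5):
    """Hedged sketch: perturb one sentence embedding using Gaussian noise shaped
    by the covariance of its k nearest neighbours. All parameters here are
    illustrative assumptions, not values or formulas taken from the paper."""
    s = embeddings[target_idx]
    d = s.shape[0]

    # Find the k nearest neighbours of s (Euclidean distance), skipping s itself.
    dists = np.linalg.norm(embeddings - s, axis=1)
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbours = embeddings[neighbour_idx]

    # Local covariance describes the shape of the neighbourhood; a small ridge
    # term keeps it positive definite for the Cholesky factorisation below.
    cov = np.cov(neighbours, rowvar=False) + 1e-6 * np.eye(d)

    # Gaussian-mechanism-style noise scale, using the farthest neighbour distance
    # as a stand-in for sensitivity (an assumption for this sketch).
    sensitivity = dists[neighbour_idx].max()
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Sample correlated Gaussian noise shaped by the (trace-normalised) local
    # covariance, so directions with more local spread receive more noise.
    L = np.linalg.cholesky(cov / np.trace(cov) * d)
    noise = sigma * (L @ np.random.standard_normal(d))
    return s + noise
```

In this sketch, the only design point carried over from the abstract is that the noise added to s is governed by how s relates to its local neighbourhood rather than by a single global scale.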