Given a set of local sequential datasets held by multiple parties, we study the problem of publishing a synthetic dataset that preserves approximate sequentiality information of the integrated dataset while satisfying differential privacy for each local dataset. The existing solutions for publishing differentially private sequential data in the centralized setting mostly adopt tree-based approaches. Such approaches rely on different tree structures that encode sequential data's statistical information. The construction of a tree structure is normally done by recursively splitting nodes whose noisy <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">scores</i> (e.g., entropy or count) are larger than a given threshold. However, extending similar ideas to the multi-party setting is challenging. First, the comparison between noisy scores and a given threshold needs to be done in a distributed manner without letting the parties know the noisy scores, while satisfying differential privacy for each local dataset. Second, in the multi-party setting the large number of node splitting decisions incurs prohibitive computation costs. In addressing the above challenges, we present <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">DPST</i> , a distributed prediction suffix tree construction solution. In DPST, we first introduce a novel node splitting decision method that calculates the comparison result under encryption with substantially improved efficiency. Then we present a novel batch-based tree construction approach to reduce computation costs. In order to achieve high parallel performance without incurring any extra communication cost, we introduce the <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">conjunction</i> and <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">slide</i> methods to ensure that each batch contains a stable number of carefully arranged <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">decision tasks</i> . To further reduce communication and computation costs, we propose a prefix-based pre-pruning method to reduce the number of nodes that need to be judged whether to split by an interactive protocol. Extensive experiments on real datasets demonstrate that our DPST solution offers desirable data utility with low computation and communication costs.
Read full abstract