Social Network Signatures: A Framework for Re-Identification in Networked Data

Shawndra Hill, Akash Nagle

doi:10.2139/ssrn.1341394

Abstract

Data on large dynamic social networks, such as telecommunications networks and the Internet, are pervasive. However, representing these networks in a manner that is conducive to efficient large-scale analysis is often a challenge. In this paper, we focus on the analysis task of re-identification. Re-identification in the context of dynamic networks is essentially a matching problem that involves comparing the behavior of networked entities across two time periods. An entity's social network behavior can be represented as a signature. A similarity score that measures the degree of overlap in signatures can be assigned to pairs of entities observed across specified time periods. The score can then be used as an attribute in a predictive model to classify pairs of entities as matching or non-matching. Prior research has reported success in the domains of e-mail alias detection, author attribution, and identifying fraudulent consumers in the telecommunications industry. In this work, we address the question of why we are able to re-identify entities on real-world dynamic networks. Our contribution is two-fold. First, we address the challenge of scale with a framework for matching that does not require pair-wise comparisons to ascertain the similarity scores. We assume a random network structure to estimate performance and show that our estimates are good predictors for simulated networks with different characteristics, including clustering coefficient, average degree, and size, and for different network types such as random, small-world, and scale-free. Second, we show that our method is robust against missing links in the second time period but less tolerant to noise, which is modeled by changes in behavior from the first to the second time period. Using our framework, we provide a performance estimate for prediction on networks based solely on their degree distribution and dynamics.
This work has significant implications for re-identification problems where scale is a challenge as well as when false negatives (e.g., when fraudulent consumers are not labeled as fraudulent) cannot be observed.
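
To make the notions of a signature and an overlap score concrete, the following is a minimal Python sketch. The abstract does not define either one, so everything here is an assumption made for illustration: the signature is taken to be an entity's top-k most frequent contacts, the score is Jaccard overlap, and the names `signature`, `overlap`, `period1`, and `period2` are hypothetical. The sketch compares pairs directly, which is exactly the cost the proposed framework aims to avoid; it only illustrates what a score measures.

```python
from collections import Counter


def signature(edges, entity, k=3):
    """Assumed signature: the set of the k contacts `entity` reaches most often."""
    counts = Counter(dst for src, dst in edges if src == entity)
    return {contact for contact, _ in counts.most_common(k)}


def overlap(sig_a, sig_b):
    """Assumed similarity score: Jaccard overlap of two signatures, in [0, 1]."""
    if not sig_a and not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


# Toy directed interaction lists (source, destination) for two time periods.
period1 = [("a", "x"), ("a", "y"), ("a", "x"), ("b", "z"), ("b", "w")]
period2 = [("a2", "x"), ("a2", "y"), ("b2", "z"), ("b2", "q")]

sig_a = signature(period1, "a")
score_match = overlap(sig_a, signature(period2, "a2"))     # same contacts
score_nonmatch = overlap(sig_a, signature(period2, "b2"))  # disjoint contacts
```

In this toy data, entity `a2` in the second period contacts the same parties as `a` in the first, so its score is high, while `b2` shares no contacts with `a`. A score of this kind is what would be passed to a classifier as an attribute for labeling a pair as matching or non-matching.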

Full Text