Social network integration and analysis using a generalization and probabilistic approach for privacy preservation

Xuning Tang, Christopher C. Yang

doi:10.1186/2190-8532-1-7

Abstract

Social Network Analysis and Mining (SNAM) techniques have drawn significant attention in recent years due to the popularity of online social media. With the advance of Web 2.0 and SNAM techniques, tools for aggregating, sharing, investigating, and visualizing social network data have been widely explored and developed. SNAM is effective in supporting intelligence and law enforcement agencies in identifying suspects and extracting the communication patterns of terrorists or criminals. In our previous work, we have shown how social network analysis and visualization techniques are useful in discovering patterns in terrorist social networks. Owing to the advance of SNAM techniques, relationships among social actors can be visualized explicitly through network structures, and implicit patterns can be discovered automatically. Despite the advance of SNAM, the utility of a social network is highly affected by its completeness. Missing edges or nodes in a social network reduce the utility of the network. For example, SNAM techniques may not be able to detect groups of social actors if some of the relationships among these social actors are not available. Similarly, SNAM techniques may overestimate the distance between two social actors if some intermediate nodes or edges are missing. Unfortunately, it is common for an organization to have only a partial social network due to its limited information sources. In the public safety domain, each law enforcement unit has its own criminal social network, constructed from the data available in its criminal intelligence and crime databases, but this network is only a part of the global criminal social network, which can be obtained by integrating the criminal social networks from all law enforcement units. However, due to privacy policies, law enforcement units are not allowed to share the sensitive information in their social network data.
A naive and yet practical approach is to anonymize the social network data before publishing or sharing it. However, even a modest privacy gain may substantially reduce SNAM utility. It is a challenge to strike a balance between privacy and utility in social network data sharing and integration. In order to share useful information among different organizations without violating privacy policies, while preserving sensitive information, we propose a generalization and probabilistic approach to social network integration in this paper. In particular, we propose generalizing social networks to preserve privacy and integrating the probabilistic models of the shared information for SNAM. To preserve the identity of sensitive nodes in a social network, a simple approach in the literature is to remove all node identities. However, this only allows us to investigate the structural properties of such an anonymized social network, and it makes the integration of multiple anonymized social networks impossible. To strike a balance between privacy and utility, we introduce a social network integration framework which consists of three major steps: (i) constructing generalized sub-graphs, (ii) creating generalized information for sharing, and (iii) social network integration and analysis. We also propose two sub-graph generalization methods, namely edge betweenness based (EBB) and K-nearest neighbor (KNN). We evaluated the effectiveness of these algorithms on the Global Salafi Jihad terrorist social network.

Highlights

Social Network Analysis and Mining (SNAM) techniques have drawn significant attention in recent years due to the popularity of online social media
Problem definition: Given a set of networks $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$ in a distributed setting where each organization $i$ owns its piece $G_i$, and assuming the complete network $G$ ($G = \bigcup_{i=1}^{n} G_i$) is unknown to each individual organization, the goal of this paper is to study how to anonymize each $G_i$ into $G_i'$ so that: 1) the sensitive identities of $G_i$ can be protected; 2) $G_i'$ can be shared with other organizations and the integrated anonymized graph $G'$ ($G' = \bigcup_{i=1}^{n} G_i'$) can be used for SNAM tasks
Evaluation: As discussed before, there is no generic approach for privacy preservation since sensitive information can be defined in various ways
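
The problem definition above treats the complete network as the union of the networks held by individual organizations, and the abstract notes that a partial network may overestimate the distance between two actors. The following minimal Python sketch illustrates only that point, on plain (non-anonymized) data. The function names `integrate` and `distance`, the node labels, and both toy edge lists are hypothetical and are not taken from the paper; the sketch does not implement the proposed generalization or probabilistic integration.

```python
from collections import deque


def integrate(networks):
    """Union of undirected edge sets, i.e. G = G_1 ∪ ... ∪ G_n.

    Each network is an iterable of 2-tuples (u, v) with u != v.
    """
    merged = set()
    for edges in networks:
        # frozenset makes (u, v) and (v, u) the same undirected edge
        merged |= {frozenset(edge) for edge in edges}
    return merged


def distance(edges, source, target):
    """Hop count of a shortest path found by BFS, or None if unreachable."""
    adjacency = {}
    for edge in edges:
        u, v = tuple(edge)
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


# Hypothetical partial views held by two organizations.
g1 = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
g2 = [("b", "x"), ("x", "e")]

complete = integrate([g1, g2])

print(distance(g1, "a", "e"))        # distance seen by organization 1 alone
print(distance(g2, "a", "e"))        # organization 2 does not know actor "a"
print(distance(complete, "a", "e"))  # distance in the integrated network
```

In this toy case, organization 1 alone sees a longer path between the two actors than the integrated network contains, and organization 2 alone cannot connect them at all. Sharing raw edges as done here is exactly what privacy policies forbid, which is the gap the generalized, probabilistic integration is meant to fill.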