Abstract
For a broad range of research and practical applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As global social networks (e.g., Facebook and Twitter) are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many difficult engineering problems, which has lead some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present an approach for analyzing full social networks on a standard laptop, allowing for interactive exploration of the communities in the locality of a set of user specified query vertices. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates the edge weights between vertices in a derived graph. Local communities can be constructed by selecting vertices that are connected to the query vertices with high edge weights in the derived graph. This compression is robust to noise and allows for interactive queries of local communities in real-time, which we define to be less than the average human reaction time of 0.25s. We achieve single-machine real-time performance by compressing the neighborhood of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops. Our method allows exploration of strongly associated regions (i.e., communities) of large graphs in real-time on a laptop. It has been deployed in software that is actively used by social network analysts and offers another channel for media owners to monetize their data, helping them to continue to provide free services that are valued by billions of people globally.
Highlights
Social media data provides a record of global human interactions at a scale that is hitherto unprecedented
In this article we focus on Twitter data because Twitter is the most widely used Digital Social Network (DSN) for academic research and the data is relatively easy to obtain
This work represents a technical advance leading to performance gains that are useful in practice and contains a rigorous evaluation on large social media data sets
Summary
Social media data provides a record of global human interactions at a scale that is hitherto unprecedented. We develop a robust real-time algorithm for community detection for which we focus exclusively on the properties of the graph, i.e., no metadata of the Twitter/Facebook network is required. To solve the second challenge and achieve real-time querying we use the elements of the minhash signatures as the basis to build a Locality Sensitive Hashing (LSH) data structure. The combination of minhashing and LSH allows analysts to enter an account or a set of accounts and in milliseconds receive the set of most related accounts From this set we use the minhash signatures to rapidly construct a weighted graph and apply the WALKTRAP community detection algorithm before visualizing the results [9]. The novel combination of these techniques allows our system to perform robust real-time community detection on a laptop using graphs that exceed 100 million vertices. We show that the approximations implicit in minhashing and LSH minimally degrade performance and allow querying of very large graphs in real-time
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.