Uncovering Active Communities from Directed Graphs on Distributed Spark Frameworks, Case Study: Twitter Data

Veronica S Moertini, Mariskha T Adithia

doi:10.3390/bdcc5040046

Open Access

Abstract

Directed graphs can be prepared from big data containing people’s interaction information. In these graphs, the vertices represent people, while the directed edges denote the interactions among them. The number of interactions within a given interval can be included as an edge attribute: the larger the count, the more frequently the people (vertices) interact with each other. Subgraphs whose edges have a count larger than a threshold value can be created from these graphs, and temporal active communities can then be mined from each of these subgraphs. Apache Spark has been recognized as a fast and scalable framework for processing big data. It provides the DataFrames, GraphFrames, and GraphX APIs, which can be employed for analyzing big graphs. We propose three kinds of active communities, namely, Similar interest communities (SIC), Strong-interacting communities (SC), and Strong-interacting communities with their “inner circle” neighbors (SCIC), along with the algorithms needed to uncover them. The algorithm design and implementation are based on these APIs. We conducted experiments on a Spark cluster of ten machines. The results show that our proposed algorithms are able to uncover active communities from public big graphs as well as from Twitter data collected using Spark structured streaming. In some cases, the algorithms that are based on GraphFrames’ motif finding execute faster.
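The thresholding step described in the abstract can be illustrated with a minimal sketch. It uses plain Python rather than Spark so that it runs anywhere, and it is not the authors' implementation: the edge list, the vertex names, the threshold value, and the helper names `active_subgraph` and `mutual_pairs` are all hypothetical. The second helper looks for pairs linked in both directions, a simple stand-in for the kind of pattern a motif query could express.

```python
# Hypothetical interaction records: (source, target, interaction count in one interval).
edges = [
    ("ann", "bob", 12),
    ("bob", "ann", 9),
    ("ann", "cat", 2),
    ("cat", "dan", 15),
    ("dan", "cat", 1),
    ("bob", "dan", 7),
]


def active_subgraph(edges, threshold):
    """Keep only the edges whose interaction count exceeds the threshold."""
    return [(src, dst, n) for src, dst, n in edges if n > threshold]


def mutual_pairs(edges):
    """Return the unordered vertex pairs that are linked in both directions."""
    directed = {(src, dst) for src, dst, _ in edges}
    pairs = {
        tuple(sorted((a, b)))
        for a, b in directed
        if a != b and (b, a) in directed
    }
    return sorted(pairs)


# Assumed threshold of 5 interactions per interval (illustrative only).
sub = active_subgraph(edges, threshold=5)
pairs = mutual_pairs(sub)
print(sub)
print(pairs)
```

Filtering before pattern matching keeps the later step small: only the frequently used edges remain, so a pair counts as mutually active only when both directions pass the threshold.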

Highlights

Community detection is an increasingly popular approach to uncovering important structures in large networks [1,2,3].
All of the experiments discussed below were conducted on a Spark cluster physically located in our laboratories.
We looked for suitable examples among the large network datasets available at https://snap.stanford.edu/data/. There are many groups of datasets, such as social, citation, road, Amazon, and online reviews, among others.

Summary

Introduction

Community detection is an increasingly popular approach to uncovering important structures in large networks [1,2,3]. One of the most common uses for graphs today is to mine social media data in order to identify cliques, recommend new connections, and suggest products and ads. The aim of community detection in graphs is to identify the groups, and possibly their hierarchical organization, by using only the information encoded in the graph topology [2]. This is a classic problem of finding subsets of nodes such that each subset has higher connectivity within itself than the average connectivity of the graph as a whole, and it has appeared in various forms in several other disciplines.
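The connectivity criterion in the last sentence can be made concrete with a small sketch. It assumes one particular measure, directed edge density (edges present divided by ordered vertex pairs possible), which the text above does not prescribe; the toy graph and the helper name `density` are hypothetical.

```python
def density(vertices, edges):
    """Directed edge density: edges inside the vertex set / possible ordered pairs."""
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0
    # Count distinct edges with both endpoints in the set, ignoring self-loops.
    inside = {(a, b) for a, b in edges if a in vs and b in vs and a != b}
    return len(inside) / (n * (n - 1))


# Hypothetical toy graph: vertices 1-3 interact heavily, 4 and 5 hang off the side.
edges = [(1, 2), (2, 1), (2, 3), (3, 1), (1, 3), (3, 4), (4, 5)]
all_vertices = {1, 2, 3, 4, 5}
group = {1, 2, 3}

whole = density(all_vertices, edges)
inner = density(group, edges)
print(f"whole graph: {whole:.2f}, candidate group: {inner:.2f}")
```

Under this measure the candidate group qualifies as a community when its internal density exceeds that of the whole graph. Other definitions of connectivity would give different numbers but follow the same comparison.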

Results

Discussion

Conclusion