Parallelizing SLPA for Scalable Overlapping Community Detection

Konstantin Kuzmin, Mingming Chen, Boleslaw K. Szymanski

doi:10.1155/2015/461362

Abstract

Communities in networks are groups of nodes that are connected more strongly to one another than to the nodes in the rest of the network. Quite often nodes participate in multiple communities; that is, communities can overlap. In this paper, we first analyze what other researchers have done to utilize high performance computing to perform efficient community detection in social, biological, and other networks. We note that detection of overlapping communities is more computationally intensive than disjoint community detection, and that it presents new challenges for algorithm designers. Moreover, the running time of many existing algorithms grows superlinearly with the network size, making them unsuitable for processing large datasets. We use the Speaker-Listener Label Propagation Algorithm (SLPA) as the basis for our parallel overlapping community detection implementation. SLPA provides near-linear-time overlapping community detection and is well suited for parallelization. We explore the benefits of a multithreaded programming paradigm and show that it yields a significant performance gain over sequential execution while preserving the high quality of community detection. The algorithm was tested on four real-world datasets with up to 5.5 million nodes and 170 million edges. To assess the quality of community detection, at least four different metrics were used for each of the datasets.
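
The abstract names SLPA but does not spell out its steps. The following is a minimal sequential sketch of the speaker-listener idea as it is commonly described: each node keeps a memory of labels, speakers send a label drawn from their memory, and listeners store the most popular label they receive. It is not the authors' multithreaded implementation. The function name `slpa`, the adjacency-dictionary input, and the default values of `iterations` and `threshold` are assumptions made for illustration.

```python
import random
from collections import Counter


def slpa(adj, iterations=20, threshold=0.1, seed=0):
    """Sequential SLPA sketch (illustrative, not the paper's code).

    adj maps each node to a list of its neighbours (undirected graph).
    Returns a dict: label -> set of nodes that retained that label.
    """
    rng = random.Random(seed)
    # Each node's memory starts with its own id as the only label.
    memory = {v: Counter({v: 1}) for v in adj}
    nodes = list(adj)

    for _ in range(iterations):
        rng.shuffle(nodes)  # visit listeners in random order
        for listener in nodes:
            if not adj[listener]:
                continue  # isolated nodes never hear anything
            received = Counter()
            for speaker in adj[listener]:
                # Speaker sends one label, chosen with probability
                # proportional to its frequency in the speaker's memory.
                labels = list(memory[speaker].keys())
                weights = list(memory[speaker].values())
                received[rng.choices(labels, weights=weights)[0]] += 1
            # Listener keeps the most popular received label (random tie-break).
            top = max(received.values())
            winner = rng.choice([l for l, c in received.items() if c == top])
            memory[listener][winner] += 1

    # Post-processing: drop labels seen less often than the threshold;
    # nodes sharing a surviving label form one (possibly overlapping) community.
    communities = {}
    for v, mem in memory.items():
        total = sum(mem.values())
        for label, count in mem.items():
            if count / total >= threshold:
                communities.setdefault(label, set()).add(v)
    return communities
```

Because a node may retain more than one label after thresholding, it can appear in several of the returned sets, which is how overlap arises in this sketch. The results depend on the random seed.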

Highlights

Analysis of social, biological, and other networks is a field which attracts significant attention as more and more algorithms and real-world datasets become available
The parallel label propagation (PLP) algorithm is a variation of the sequential label propagation algorithm (LPA) capable of detecting nonoverlapping communities in undirected weighted graphs
We further explore the multithreaded parallel programming paradigm that was used in [21] and test its performance on several real-world networks that range in size from several hundred thousand nodes and a few million edges to almost 5.5 million nodes and close to 170 million edges

Summary

Introduction

Analysis of social, biological, and other networks is a field that attracts significant attention as more and more algorithms and real-world datasets become available. For any given instance of a community detection problem, the total size of the problem is fixed while the number of processors varies to minimize the solution time. This setting is an example of strong scaling. We describe the details of multithreaded community detection on a shared-memory multiprocessor machine, along with the busy-waiting techniques and implicit synchronization used to ensure correct execution. We discuss some of the limitations of the presented solution and briefly describe future work.
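
Strong scaling is usually summarized by two standard quantities: speedup, S(p) = T(1) / T(p), and parallel efficiency, E(p) = S(p) / p, where T(p) is the wall-clock time on p processors for the same fixed input. The short sketch below computes both from a table of timings. The helper name `strong_scaling` and the timings in the usage lines are hypothetical placeholders, not measurements from the paper.

```python
def strong_scaling(times):
    """Compute speedup and efficiency for a fixed-size problem.

    times maps a thread count p to the wall-clock time T(p) in seconds;
    it must contain an entry for p = 1 as the sequential baseline.
    Returns a dict: p -> (speedup, efficiency).
    """
    base = times[1]
    return {p: (base / t, base / t / p) for p, t in sorted(times.items())}


# Made-up timings for illustration only (not results from the paper).
example = strong_scaling({1: 100.0, 2: 50.0, 4: 40.0})
print(example[4])  # (2.5, 0.625)
```

An efficiency close to 1 indicates that adding threads is paying off almost fully. A value that falls as p grows points to overhead such as synchronization or waiting.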

Related Work

Parallel Linear Time Community Detection

The Quality of Community Detection

Findings

Conclusion and Future Work