InfoFlow: A Distributed Algorithm to Detect Communities According to the Map Equation

Park K Fung

doi:10.3390/bdcc3030042

Abstract

Formidably sized networks are becoming more and more common, including in social sciences, biology, neuroscience, and the technology space. Many network sizes are expected to challenge the storage capability of a single physical computer. Here, we take two approaches to handle big networks: first, we look at how big data technology and distributed computing is an exciting approach to big data storage and processing. Second, most networks can be partitioned or labeled into communities, clusters, or modules, thus capturing the crux of the network while reducing detailed information, through the class of algorithms known as community detection. In this paper, we combine these two approaches, developing a distributed community detection algorithm to handle big networks. In particular, the map equation provides a way to identify network communities according to the information flow between nodes, where InfoMap is a greedy algorithm that uses the map equation. We develop discrete mathematics to adapt InfoMap into a distributed computing framework and then further develop the mathematics for a greedy algorithm, InfoFlow, which has logarithmic time complexity, compared to the linear complexity in InfoMap. Benchmark results of graphs up to millions of nodes and hundreds of millions of edges confirm the time complexity improvement, while maintaining community accuracy. Thus, we develop a map equation based community detection algorithm suitable for big network data processing.

Highlights

Sized networks are becoming more and more common, including in social sciences, biology, neuroscience, and the technology space, where the number of nodes and edges may exceed millions or billions
We build on top of the map equation and InfoMap to found the distributed algorithm InfoFlow, which has improved runtime complexity and can be deployed and applied to big datasets
With a view of developing a distributed community detection algorithm, we developed discrete mathematics on the map equation to provide formulae for the modular properties for merged pairwise modules, which enabled the implementation of InfoMap algorithm on distributed computing

Summary

Introduction

Sized networks are becoming more and more common, including in social sciences, biology, neuroscience, and the technology space, where the number of nodes and edges may exceed millions or billions. In such cases, the sheer size of the network presents challenges in the processing, visualizing, understanding, or even storing the network [1,2,3]. The caveat with big data technology is that parallel algorithms have to be designed and implemented in place of the original, serial algorithms [4]. Whilst big data technology provides a means to network storage and processing, for visualization and analytical purposes, smaller sized networks are much favored. Community detection algorithms [5,6,7,8,9,10] have been an active area of research, with ample algorithms to identify network communities

Methods

Results

Conclusion