CASS: A distributed network clustering algorithm based on structure similarity for large-scale network.

Jungrim Kim, Sanghyun Park, Sujin Lee, Chihyun Park, Seokjong Yu, Jaemin Woo, Jeongwoo Kim, Hyerim Kim, Dongmin Seo, Mincheol Shin

doi:10.1371/journal.pone.0203670

Abstract

As networks grow, analyzing large-scale network data is becoming increasingly important, and network clustering algorithms are a useful tool for that analysis. Most active research on network clustering targets single-machine environments rather than parallel ones, and such algorithms cannot analyze large-scale network data because of memory limits. As a solution, we propose a network clustering algorithm for large-scale network data that runs on Apache Spark; it changes the paradigm of the conventional clustering algorithm to improve efficiency in the Spark environment. We also apply optimizations such as a Bloom filter and shuffle selection to reduce memory usage and execution time. Evaluating the proposed algorithm by average normalized cut, we confirmed that it can analyze diverse large-scale network datasets, including biological, co-authorship, internet topology, and social networks. Experimental results show that the proposed algorithm produces more accurate clusters than the comparison algorithms while using less memory. We further verify the effect of the proposed optimizations and the scalability of the algorithm, and we validate that the clusters it finds can represent biologically meaningful functions.
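
The abstract names average normalized cut as the evaluation measure but this excerpt does not give its formula. The sketch below assumes one common definition: for each cluster, the number of edges leaving the cluster divided by the sum of the degrees of its nodes, averaged over all clusters (lower is better). The function name `average_normalized_cut` and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def average_normalized_cut(edges, clusters):
    """Average of cut(C) / vol(C) over clusters (assumed definition).

    edges:    iterable of undirected (u, v) pairs
    clusters: list of disjoint node sets
    """
    # Map every clustered node to the index of its cluster.
    label = {node: i for i, members in enumerate(clusters) for node in members}
    cut = defaultdict(int)  # edges with exactly one endpoint in the cluster
    vol = defaultdict(int)  # sum of node degrees inside the cluster
    for u, v in edges:
        cu, cv = label.get(u), label.get(v)
        if cu is not None:
            vol[cu] += 1
        if cv is not None:
            vol[cv] += 1
        if cu != cv:  # the edge crosses a cluster boundary
            if cu is not None:
                cut[cu] += 1
            if cv is not None:
                cut[cv] += 1
    # Skip clusters with no incident edges to avoid dividing by zero.
    scores = [cut[i] / vol[i] for i in range(len(clusters)) if vol[i] > 0]
    return sum(scores) / len(scores) if scores else 0.0


# Toy graph: two triangles joined by the single edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy_clusters = [{1, 2, 3}, {4, 5, 6}]
print(average_normalized_cut(toy_edges, toy_clusters))
```

In the toy graph each triangle has one outgoing edge and a degree sum of 7, so both clusters score 1/7 and so does the average.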

Highlights

A network is a useful data structure for quickly and efficiently managing data.
Network clustering is an important analysis method because the groups inferred from the clustering results help reveal the biological relationships between nodes in the same cluster.
We propose a new distributed network Clustering Algorithm based on Structure Similarity (CASS) for large-scale networks in the Apache Spark environment.
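
This excerpt does not define "structure similarity". As an illustration only, the sketch below assumes the definition used by SCAN-style clustering algorithms: the number of shared members of two nodes' closed neighborhoods (each node counted as its own neighbor), divided by the geometric mean of the two neighborhood sizes. The helper names are hypothetical, and the code runs on a single machine rather than on Apache Spark.

```python
import math
from collections import defaultdict


def closed_neighborhoods(edges):
    """Map each node to its neighbors plus the node itself."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].update((u, v))
        nbrs[v].update((u, v))
    return nbrs


def structural_similarity(nbrs, u, v):
    """|N[u] & N[v]| / sqrt(|N[u]| * |N[v]|) (assumed, SCAN-style)."""
    # A node absent from the edge list has only itself as neighborhood.
    gu = nbrs.get(u, {u})
    gv = nbrs.get(v, {v})
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))


# Toy graph: two triangles joined by the single edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
nbrs = closed_neighborhoods(toy_edges)
print(structural_similarity(nbrs, 1, 2))  # nodes inside one triangle
print(structural_similarity(nbrs, 3, 4))  # nodes on the bridging edge
```

Under this assumed definition, nodes 1 and 2 share their whole closed neighborhood and score 1.0, while the bridge endpoints 3 and 4 share only each other and score 0.5, which is why a similarity threshold can separate the two triangles.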

Summary

Introduction

A network is a useful data structure for quickly and efficiently managing data. It inherently includes several features that can be analyzed, such as clustering, shortest paths, degree, and propagation. Of these, clustering is widely used to analyze network data in several research areas. For example, networks are used to describe complex relationships between biological entities. Network clustering is an important analysis method because the groups inferred from the clustering results help reveal the biological relationships between nodes in the same cluster. Zhang et al. [1], for example, attempted to identify functional modules in a protein–protein



Journal: PloS one
Publication Date: Oct 10, 2018
Citations: 5
License type: CC BY 4.0


