Abstract
As networks grow in size, analyzing large-scale network data is becoming increasingly important. Network clustering algorithms are useful for such analysis. Most actively researched network clustering algorithms target a single-machine environment rather than a parallel one, so they cannot analyze large-scale network data because of memory limits. As a solution, we propose a network clustering algorithm for large-scale network data analysis using Apache Spark, changing the paradigm of the conventional clustering algorithm to improve its efficiency in the Apache Spark environment. We also apply optimization approaches such as a Bloom filter and shuffle selection to reduce memory usage and execution time. Evaluating the proposed algorithm by average normalized cut, we confirmed that it can analyze diverse large-scale network datasets, including biological, co-authorship, internet topology, and social networks. Experimental results show that the proposed algorithm produces more accurate clusters than the comparative algorithms while using less memory. Furthermore, we confirm the effectiveness of the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that the clusters found by the proposed algorithm can represent biologically meaningful functions.
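The abstract names average normalized cut as the evaluation measure but does not spell it out here. As a minimal sketch, assuming one common definition (for each cluster, the number of edges leaving the cluster divided by the sum of the degrees of its nodes, averaged over all clusters), it could be computed as follows. The function name, the edge-list input format, and the toy graph are illustrative assumptions, not the authors' implementation, and the paper may use a different variant of the measure.

```python
from collections import defaultdict


def average_normalized_cut(edges, clusters):
    """Average over clusters of cut(C) / vol(C) for an undirected, unweighted graph.

    edges    -- iterable of (u, v) pairs, each undirected edge listed once
    clusters -- list of sets of nodes (assumed format, for illustration only)
    """
    edges = list(edges)  # the edge list is scanned once per cluster

    # vol(C) is the sum of node degrees in C, so compute degrees first.
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    scores = []
    for cluster in clusters:
        volume = sum(degree[node] for node in cluster)
        # An edge is cut when exactly one of its endpoints lies in the cluster.
        cut = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
        scores.append(cut / volume if volume else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Toy graph: two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
good = [{1, 2, 3}, {4, 5, 6}]  # splits the graph at the bridge
bad = [{1, 2, 4}, {3, 5, 6}]   # cuts through both triangles

good_score = average_normalized_cut(edges, good)
bad_score = average_normalized_cut(edges, bad)
```

Under this definition a lower value indicates better-separated clusters: the split at the bridge cuts one edge per cluster, while the second split cuts five.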
Highlights
A network is a useful data structure for quickly and efficiently managing data
Network clustering is an important analysis algorithm because the groups inferred from the clustering results make it possible to understand the biological relationships between nodes in the same cluster
We propose a new distributed network Clustering Algorithm based on Structure Similarity (CASS) for large-scale networks in the Apache Spark environment
Summary
A network is a useful data structure for quickly and efficiently managing data. It inherently includes several features that can be analyzed, such as clustering, shortest path, degree, and propagation. Of these, clustering is widely used to analyze network data in several research areas. For example, a network is used to describe complex relationships between biological entities. Network clustering is an important analysis algorithm because the groups inferred from the clustering results make it possible to understand the biological relationships between nodes in the same cluster. Zhang et al. [1], for example, attempted to identify functional modules in a protein–protein