Benefits of Bias in Crawl-Based Network Sampling for Identifying Key Node Set

Sho Tsugawa, Hiroyuki Ohsaki

doi:10.1109/access.2020.2988910

Abstract

We study the problem of identifying a set of key nodes in a network when only limited knowledge of its structure is available. Most studies assume complete knowledge of the given network when identifying a set of key nodes, but in practice, networks of interest are often too large for their entire topological structure to be obtained. When the complete structure of a network is not available, network sampling strategies are often used to obtain a partial structure of the network. We investigate how network sampling strategies affect the problem of identifying a key node set. Specifically, we investigate the effect of conventional network sampling strategies on the solutions found for two types of key node set identification problems: the minimum $p$-median problem and the influence maximization problem. Our results show that when the network is obtained using crawl-based network sampling strategies, both the minimum $p$-median and the influence maximization problems are effectively solved by simple heuristic algorithms with sampling ratios in the 10-20% range. We also find that among the three conventional sampling strategies checked in this paper (random sampling, random walk sampling, and sample edge counts), random walk sampling is the most robust strategy in terms of effectively identifying the key node sets of diverse types of networks.

Highlights

Identifying a set of key nodes in a given network is a fundamental research problem in network science, and it has broad application [1]–[6].
We investigated how conventional network sampling strategies affect the solutions obtained for the minimum p-median (MM) and influence maximization (IM) problems, which are popular key node set identification problems, when only partial networks are known.
Our results show the benefits of biases in crawl-based sampling strategies for the IM and MM problems.
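
To make the setting concrete, the following is a minimal sketch of crawl-based sampling followed by a simple heuristic, not the authors' implementation. It assumes an undirected network stored as a dictionary of neighbor sets, a plain random walk that stops once a target sampling ratio of distinct nodes has been visited, and a degree-based ranking as the "simple heuristic". The names `random_walk_sample` and `top_k_by_observed_degree`, the restart rule, and the step limit are all hypothetical choices made for illustration.

```python
import random


def random_walk_sample(adj, ratio, seed=0):
    """Crawl `adj` with a simple random walk (illustrative assumption).

    adj   : dict mapping each node to a set of neighbor nodes (undirected)
    ratio : fraction of nodes to visit, e.g. 0.1 for a 10% sample
    Returns (visited_nodes, observed_edges).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    target = max(1, int(ratio * len(nodes)))
    current = rng.choice(nodes)
    visited = {current}
    edges = set()
    steps = 0
    max_steps = 100 * len(nodes)  # hypothetical guard against endless walks
    while len(visited) < target and steps < max_steps:
        neighbors = sorted(adj[current])
        if neighbors:
            nxt = rng.choice(neighbors)
            edges.add(frozenset((current, nxt)))  # record the traversed edge
            current = nxt
        else:
            current = rng.choice(nodes)  # isolated node: restart elsewhere
        visited.add(current)
        steps += 1
    return visited, edges


def top_k_by_observed_degree(edges, k):
    """Rank sampled nodes by degree in the observed (partial) network."""
    degree = {}
    for edge in edges:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1
    # Highest observed degree first; ties broken by node label.
    return sorted(degree, key=lambda n: (-degree[n], n))[:k]


# Toy example: a ring of 20 nodes, sampled at a 25% ratio.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
sampled_nodes, sampled_edges = random_walk_sample(ring, 0.25)
candidates = top_k_by_observed_degree(sampled_edges, 2)
```

On a connected undirected network, a random walk visits nodes with probability proportional to their degree in the long run, so the sample is biased toward well-connected nodes. This is one kind of bias a crawl-based strategy introduces.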

Summary

Introduction

Identifying a set of key nodes in a given network is a fundamental research problem in network science, and it has broad application [1]–[6]. Examples of the key node set identification problem include classical problems in graph theory such as the minimum p-median (MM) and minimum p-center problems [3], [7]. The influence maximization (IM) problem is another popular key node set identification problem, which is expected to be useful for so-called "viral" marketing in social networks [4], [8]–[15]. IM aims to identify a small set of influential nodes (called seed nodes) for which the expected size of the influence cascade triggered by the seed nodes is maximized [8]. Note that the problem of identifying k most important nodes using centrality or other metrics
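
The IM objective described above, the expected size of the cascade triggered by a seed set, can be estimated by simulation. The sketch below is illustrative only: the text here does not state which diffusion model is used, so the independent cascade model with a single uniform activation probability `prob` is an assumption, and the function names `simulate_cascade` and `expected_spread` are hypothetical.

```python
import random


def simulate_cascade(adj, seeds, prob, rng):
    """Run one cascade under an assumed independent cascade model.

    Each newly activated node gets one chance to activate each inactive
    neighbor, succeeding with probability `prob`.
    Returns the number of nodes active when the cascade stops.
    """
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in active and rng.random() < prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def expected_spread(adj, seeds, prob, runs=1000, seed=0):
    """Monte Carlo estimate of the expected cascade size for `seeds`."""
    rng = random.Random(seed)
    total = sum(simulate_cascade(adj, seeds, prob, rng) for _ in range(runs))
    return total / runs


# Toy example: a star with hub 0 and leaves 1..5.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
hub_spread = expected_spread(star, [0], prob=0.5)
leaf_spread = expected_spread(star, [1], prob=0.5)
```

In this toy network, seeding the hub should yield a larger estimated spread than seeding a leaf, which is the kind of comparison an IM algorithm makes when choosing seed nodes.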

Methods

Results

Discussion

Conclusion