Inferring the Population Mean with Second-Order Information in Online Social Networks

Saran Chen, Zhongwei Jia, Xin Lu, Zhong Liu

Entropy 2018, 20, 480; doi:10.3390/e20060480

Abstract

With the increasing use of online social networking platforms, online surveys are widely used in many fields, e.g., public health, business and sociology, to collect samples and to infer population characteristics from the self-reported data of respondents. Although online surveys can protect the privacy of respondents, self-reporting suffers from low response rates and unreliable answers when the survey contains sensitive questions, such as questions on drug use, sexual behaviors, abortion or criminal activity. To overcome this limitation, this paper develops an approach that collects second-order information from the respondents, i.e., asking them about the characteristics of their friends instead of asking about the respondents’ own characteristics directly. We then infer the population variable with the Hansen-Hurwitz estimator under two classic sampling strategies (simple random sampling and random walk-based sampling). The method is evaluated by simulations on both artificial and real-world networks. Results show that the method generates population estimates with high accuracy without knowing the respondents’ own characteristics, and that the biases of the estimates under various settings are relatively small and within acceptable limits. The new method offers an alternative way of implementing surveys online and is expected to collect more reliable data, improving population inference on sensitive variables.

Highlights

Online social networking platforms, e.g., Facebook, Twitter, etc., on which users share their daily life and build social relations with others, provide a tremendous amount of data for researchers to study social phenomena and to validate theoretical models [1,2].
Simulation with simple random sampling: We first implemented the developed methods on different networks with varying characteristics and studied the performance of the estimator developed for simple random sampling, i.e., SEC1.
An analysis of variance (ANOVA) test [41] indicated that there was no significant difference in the average biases among estimates for networks with different average degrees (p-value = 0.94).

Summary

Introduction

Online social networking platforms, e.g., Facebook, Twitter, etc., on which users share their daily life and build social relations with others, provide a tremendous amount of data for researchers to study social phenomena and to validate theoretical models [1,2]. Compared with offline surveys such as face-to-face interviews, online surveys are cost efficient and easy to implement through social networking platforms, and they can protect the privacy of respondents because no interviewer is present [10]. From samples collected by popular sampling strategies, such as simple random sampling and random walk-based sampling, the population mean is easy to infer when self-reported data on the respondents’ own characteristics are available [11,12]. When the respondents are randomly selected from the population, the population mean can be estimated by the sample mean. When the respondents are selected via a crawler-like random walk, the population mean is typically estimated by re-weighting the observations by nodal degree [15,16,17].
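The two self-report estimators mentioned in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy network and the node attribute are all hypothetical. For simple random sampling it assumes the plain sample mean, the standard choice. For the random walk it assumes the usual inverse-degree weighting, which follows from the fact that a simple random walk on a connected undirected network visits nodes with probability proportional to their degree. The second-order estimator developed in the paper is not reproduced here.

```python
import random


def srs_mean(values):
    """Sample mean, the standard estimator under simple random sampling."""
    return sum(values) / len(values)


def degree_weighted_mean(values, degrees):
    """Inverse-degree weighted mean for a random-walk sample.

    Each observation is weighted by 1/degree to offset the walk's
    tendency to visit high-degree nodes more often.
    """
    if len(values) != len(degrees):
        raise ValueError("values and degrees must have the same length")
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def random_walk(adjacency, start, steps, rng):
    """Return the nodes visited by a simple random walk of `steps` moves."""
    node = start
    visited = []
    for _ in range(steps):
        node = rng.choice(adjacency[node])
        visited.append(node)
    return visited


# Hypothetical toy network: node 0 is a hub linked to every other node,
# and nodes 1 and 2 are also linked to each other.
adjacency = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2],
    2: [0, 1],
    3: [0],
    4: [0],
    5: [0],
}
# Hypothetical binary attribute: only the hub has it, so the true mean is 1/6.
attribute = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
true_mean = sum(attribute.values()) / len(attribute)

rng = random.Random(0)
walk = random_walk(adjacency, start=0, steps=20000, rng=rng)
sample_values = [attribute[v] for v in walk]
sample_degrees = [len(adjacency[v]) for v in walk]

naive = srs_mean(sample_values)  # over-represents the hub
weighted = degree_weighted_mean(sample_values, sample_degrees)

print("true mean:", true_mean)
print("unweighted mean of the walk sample:", naive)
print("degree-weighted estimate:", weighted)
```

On this toy network the unweighted mean of the walk sample overstates the prevalence because the hub is visited far more often than its share of the population, while the degree-weighted estimate should land close to the true mean.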

Objectives

Methods

Results

Conclusion