Abstract

Predictions of node categories are commonly used to estimate homophily and other relational properties in networks. However, little is known about the validity of using predictions for this task. We show that estimating homophily in a network is a problem of predicting categories of dyads (edges) in the graph. Homophily estimates are unbiased when predictions of dyad categories are unbiased. Node-level prediction models, such as the use of names to classify ethnicity or gender, do not generally produce unbiased predictions of dyad categories and therefore produce biased homophily estimates. Bias comes from three sources: sampling bias, correlation between model errors and node degree, and correlation between node-level model errors along dyads. We examine three methods for estimating homophily: predicting node categories, predicting dyad categories, and a hybrid “ego–alter” approach. This analysis indicates that only the dyadic prediction approach is unbiased, whereas the node-level approach produces both high bias and high overall error. We find that node-level classification performance is not a reliable indicator of accuracy for homophily. Although this article focuses on a particular version of homophily, results generalize to heterophilous cases and other dyadic measures. We conclude with suggestions for research design. Code for this article is available at https://github.com/georgeberry/autocorr.

Highlights

  • Predictions of node categories are commonly used to estimate homophily and other relational properties in networks

  • We demonstrate that classification performance at the node level is not informative about bias and error in homophily estimates

  • We have examined the problem of estimating homophily when predictions must be used for node attributes


Summary

Introduction

Predictions of node categories are commonly used to estimate homophily and other relational properties in networks. Node-level prediction models, such as the use of names to classify ethnicity or gender, do not generally produce unbiased predictions of dyad categories and therefore produce biased homophily estimates. We examine three methods for estimating homophily: predicting node categories, predicting dyad categories, and a hybrid “ego–alter” approach. This analysis indicates that only the dyadic prediction approach is unbiased, whereas the node-level approach produces both high bias and high overall error. Results extend to evaluating homophily with respect to any partially observed values that are predicted or imputed from other information, including survey or administrative data analyzed from a network perspective. Despite the widespread use of predictions to estimate homophily and other relational measures in a variety of empirical settings (De Choudhury 2011; Colleoni et al 2014; Bakshy et al 2015; Himelboim et al 2016; Hobbs et al 2016; Boutyline and Willer 2017; Cesare et al 2017b; Messias et al 2017), little work investigates when such predictions provide reasonable estimates (Berry et al 2018).
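
The claim that node-level predictions yield biased homophily estimates can be illustrated with a small simulation. The sketch below is not the authors' code or their homophily measure. It assumes two categories, takes the share of same-category edges as a stand-in for homophily, and assumes a hypothetical node classifier whose errors are symmetric and independent across nodes. All function names and parameter values (`simulate`, `p_same`, `accuracy`, and so on) are illustrative. Even in this favorable case, with none of the degree-correlated or dyad-correlated errors discussed above, the node-level estimate is pulled toward 0.5.

```python
import random


def same_category_share(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def simulate(n_nodes=2000, n_edges=20000, p_same=0.8, accuracy=0.9, seed=0):
    """Return (true share, node-level estimated share) of same-category edges.

    Assumed setup (illustrative only): binary categories, edges drawn so that
    roughly a fraction p_same join same-category nodes, and a classifier that
    flips each node's label independently with probability 1 - accuracy.
    """
    rng = random.Random(seed)
    true = [rng.randint(0, 1) for _ in range(n_nodes)]
    by_label = {
        0: [i for i, c in enumerate(true) if c == 0],
        1: [i for i, c in enumerate(true) if c == 1],
    }

    edges = []
    while len(edges) < n_edges:
        u = rng.randrange(n_nodes)
        # Pick the partner's category first, then a node from that category.
        target = true[u] if rng.random() < p_same else 1 - true[u]
        v = rng.choice(by_label[target])
        if v != u:  # skip self-loops
            edges.append((u, v))

    # Node-level predictions with independent, symmetric errors.
    pred = [c if rng.random() < accuracy else 1 - c for c in true]
    return same_category_share(edges, true), same_category_share(edges, pred)


def expected_node_level_share(true_share, accuracy):
    """Expected estimate under independent symmetric errors (binary case).

    A dyad keeps its observed status (same or different) when both endpoints
    are classified correctly or both are misclassified.
    """
    keep = accuracy ** 2 + (1 - accuracy) ** 2
    return true_share * keep + (1 - true_share) * (1 - keep)


if __name__ == "__main__":
    true_share, est_share = simulate()
    print(f"true same-category share:      {true_share:.3f}")
    print(f"node-level estimated share:    {est_share:.3f}")
    print(f"expected under the assumption: "
          f"{expected_node_level_share(true_share, 0.9):.3f}")
```

Under these assumptions a classifier that is 90% accurate per node labels a dyad's status correctly only about 82% of the time, which is one way to see why node-level classification performance is not a reliable guide to the accuracy of the homophily estimate.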
