Abstract

Ensembles of networks arise in many scientific fields, but there are relatively few statistical tools for inferring their generative processes, particularly in the presence of both dyadic dependence and cross-graph heterogeneity. To address this gap, we propose characterizing network ensembles via finite mixtures of exponential family random graph models (ERGMs), a class of parametric statistical models that has been successful in explicitly modeling the complex stochastic processes that govern the structure of edges in a network. Our proposed modeling framework can also be used for applications such as model-based clustering of ensembles of networks and density estimation for complex graph distributions. We develop a joint approach to estimate the number of mixture components and identify cluster-specific parameters simultaneously as well as to obtain an identified model under the Bayesian paradigm. Specifically, we develop a Metropolis-within-Gibbs algorithm to perform Bayesian inference, and estimate the number of mixture components using a strategy of deliberate overfitting with sparse priors that removes excess components during MCMC. As the true ERGM likelihood is generally intractable for model specifications with dyadic dependence terms, we consider two tractable approximations (pseudolikelihood and adjusted pseudolikelihood) to facilitate efficient statistical inference. We run simulation studies to compare the performance of these two approximations with respect to multiple metrics, showing conditions under which both are useful. We demonstrate the utility of the proposed approach using an ensemble of political co-voting networks among U.S. Senators and an ensemble of brain functional connectivity networks.
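
For concreteness, the model class described above can be sketched in generic notation. The symbols below (sufficient statistics g, parameters \theta_k, mixture weights \tau_k, change statistics \delta_{ij}) are standard ERGM conventions assumed here for illustration, not notation taken from the paper, and any size adjustment of the parameters is omitted.

```latex
\begin{align*}
% ERGM for one network y; the normalizing constant kappa sums over all graphs
P(Y = y \mid \theta) &= \frac{\exp\{\theta^\top g(y)\}}{\kappa(\theta)},
  \qquad \kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^\top g(y')\} \\
% Finite mixture with K components
P(Y = y \mid \tau, \theta_1, \dots, \theta_K) &= \sum_{k=1}^{K} \tau_k \,
  \frac{\exp\{\theta_k^\top g(y)\}}{\kappa(\theta_k)},
  \qquad \tau_k \ge 0, \quad \sum_{k=1}^{K} \tau_k = 1 \\
% Pseudolikelihood: product of full conditionals of each dyad
\mathrm{PL}(\theta; y) &= \prod_{i<j} P(y_{ij} \mid y_{-ij}, \theta),
  \qquad \operatorname{logit} P(y_{ij} = 1 \mid y_{-ij}, \theta) = \theta^\top \delta_{ij}(y)
\end{align*}
```

The sum defining \kappa(\theta) ranges over every graph on the node set, which is why the exact likelihood is intractable once dyadic dependence terms are included, and why a product of per-dyad conditionals is attractive as a substitute.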

Highlights

  • Data involving ensembles of networks – that is, multiple independent networks – arise in various scientific fields, including sociology (Slaughter and Koehly, 2016; Stewart et al, 2019), neuroscience (Simpson et al, 2011; Obando and De Vico Fallani, 2017), molecular biology (Unhelkar et al, 2017; Grazioli et al, 2019), and political science (Moody and Mucha, 2013) among others

  • For K-means clustering, we find that clustering on the sufficient statistics is uniformly inferior to clustering on the parameter estimates, whether obtained by maximum pseudolikelihood estimation (MPLE) or maximum likelihood estimation (MLE). Notably, K-means clustering based on MPLE outperforms that based on MLE when the network size is large but underperforms it when the network size is small

  • We note that pseudolikelihood (PL) and adjusted pseudolikelihood (APL) yield very similar posterior predictive performance when the network size is small (n = 40), but their difference becomes considerable when the network size is large (n = 100). In the latter case, the posterior samples estimated from APL have an advantage in producing networks that are closer to the target distribution with respect to mean eigenvector centrality and average inverse path length, whereas the posterior samples estimated from PL have an advantage in producing networks that are closer
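
As a rough illustration of the pseudolikelihood (PL) referred to in the highlights, the following minimal sketch evaluates a log-pseudolikelihood for an undirected graph under a hypothetical two-term specification (edges and triangles). The choice of terms and the function names `change_statistics` and `log_pseudolikelihood` are assumptions made for this example and are not taken from the paper; the adjusted pseudolikelihood (APL) is not shown.

```python
import numpy as np

def change_statistics(adj):
    """Return per-dyad change statistics (edges, triangles) and dyad values.

    For an undirected graph, toggling dyad (i, j) on adds one edge and as many
    triangles as i and j have common neighbours.
    """
    adj = np.asarray(adj)
    n = adj.shape[0]
    common = adj @ adj                      # common[i, j] = shared neighbours of i and j
    iu, ju = np.triu_indices(n, k=1)        # each dyad i < j exactly once
    delta = np.column_stack([np.ones(iu.size), common[iu, ju]])
    y = adj[iu, ju]
    return delta, y

def log_pseudolikelihood(theta, adj):
    """Sum of Bernoulli log-probabilities with logit theta' delta_ij for every dyad."""
    delta, y = change_statistics(adj)
    eta = delta @ np.asarray(theta, dtype=float)
    # log P(y_ij | rest) = y_ij * eta_ij - log(1 + exp(eta_ij))
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

# Example: a complete graph on three nodes (a single triangle)
triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])
value = log_pseudolikelihood([0.0, 0.0], triangle)
```

With both parameters set to zero, every dyad has conditional probability 1/2, so each of the three dyads contributes log(1/2) to `value`. In a Bayesian sampler, a function of this kind would stand in for the intractable log-likelihood when evaluating a Metropolis acceptance ratio.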


Summary

Introduction

Data involving ensembles of networks – that is, multiple independent networks – arise in various scientific fields, including sociology (Slaughter and Koehly, 2016; Stewart et al, 2019), neuroscience (Simpson et al, 2011; Obando and De Vico Fallani, 2017), molecular biology (Unhelkar et al, 2017; Grazioli et al, 2019), and political science (Moody and Mucha, 2013), among others. Existing approaches have either not posited a generative model for the parameters of the base distribution, as in descriptive meta-analytic approaches (which can be problematic when model interpretation and simulation from the resulting model are of interest), or are not suitable for identifying subpopulations from heterogeneous data (as in hierarchical models without mixture structure). Although work such as that of Lehmann et al (2021) enables the modeling of heterogeneity in brain functional connectivity networks, it requires that subpopulation labels be observed (which is often not the case). The ability to provide generative and interpretable models of complex network structure is an important asset of this approach, which we leverage here in the context of graph ensembles.

Definition and Estimation
Size-Adjusted Parameterizations
Finite Mixtures of ERGMs
Bayesian Estimation
Approximations to Intractable Likelihoods
Identifying the Number of Clusters
Post MCMC Inference
Choosing Between Competing Model Specifications
Posterior Probability of Cluster Membership
Simulation Studies
Performance Evaluation
Simulation Design
Identification of Mixture Components
Parameter Estimation
Posterior Predictive Assessments
Computation Time
Application to Political Co-Voting Networks
Model Specification and Estimation
Application to Brain Functional Connectivity Networks
Model Specification
Results
Conclusion
