Parsing Social Network Survey Data from Hidden Populations Using Stochastic Context-Free Grammars

Art F Y Poon,Simon D W Frost,Sergei L Kosakovsky Pond,Michelle Firestone-Cruz,Kimberly C Brouwer,Douglas D Heckathorn,Remedios M Lozada,Steffanie A Strathdee,Alison P Galvani

doi:10.1371/journal.pone.0006777


Open Access

https://doi.org/10.1371/journal.pone.0006777


Abstract

Background

Human populations are structured by social networks, in which individuals tend to form relationships based on shared attributes. Certain attributes that are ambiguous, stigmatized or illegal can create a ‘hidden’ population, so-called because its members are difficult to identify. Many hidden populations are also at an elevated risk of exposure to infectious diseases. Consequently, public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). The concomitant accumulation of network-based epidemiological data, however, is rapidly outpacing the development of computational methods for analysis. Moreover, current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.

Methodology/Principal Findings

Here, we develop a new methodology based on stochastic context-free grammars (SCFGs), which are well-suited to modeling tree-like structure of the RDS recruitment process. We apply this methodology to an RDS case study of injection drug users (IDUs) in Tijuana, México, a hidden population at high risk of blood-borne and sexually-transmitted infections (i.e., HIV, hepatitis C virus, syphilis). Survey data were encoded as text strings that were parsed using our custom implementation of the inside-outside algorithm in a publicly-available software package (HyPhy), which uses either expectation maximization or direct optimization methods and permits constraints on model parameters for hypothesis testing. We identified significant latent variability in the recruitment process that violates assumptions of Markov chain-based methods for RDS analysis: firstly, IDUs tended to emulate the recruitment behavior of their own recruiter; and secondly, the recruitment of like peers (homophily) was dependent on the number of recruits.

Conclusions

SCFGs provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks. Such structure that has no representation in Markov chain-based models can interfere with the estimation of the composition of hidden populations if left unaccounted for, raising critical implications for the prevention and control of infectious disease epidemics.

Highlights

Hidden populations consist of individuals sharing one or more common attributes that are masked from public surveillance, either because they are rare, difficult to measure or define, or stigmatized and/or illegal.
Stochastic context-free grammars (SCFGs) provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks.

Summary

Introduction

Hidden populations consist of individuals sharing one or more common attributes that are masked from public surveillance, either because they are rare, difficult to measure or define (e.g., jazz musicians [1]), or stigmatized and/or illegal (e.g., injection drug use [2]). Chain-referral sampling techniques such as ‘snowball’ sampling [3] solicit members of the hidden population to provide contact information on behalf of their peers. Conjectures from chain-referral samples are susceptible to the non-randomness of the initial sample (‘seed’ individuals), which tends to comprise the most accessible members of the hidden population (e.g., those enrolled into an institutional setting, such as a drug treatment program [4,5]). Public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). Current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.

Methods
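
The abstract describes encoding survey data as text strings and parsing them with the inside-outside algorithm. As a rough illustration of the inside pass only, the sketch below computes the probability of a string under an SCFG in Chomsky normal form. It is not the HyPhy implementation: the function name `inside_probability`, the toy grammar, its probabilities, and the symbols `a` and `b` (placeholders for encoded respondent attributes) are all assumptions made for this example. The outside pass and parameter estimation are not shown.

```python
from collections import defaultdict


def inside_probability(tokens, binary_rules, lexical_rules, start="S"):
    """Return P(tokens | grammar) for an SCFG in Chomsky normal form.

    binary_rules maps (lhs, left, right) -> probability of lhs -> left right.
    lexical_rules maps (lhs, terminal) -> probability of lhs -> terminal.
    """
    n = len(tokens)
    # inside[(i, j)][X] = probability that nonterminal X derives tokens[i:j]
    inside = defaultdict(lambda: defaultdict(float))

    # Base case: spans of width 1 are produced by lexical rules.
    for i, tok in enumerate(tokens):
        for (lhs, terminal), p in lexical_rules.items():
            if terminal == tok:
                inside[(i, i + 1)][lhs] += p

    # Recursion: build wider spans by summing over split points and rules.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (lhs, left, right), p in binary_rules.items():
                    p_left = inside[(i, k)].get(left, 0.0)
                    p_right = inside[(k, j)].get(right, 0.0)
                    if p_left and p_right:
                        inside[(i, j)][lhs] += p * p_left * p_right

    return inside[(0, n)].get(start, 0.0)


# Hypothetical toy grammar (not from the paper): S -> S S | 'a' | 'b'
BINARY = {("S", "S", "S"): 0.3}
LEXICAL = {("S", "a"): 0.5, ("S", "b"): 0.2}

if __name__ == "__main__":
    print(inside_probability(list("aab"), BINARY, LEXICAL))
```

Because the toy grammar is ambiguous, the string `aab` has two parse trees, and the inside pass sums the probability of both. This summation over latent tree structures is what distinguishes the grammar-based likelihood from a single-chain model.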

Results

Discussion

Conclusion