Abstract

Probabilistic topic modelling is a machine learning technique that has recently begun to find application in the social sciences. With almost no human supervision, probabilistic topic models can infer the thematic structure of large textual datasets, making them an appealing tool for scholars in fields such as communication studies, where such datasets are increasingly common. However, topic models also present social scientists with a range of conceptual and practical challenges, many of which are yet to be satisfactorily resolved. Far from making life simpler for social scientists, the outputs of topic models can be bewildering, not only because of their complexity (a model may include dozens of topics, each of which is defined by dozens of terms) but also because of their multiplicity, since a topic model can produce not one, but infinitely many, subtly different sets of topics to describe a given dataset. Further difficulties arise from the complexity of the data itself: in social science, textual datasets often represent diverse assemblages of actors, and the meaning of the texts may depend on the circumstances of their production as much as on their textual content. Social scientists therefore must derive, interpret and employ topic models with due regard for the numerous variables that define the social context of the underlying data.

This thesis proposes methods and concepts designed to address several methodological challenges inherent in the use of topic models in social science. At a broad theoretical level, it examines how contextual information can be incorporated into the interpretation and use of topic models. To this end, it proposes a methodological framework, named the DICE framework, that articulates how the four tasks of topic model derivation, interpretation, contextualisation and employment intersect with one another.
Novel to this framework is the concept of topic model contextualisation, which I define as the systematic examination of how a topic model relates to contextual information.

At a practical level, this thesis examines novel and under-explored methods that can assist specific tasks relating to the derivation, interpretation, and contextualisation of topic models. In relation to topic model derivation, I present two methods designed to facilitate the qualitative comparison and evaluation of candidate topic models, a need that is not adequately addressed by currently available evaluation methods, which focus on quantitative metrics. In relation to the interpretation of topic models, I explore the affordances of hierarchically clustering a large number of topics to produce a more manageable number of thematically defined groupings. Although published previously, this method has received surprisingly little scholarly attention. Finally, I demonstrate a novel method for conducting a contextual analysis of a topic model. This method combines visual and numerical information in a tabular output to provide rapid and deep insights about how topics relate both individually and collectively to a variable of interest. While the methods are all demonstrated using latent Dirichlet allocation (LDA), which is the simplest but most widely used form of topic model, they may also be adapted to more recent and specialised topic modelling algorithms.

I demonstrate these methods and concepts through a series of analyses of a specially compiled case study dataset, which contains 26,679 news articles and stakeholder-produced texts relating to the development of coal seam gas in Australia from 1991 to December 2015. The contents of this dataset were deliberately chosen to provide a collection of texts that embody the structural complexity and contextual sensitivity typical of textual data in social science.
In the analyses, I pay particular attention to the geographic context of the discourse, adding to the currently limited literature on geographic applications of topic models. Beyond demonstrating the utility of the methods and concepts examined, the analyses highlight how contextual information can and should inform the interpretation and use of topic models. These insights are relevant not only to practitioners working with the current crop of topic models, but also to model developers who wish to create new topic models tailored to the needs of social scientists.

