Uncovering dynamic textual topics that explain crime.

Seppo Virtanen

doi:10.1098/rsos.210750

Abstract

Crime analysis and mapping techniques have been developed and applied for crime detection and prevention, predicting where and when crime occurs by leveraging historical crime records over a spatial area together with covariates for the spatial domain. Some of these techniques may provide insights for understanding crime and disorder, especially by interpreting the regression weights of the spatial covariates. However, to date, temporal covariates for the time domain have not played a significant role in the analysis. In this work, we collect time-stamped crime-related news articles, infer crime topics or themes from the collection and associate the topics with historical numeric crime counts. We provide a proof-of-concept study in which, instead of adopting spatial covariates, we focus on temporal (or dynamic) covariates and assess their utility. We present a novel joint model tailored for the crime articles and counts, such that the temporal covariates (latent variables, more generally) are inferred from both data sources. We apply the model to violent crime in London.

Highlights

Streaming news articles from reputable sources provide accessible, real-time, detailed information about significant and prominent crime events that affect society, together with insights and analyses exploring crime trends and the causes and effects of crime on society
We present a joint statistical model for the two data sources that combines dynamic topic modelling and Poisson matrix factorization by sharing latent variables between the separate models/data sources across the time stamps/windows
Each article is expressed as a set/bag of words and the model assumes the words are generated from a categorical distribution, whose expectation parameters correspond to topics

Summary

Introduction

Streaming news articles from reputable sources provide accessible, real-time, detailed information about significant and prominent crime events that affect society, together with insights and analyses exploring crime trends and the causes and effects of crime on society. We hypothesize that time-stamped crime news articles provide a rich source of information and context that may be used to explain and predict numeric crime counts. Each article is expressed as a set/bag of words, and the model assumes the words are generated from a categorical distribution whose expectation parameters correspond to topics. Our joint model of crime news articles and counts assumes that the latent variables correspond to the parameters of the Dirichlet distributions for the topic proportions of each group/time window. The latent variables indicate which topics (thematic word distributions over the vocabulary), denoted by ηk for k = 1, ..., K, are associated with each group or time stamp, and to what degree. For the latent variables and the remaining variables, we adopt slice sampling, following Virtanen & Girolami [13].
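To make the generative assumptions above concrete, the following minimal Python sketch simulates data from a simplified version of such a joint model. It is an illustration under stated assumptions, not the author's implementation: the function name simulate_joint_data, the Gamma priors, the loading matrix W, the number of count series C and all sizes are hypothetical choices that do not come from the paper. The sketch only shows how per-window latent variables can act both as Dirichlet parameters for the topic proportions of articles and as factors in a Poisson rate for the counts.

```python
import numpy as np


def simulate_joint_data(T=6, K=3, V=20, C=2, docs_per_window=4,
                        words_per_doc=30, seed=0):
    """Simulate articles and counts that share per-window latent variables.

    All priors and sizes are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Topics eta_k: each row is a categorical distribution over the vocabulary.
    eta = rng.dirichlet(np.full(V, 0.5), size=K)           # shape (K, V)

    # Latent variables: positive Dirichlet parameters for each time window.
    alpha = rng.gamma(shape=1.0, scale=1.0, size=(T, K)) + 1e-3

    # Hypothetical non-negative loadings linking latent variables to C count series.
    W = rng.gamma(shape=1.0, scale=1.0, size=(K, C))

    docs = []
    for t in range(T):
        window_docs = []
        for _ in range(docs_per_window):
            # Topic proportions of one article in window t.
            theta = rng.dirichlet(alpha[t])
            # Word distribution of the article: a mixture of the topics.
            word_probs = theta @ eta
            word_probs = word_probs / word_probs.sum()      # guard against rounding
            # Bag-of-words representation: word counts over the vocabulary.
            window_docs.append(rng.multinomial(words_per_doc, word_probs))
        docs.append(np.array(window_docs))
    docs = np.stack(docs)                                   # shape (T, D, V)

    # Poisson matrix factorization: rates are products of shared latent
    # variables and loadings.
    counts = rng.poisson(alpha @ W)                         # shape (T, C)
    return eta, alpha, docs, counts


if __name__ == "__main__":
    eta, alpha, docs, counts = simulate_joint_data()
    print("articles array:", docs.shape)
    print("counts array:", counts.shape)
```

In this sketch, a window with a large latent value for topic k tends to contain articles dominated by that topic and, through the hypothetical loadings W, also shifts the expected counts. Inference would run in the opposite direction, for example with slice sampling as mentioned above.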

Quantitative model comparison

Inspection of the inferred topics

Discussion