Applying machine learning and geolocation techniques to social media data (Twitter) to develop a resource for urban planning.

Sveta Milusheva, Guadalupe Bedoya, Arianna Legovini, Robert Marty, Elizabeth Resor, Sarah Williams, Feng Chen

doi:10.1371/journal.pone.0244317

Abstract

With all the recent attention focused on big data, it is easy to overlook that basic vital statistics remain difficult to obtain in most of the world. What makes this frustrating is that private companies hold potentially useful data, but it is not accessible to the people who can use it to track poverty, reduce disease, or build urban infrastructure. This project set out to test whether we can transform an openly available dataset (Twitter) into a resource for urban planning and development. We test our hypothesis by creating road traffic crash location data, which is scarce in most resource-poor environments but essential for addressing the number one cause of mortality for children over five and young adults. The research project scraped 874,588 traffic-related tweets in Nairobi, Kenya, applied a machine learning model to capture the occurrence of a crash, and developed an improved geoparsing algorithm to identify its location. We geolocate 32,991 crash reports on Twitter for 2012-2020 and cluster them into 22,872 unique crashes during this period. For a subset of crashes reported on Twitter, a motorcycle delivery service was dispatched in real time to verify the crash and its location; the results show 92% accuracy. To our knowledge, this is the first geolocated dataset of crashes for the city, and it allowed us to produce the first crash map for Nairobi. Using a spatial clustering algorithm, we are able to locate the portions of the road network (<1%) where 50% of the identified crashes occurred. Even with limitations in the representativeness of the data, the results can give urban planners useful information for targeting road safety improvements where resources are limited. The work shows how Twitter data might be used to create other types of essential data for urban planning in resource-poor environments.

Highlights

We focus on road traffic crashes (RTCs)
We resolve the toponym in three steps: (1) if a road is mentioned, we search for landmarks that are within 500 m of it, (2) we use the centroid of the clustered location if 90% or more of the landmarks fall within a 500 m radius, or (3) we rank the landmarks by the probability of being correct using the landmark type in the truth data
We classify 52,228 tweets as crash-related out of a universe of 874,588 tweets during 2012-2020 (Panel A of Fig 2). This is based on the support vector machine (SVM) algorithm, which we find performs better than the Naive Bayes algorithm according to the F1 statistic
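
The three toponym-resolution steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the representation of a road as a list of sample points, the choice to measure the 500 m radius around the centroid of all candidate landmarks, the fallback when no landmark lies near the road, and the per-type probabilities are all assumptions made for the example.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def centroid(points):
    """Arithmetic mean of (lat, lon) points; adequate at city scale."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def resolve_toponym(candidates, road_points=None, type_prob=None,
                    radius_m=500, share=0.9):
    """Pick one location from candidate landmarks matched in a tweet.

    candidates:  list of dicts with 'type' and 'coord' (hypothetical schema)
    road_points: sample points along a mentioned road, or None (assumption)
    type_prob:   hypothetical probability of being correct per landmark type
    """
    if not candidates:
        return None

    # Step 1: if a road is mentioned, keep landmarks within radius_m of it.
    # Assumption: fall back to all candidates when none is near the road.
    if road_points:
        near = [c for c in candidates
                if any(haversine_m(c["coord"], r) <= radius_m for r in road_points)]
        if near:
            candidates = near

    # Step 2: if at least `share` of the landmarks sit within radius_m of
    # their common centroid, return the centroid of that cluster.
    center = centroid([c["coord"] for c in candidates])
    inside = [c for c in candidates if haversine_m(c["coord"], center) <= radius_m]
    if len(inside) / len(candidates) >= share:
        return centroid([c["coord"] for c in inside])

    # Step 3: otherwise rank by the per-type probability and take the best.
    type_prob = type_prob or {}
    best = max(candidates, key=lambda c: type_prob.get(c["type"], 0.0))
    return best["coord"]

# Illustrative, made-up landmarks around central Nairobi.
landmarks = [
    {"type": "bus_stop", "coord": (-1.2921, 36.8219)},
    {"type": "mall", "coord": (-1.2930, 36.8225)},
    {"type": "mall", "coord": (-1.2500, 36.9000)},
]
print(resolve_toponym(landmarks, road_points=[(-1.2925, 36.8220)]))
```

When a road is supplied, the distant landmark is filtered out in step 1 and the remaining pair is resolved to its centroid in step 2. Without a road, the candidates are too spread out for step 2, so the sketch falls through to the type-based ranking in step 3.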

Summary

Introduction

We use private data to describe the urban environment, analyzing events reported on social media that affect people’s safety, such as road traffic crashes, crime or floods. We create an algorithm that classifies transport-related tweets into geolocated RTCs for Nairobi. This is done by building on existing literature to test two natural language processing algorithms to identify crash reports [20, 21], developing an improved geoparsing algorithm to extract data on crash time and location [22, 23, 24, 25, 26, 27, 28], and ground-truthing the results. Improved algorithms can lead to faster and better geolocation of events, which would help urban planners and policy makers improve responses and better target interventions.
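
Since the two candidate classifiers are compared on the F1 statistic, a short sketch of that comparison may help. The F1 statistic is the harmonic mean of precision and recall. The labels and predictions below are made up for illustration and are not the study's data; `pred_a` and `pred_b` stand in for the outputs of any two classifiers.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy hand-labelled tweets: 1 = crash report, 0 = other traffic tweet.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
pred_a = [1, 1, 0, 0, 0, 0, 1, 0]  # hypothetical classifier A
pred_b = [1, 0, 0, 1, 0, 0, 1, 1]  # hypothetical classifier B

scores = {"A": f1_score(y_true, pred_a), "B": f1_score(y_true, pred_b)}
best = max(scores, key=scores.get)
print(scores, "-> keep classifier", best)
```

F1 is a reasonable yardstick here because crash reports are a small share of all traffic tweets, so plain accuracy would reward a classifier that rarely predicts a crash.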

Method

Results

Discussion