Abstract

An increasing number of technologies record traces of human activity, offering enormous potential for analyzing human behavior. Sensor data is typically non-probability-based because not every element in the target population has a positive and known probability of being recorded. Accordingly, using such data as a primary data source for population inference is currently an active field of research. In this paper, an algorithmic population inference framework using network analysis and non-probability data is developed. This approach is demonstrated using road sensor data as the primary source to infer Dutch freight traffic across the state road network. We interpret the Dutch state road network as a graph, with traffic junctions as vertices and state roads as edges. Road sensors, installed on a non-probability sample of edges, detect passing transport vehicles. Photographs of the license plates offer the rare opportunity to link sensor data with population registers. Extreme gradient boosting is applied to learn the probability that a sensor detects a vehicle from features describing time, edge, vehicle, and vehicle owner. Population inference is made by using the learned relationship to predict the probability of detection on each day of the year, along each edge in the network, for each vehicle in the population. Different data scenarios were designed to simulate the effects of the non-probability nature of the data and the extreme class imbalance, and several performance metrics were applied. Training and testing on an imbalanced non-probability sample with about 27 million records and over 100 features revealed substantial variation in model performance across test sets. Promising results were achieved using a balanced probability sample as a control: the model performed about halfway between random guessing and perfect prediction.
These results are of high practical importance because combining non-probability data with administrative data is currently considered one of the most promising approaches across several disciplines.
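
To make the inference step described above more concrete, the following is a minimal sketch of how per-vehicle detection probabilities could be aggregated over a road graph. It is an illustration under stated assumptions, not the paper's implementation: the toy network, the vehicle register, and the function `detection_probability` (a hand-written stand-in for the trained gradient boosting model) are all hypothetical, and summing probabilities into expected counts is one plausible aggregation rather than a step confirmed by the abstract.

```python
from collections import defaultdict

# Hypothetical toy network: junctions are vertices, state roads are edges.
edges = [("A", "B"), ("B", "C"), ("A", "C")]

# Hypothetical non-probability sample of edges that carry a sensor.
sensor_edges = {("A", "B")}

# Hypothetical vehicle register (stands in for the linked population register).
vehicles = [
    {"id": "v1", "home": "A"},
    {"id": "v2", "home": "C"},
]


def detection_probability(vehicle, edge, day):
    """Stand-in for a trained classifier's predicted probability.

    Assumption: in practice this value would come from a fitted model;
    the rules below are invented purely for illustration.
    """
    p = 0.02
    if vehicle["home"] in edge:  # vehicle registered at an endpoint of the edge
        p += 0.03
    if day % 7 in (5, 6):        # treat days 5 and 6 of each week as weekend
        p *= 0.5
    return p


def expected_counts(vehicles, edges, days):
    """Sum predicted probabilities per (edge, day).

    By linearity of expectation, the sum is the expected number of
    detected vehicles on that edge on that day.
    """
    totals = defaultdict(float)
    for day in days:
        for edge in edges:
            for vehicle in vehicles:
                totals[(edge, day)] += detection_probability(vehicle, edge, day)
    return dict(totals)


counts = expected_counts(vehicles, edges, range(7))

# Edges without a sensor still receive a prediction for every day.
unobserved = [e for e in edges if e not in sensor_edges]
```

Because predictions are made for every edge, vehicle, and day, the sketch also yields values for edges that carry no sensor, which is the sense in which the learned relationship is extended from the sensor sample to the whole network.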

