Abstract

Recent years have seen increased interest in the use of alternative data sources in the definition and production of official statistics and indicators for the UN Sustainable Development Goals. In this paper, we consider the application of data science to the production of official statistics, illustrating our perspective through the use of poverty targeting as an application. We show that machine learning can play a central role in the generation of official statistics, combining a variety of types of data (survey, administrative and alternative). We focus on the problem of poverty targeting using the Proxy Means Test in Indonesia, comparing a number of existing statistical and machine learning methods, then introducing new approaches in the spirit of small area estimation that utilize area-level features and data augmentation at the subdistrict level to develop more refined models at the district level, evaluating the methods on three districts in Indonesia on the problem of estimating 2020 per capita household expenditure using data from 2016–2019. The best performing method, XGBoost, is able to reduce inclusion/exclusion errors on the problem of identifying the poorest 40% of the population in comparison to the commonly used Ridge Regression method by between 4.5% and 13.9% in the districts studied.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call