Abstract

Mobile phones are one of the fastest growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called call detail records (CDR), are generated every time phones are used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, because most phones in developing countries are prepaid, the data lack key information about the user, including gender and other demographic variables. This precludes numerous uses of these data in social science and development economics research. It also severely hampers the development of humanitarian applications, such as using mobile phone data to target aid towards the most vulnerable groups during crises. We developed a framework to extract more than 1400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. Here we present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show that standard machine learning algorithms trained on only 10,000 users are sufficient to predict an individual's gender with an accuracy ranging from 74.3 to 88.4% in a developed country and from 74.5 to 79.7% in a developing country, using only metadata. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training size to 5,000 users, but decreases significantly with smaller training sets. Finally, using factor analysis, we show that our indicators capture a large range of behavioral traits, and that the framework can be used to predict other indicators of vulnerability such as age or socio-economic status.
Mobile phone data have great potential for good, and our framework allows these data to be augmented with vulnerability and other information at a fraction of the cost.
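
The abstract does not say how the group estimates are calibrated. As an illustration only, the sketch below uses one common correction, sometimes called adjusted classify-and-count, which rescales the raw predicted share in a group using the classifier's error rates measured on labeled users. The function name `adjusted_share` and all numbers are hypothetical and are not taken from the study.

```python
def adjusted_share(predicted_share, tpr, fpr):
    """Correct a raw predicted share for known classifier error rates.

    predicted_share: fraction of a group the classifier labels as positive
    tpr: true positive rate measured on labeled validation users
    fpr: false positive rate measured on the same validation users
    Assumption: the error rates in the group match those on the validation set.
    """
    if tpr <= fpr:
        raise ValueError("classifier must do better than chance (tpr > fpr)")
    corrected = (predicted_share - fpr) / (tpr - fpr)
    # Clip to the valid range, since sampling noise can push the value outside [0, 1].
    return min(1.0, max(0.0, corrected))


# Hypothetical error rates and raw prediction, chosen for illustration only.
tpr = 0.80   # positives correctly identified
fpr = 0.25   # negatives wrongly labeled as positive
raw = 0.47   # share of the group predicted as positive

estimate = adjusted_share(raw, tpr, fpr)
print(f"raw share: {raw:.2f}, corrected share: {estimate:.2f}")
```

With these made-up rates, a raw predicted share of 0.47 corresponds to a corrected share of about 0.40, because a true share of 0.40 would yield 0.40 × 0.80 + 0.60 × 0.25 = 0.47 predicted positives.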

Highlights

  • Mobile phone metadata, automatically generated by our phones and recorded at large scale by carriers, have the potential to fundamentally transform the way we fight diseases, collect official statistics, or design transportation systems

  • We explore how different socio-demographic variables of a user can be reliably predicted from mobile phone data and how this information could be used at scale by NGOs like Flowminder, reducing the data collection cost by approximately two orders of magnitude

  • We validate the applicability of our framework in two real-world use cases and show that our method performs well in both


Summary

Introduction

Mobile phone metadata, automatically generated by our phones and recorded at large scale by carriers, have the potential to fundamentally transform the way we fight diseases, collect official statistics, or design transportation systems. They have, for instance, already been used to study human mobility and behavior in cities [ ], the geographical partitioning of countries [ ], and the spread of information in social networks [ ]. The potential of large-scale mobile phone data is especially great in developing countries. While reliable basic statistics are often missing or suffer from severe bias [ ], mobile phones are one of the fastest growing technologies in the developing world, with penetration rates ranging from % in Uganda to % in Ghana [ ]. Mobile phone data have, for instance, already been used to model the spread of malaria [ ] and dengue fever [ ], and to perform real-time population density mapping [ ]. Orange made large samples of mobile phone data from Côte d’Ivoire and Sénégal available to selected researchers through its Data for Development Challenges [ ]. The United Nations has called for the use of mobile phone data in support of the Sustainable Development Goals [ ].
