Abstract

The Office for National Statistics (ONS) is currently undertaking a substantial research program into using price information scraped from online retailers in the Consumer Prices Index including occupiers’ housing costs (CPIH). To make full use of these data, we must classify them into the product types that make up the basket of goods and services used in the current collection. A common problem is that labeled training data are limited and that manually enlarging the training set is either impossible or impractical, as is the case with web-scraped price data. We use a semi-supervised machine learning (ML) method, Label Propagation, to develop a pipeline that increases the number of labels available for classification. In this work, we apply several techniques in succession and in parallel to give higher confidence in the enlarged labeled dataset that is then used to train a traditional ML classifier. We find promising results on a test sample of data, achieving good precision and recall values for both the propagated labels and the classifiers trained on these labels. We show that by combining several techniques and averaging their results, we can increase the usability of a dataset with limited labeled training data, a common problem when using ML in real-world situations. In future work, we will investigate how this method can be scaled up for use in future CPIH calculations and the challenges this brings.
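To make the core idea concrete, the following is a minimal sketch of label propagation on toy data. It is not the ONS pipeline, which combines several techniques and averages their results. The function name `propagate_labels`, the two-dimensional points standing in for product features, the RBF similarity, and the `gamma` value are all assumptions chosen for illustration. Labels spread from the few labeled points to their unlabeled neighbors through a similarity graph, and each propagated label comes with a confidence score that could be used to filter uncertain cases.

```python
import math

def propagate_labels(points, labels, n_classes, gamma=1.0, max_iter=1000, tol=1e-6):
    """Toy label propagation. labels[i] is a class index, or None if unlabeled."""
    n = len(points)
    # RBF similarity between every pair of points (fully connected graph).
    weights = [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
                for q in points] for p in points]
    # Row-normalise similarities into transition probabilities.
    trans = [[w / sum(row) for w in row] for row in weights]

    def clamp(dist):
        # Reset manually labeled points to their known class after each step.
        for i, lab in enumerate(labels):
            if lab is not None:
                dist[i] = [1.0 if c == lab else 0.0 for c in range(n_classes)]
        return dist

    # Unlabeled points start from a uniform class distribution.
    dist = clamp([[1.0 / n_classes] * n_classes for _ in range(n)])
    for _ in range(max_iter):
        new = clamp([[sum(trans[i][j] * dist[j][c] for j in range(n))
                      for c in range(n_classes)] for i in range(n)])
        change = max(abs(new[i][c] - dist[i][c])
                     for i in range(n) for c in range(n_classes))
        dist = new
        if change < tol:
            break
    predicted = [max(range(n_classes), key=lambda c: row[c]) for row in dist]
    confidence = [max(row) for row in dist]
    return predicted, confidence

# Hypothetical features: two well-separated groups, one labeled item in each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels = [0, None, None, 1, None, None]
predicted, confidence = propagate_labels(points, labels, n_classes=2)
print(predicted, [round(c, 3) for c in confidence])
```

In practice, a library implementation such as scikit-learn's `LabelPropagation` would replace this hand-rolled loop, and the confidence scores would typically be thresholded before the propagated labels are used to train a classifier.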

Highlights

  • Research into alternative data sources for consumer price statistics is a key recommendation in the Johnson Review into UK consumer price statistics (Johnson, 2015), and the Independent Review of UK Economic Statistics (Bean, 2016)

  • Alternative data sources have the potential to improve the quality of these consumer price statistics through increased coverage, high frequency of collection, as well as potential cost savings

  • We have developed a pipeline through which it is possible to generate a large enough labeled dataset with which to train a traditional classifier using a very limited set of manually labeled products


Summary

Introduction

Research into alternative data sources for consumer price statistics is a key recommendation in the Johnson Review into UK consumer price statistics (Johnson, 2015), and the Independent Review of UK Economic Statistics (Bean, 2016). There are two sources of alternative data that we are interested in: web-scraped data and point of sale transaction data. Both cover a much wider range of products, and in much larger quantities, than is possible with manual price collection. In November 2018, the web-scraped data contained around 900,000 price quotes for clothing products. This compares to approximately 20,000 price quotes from the manual collection, a sample size increase of nearly 35 times. ONS is currently undergoing transformation across many areas of its statistics, including identifying new data sources and improving its methods. As part of this transformation, we are updating the way we collect price information to reflect our changing economy and produce more timely and granular inflation statistics for businesses, individuals, and government.
