Abstract

Estimating disease prevalence from online search engine data (e.g., Google Flu Trends (GFT)) has received considerable scholarly and public attention in recent years. While the utility of search engine data for disease surveillance has been demonstrated, the scientific community still seeks ways to identify and reduce the biases embedded in such data. The primary goal of this study is to explore new ways of improving the accuracy of disease prevalence estimates by combining traditional disease data with search engine data. A novel method, Biased Sentinel Hospital-based Area Disease Estimation (B-SHADE), is introduced to reduce search engine data bias from a geographical perspective. We tested the approach on search trends for Hand, Foot and Mouth Disease (HFMD) in Guangdong Province, China, using 11 keywords selected from the Baidu index platform, a Chinese big data analytics service similar to GFT. The correlation between the number of real cases and the composite index was 0.8. After decomposing the composite index at the city level, we found that only 10 cities presented a correlation close to 0.8 or higher. These cities were more stable with respect to search volume, and they were selected as sample cities for estimating the search volume of the entire province. After the estimation, the correlation improved from 0.8 to 0.864. When the revised search volume was fitted with historical cases, the mean absolute error was 11.19% lower than when the original search volume was fitted with historical cases. To our knowledge, this is the first study to reduce search engine data bias through the use of rigorous spatial sampling strategies.
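The city-screening step described above can be illustrated with a minimal sketch. It is not the B-SHADE estimator itself, whose weighting scheme is not given in this abstract. It only shows how per-city Pearson correlations between a search index and case counts might be computed and thresholded, and how a mean absolute error could be measured. The function names, the toy series, and the choice to compare every city against one shared case series are assumptions made for illustration; the 0.8 cut-off echoes the abstract's "close to 0.8 or higher".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def select_stable_cities(city_index, cases, threshold=0.8):
    """Return, in sorted order, the cities whose index correlates with cases.

    Assumption: every city's search index is compared against the same
    case series; the original study may pair the series differently.
    """
    return sorted(
        city
        for city, series in city_index.items()
        if pearson(series, cases) >= threshold
    )


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and observations."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy data, invented for illustration only (not from the study).
cases = [10, 20, 30, 40, 50]
city_index = {
    "city_a": [1, 2, 3, 4, 5],  # tracks the case series closely
    "city_b": [5, 1, 4, 2, 3],  # noisy, weakly related to cases
    "city_c": [2, 3, 3, 5, 6],  # tracks the case series reasonably well
}
stable = select_stable_cities(city_index, cases)
```

With these toy series, `city_b` falls below the threshold and is excluded, which mirrors the idea of keeping only cities whose search volume is a stable proxy for case counts.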

Highlights

  • Search engine data analytics tools (e.g., Google Flu Trends and other products built on search engine query data) have made it easier to track disease-related trends [1, 2]

  • Using search data for Guangdong Province in China, this article illustrates the utility of rigorous spatial sampling techniques for reducing search engine data bias levels

  • The cities in Guangdong Province fall into two categories: cities located in the Pearl River Delta (PRD) region, which are more developed, and cities located outside of the PRD (OutPRD), which are less developed


Summary

Introduction

Search engine data analytics tools (e.g., Google Flu Trends and other products built on search engine query data) have made it easier to track disease-related trends [1, 2]. The accuracy of disease tracking based on search engine data is affected by Internet use trends, by external interference from the media and from government policies, and by the frequent algorithm updates made by search engine companies [1, 3, 4]. Such problems have manifested in Google Flu Trends, which missed the first wave of the influenza A/H1N1 pandemic in 2009 and overestimated peak flu levels during the 2012/2013 season [2, 4]. Combining search engine data with conventional data sources has been shown to increase the accuracy of disease predictions [3, 6, 10].

