Abstract

Twitter, and social media as a whole, have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions. This makes them susceptible to false positives, because keyword volume can be influenced by several social phenomena that may be unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web. In addition, we offer a systematic comparison of the impact of different input representations on performance. Specifically, we compare continuous representations against one-hot encoding for word-based units, class-based (ontology-based) units and subword units in the form of byte pair encodings. We also establish the desirable performance characteristics for multilingual semantic filtering approaches and offer an in-depth discussion of the implications for end-to-end surveillance.

Highlights

  • Disease surveillance methods based on Twitter typically count the volume of messages about a given disease topic, matched via keywords such as the disease name, as an indicator of actual disease activity [1,2,3]

  • The results are promising for languages closely related to the model language

  • The performance is significantly stronger on languages in the Indo-European language family to which English, the model language, belongs


Summary

Introduction

Disease surveillance methods based on Twitter typically count the volume of messages about a given disease topic, matched via keywords such as the disease name, as an indicator of actual disease activity [1,2,3]. It is important to incorporate the semantic orientation of tweets to discriminate between relevant and irrelevant mentions of given keywords: in many cases, even messages that explicitly mention diseases do so in contexts unrelated to disease occurrence or in contexts that are spatio-temporally irrelevant. There is some experimental evidence to suggest that incorporating the semantic orientation of tweets improves the end-to-end performance of prediction models for applications like nowcasting [6, 7]. In spite of this, we are not currently aware of any large-scale automated surveillance systems actively using semantic filtering techniques to classify messages.
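The difference between raw keyword volume and semantically filtered volume can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the example messages, the keyword set, and in particular `toy_is_relevant`, which is a crude cue-phrase rule standing in for a trained classifier and is not the method evaluated in the paper.

```python
from collections import Counter

# Hypothetical example messages as (day, text) pairs; not data from the paper.
MESSAGES = [
    ("2020-01-01", "I have the flu and a fever, staying in bed"),
    ("2020-01-01", "Bieber fever is spreading, flu who?"),
    ("2020-01-02", "Got my flu shot reminder from the clinic"),
    ("2020-01-02", "My son came down with flu today"),
]

# Assumed keyword list for a single disease topic.
KEYWORDS = {"flu", "influenza"}


def mentions_keyword(text):
    """Plain keyword matching: true if any keyword appears as a token."""
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return bool(tokens & KEYWORDS)


def keyword_volume(messages):
    """Daily count of every message that mentions a keyword."""
    return Counter(day for day, text in messages if mentions_keyword(text))


def filtered_volume(messages, is_relevant):
    """Daily count restricted to keyword mentions the filter accepts."""
    return Counter(
        day
        for day, text in messages
        if mentions_keyword(text) and is_relevant(text)
    )


def toy_is_relevant(text):
    """Placeholder for a trained relevance classifier (cue-phrase rule)."""
    cues = ("i have", "came down with")
    lowered = text.lower()
    return any(cue in lowered for cue in cues)


raw = keyword_volume(MESSAGES)
filtered = filtered_volume(MESSAGES, toy_is_relevant)
```

In this toy data the raw count treats a pop-culture joke and a vaccination reminder as disease activity, while the filtered count keeps only the two reports of actual illness. A real system would replace `toy_is_relevant` with a learned model.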
