Health-related hypothesis generation using social media data

Jon Parker, Andrew Yates, Ophir Frieder, Nazli Goharian

doi:10.1007/s13278-014-0239-8

Abstract

Traditional public health surveillance, also known as syndromic surveillance, is expensive and burdensome because it relies on clinical reports that health professionals must spend considerable time and effort to author. Because of its prohibitive cost, syndromic surveillance is typically performed only for high-risk concerns such as influenza. A health surveillance system that works for numerous health concerns simultaneously would therefore be of great practical use. We present a framework that processes a stream of time-stamped social media messages. The framework produces “interest curves” that permit the generation of hypotheses regarding which health-related conditions or topics may be increasing in prevalence. We do not claim to detect an actual outbreak of a health-related condition, because the framework has access only to social media messages and not to a harder data source such as patient records. This approach differs from prior approaches in that it is not customized to detect one particular illness (e.g., influenza). The inner workings of the framework can be interpreted as a transformation that converts a signal deeply embedded in the “stream of raw tweets” domain into a signal in the “health-related topics” domain. The framework’s capability is demonstrated by examining multiple interest curves related to seasonal influenza and allergies.

Full Text