PHDD: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic

Rana Saniei, Víctor Rodríguez Doncel

doi:10.1007/s42979-022-01097-x

Open Access

https://doi.org/10.1007/s42979-022-01097-x

Journal: SN Computer Science
Publication Date: Apr 6, 2022
Citations: 3
License type: open-access

Affiliation: Universidad Politécnica de Madrid

Abstract

Health-related information is considered ‘highly sensitive’ by the European General Data Protection Regulation (GDPR), and determining whether a text document contains health-related information is of interest to both individuals and companies in a number of scenarios. Although some efforts have been made to detect different categories of personal data in text, including health information, automatic classification remains challenging. In this work, we contribute to addressing this challenge by building a corpus of tweets shared in the context of the current COVID-19 pandemic. The corpus, called PHDD (Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic), contains 1,494 tweets that have been manually tagged by three taggers along three dimensions: health-sensitivity status, categories of health information, and subject of health history. Furthermore, a lightweight ontology called PTHI (Privacy Tags for Health Information), which reuses two other vocabularies, namely hl7 and dpv, is built to represent the corpus in a machine-readable format. The corpus is publicly available and can be used by NLP experts to implement techniques for detecting sensitive health information in textual documents.

Full Text