The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms

Lara Orlandic,Tomas Teijeiro,David Atienza

doi:10.1038/s41597-021-00937-4


Open Access

https://doi.org/10.1038/s41597-021-00937-4


Journal: Scientific Data	Publication Date: Jun 23, 2021
Citations: 158	License type: open-access

Affiliation: École Polytechnique Fédérale de Lausanne

Abstract

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 25,000 crowdsourced cough recordings representing a wide range of participant ages, genders, geographic locations, and COVID-19 statuses. First, we contribute our open-sourced cough detection algorithm to the research community to assist in data robustness assessment. Second, four experienced physicians labeled more than 2,800 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

Highlights

The novel coronavirus disease (COVID-19), declared a pandemic by the World Health Organization on March 11, 2020, has claimed over 2.5 million lives worldwide as of March 2021¹
We present the COUGHVID crowdsourcing dataset, which is an extensive, publicly-available dataset of cough recordings
To allow users of the COUGHVID dataset to quickly exclude non-cough sounds from their analyses, we developed a classifier to determine the degree of certainty to which a …

Summary

Background & Summary

The novel coronavirus disease (COVID-19), declared a pandemic by the World Health Organization on March 11, 2020, has claimed over 2.5 million lives worldwide as of March 2021¹. In addition to publicly providing most of our cough corpus, we have trained and open-sourced a cough detection ML model to filter non-cough recordings from the database. This automated cough detection tool assists developers in creating robust applications that automatically remove non-cough sounds from their databases. The COUGHVID dataset publicly contributes over 2,800 expert-labeled coughs, each of which provides a diagnosis, a severity level, and an indication of whether audible health anomalies such as dyspnea, wheezing, and nasal congestion are present. Using these expert labels along with participant metadata, our dataset can be used to train models that detect a variety of participants’ information based on their cough sounds. The first step to building robust AI algorithms for the detection of COVID-19 from cough sounds is having an extensive dataset, and the COUGHVID dataset effectively meets this pressing global need.
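As an illustration of how a per-recording detector confidence might be used to exclude non-cough sounds, the following is a minimal sketch, not the authors' implementation. The `Recording` type, the `cough_score` field, and the 0.8 threshold are all hypothetical names and values chosen for this example; the actual field names and a suitable cutoff should be taken from the dataset's own documentation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Recording:
    """Hypothetical record: an identifier plus a detector confidence in [0, 1]."""
    recording_id: str
    cough_score: float  # assumed: certainty that the audio contains a cough


def keep_likely_coughs(
    recordings: Iterable[Recording], threshold: float = 0.8
) -> List[Recording]:
    """Return the recordings whose confidence is at least `threshold`.

    The default of 0.8 is an arbitrary illustrative cutoff, not a value
    recommended by the source.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [r for r in recordings if r.cough_score >= threshold]


# Made-up sample data, for demonstration only.
sample = [
    Recording("rec-a", 0.97),
    Recording("rec-b", 0.12),
    Recording("rec-c", 0.80),
]

kept = keep_likely_coughs(sample)
print([r.recording_id for r in kept])
```

Raising the threshold trades recall for purity: fewer non-cough sounds slip through, but more genuine coughs are discarded, so the cutoff is best chosen with the downstream task in mind.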

Methods

Findings

Audible dyspnea