Recordings from daily life contain a mix of sound types such as speech, coughing sounds, breathing sounds, and environmental noise, which provide a wealth of information related to linguistic messages, acoustic scenes, and health. In their unprocessed form, these sounds overlap in both the time and frequency domains, posing challenges for current information extraction and analysis methods. To address this challenge, we introduce a novel multiclass sound event detection system that discriminates among speech, coughs, breathing, and other miscellaneous sounds (e.g., dogs barking, toilets flushing, babies crying) in the Coswara database, a crowdsourced database collected through a web application launched in response to COVID-19. The method extracts a feature set that includes Zero Crossing Rates, Short-Term Energy, and Mel-Frequency Cepstral Coefficients and uses Random Forests for multiclass classification. Preliminary results show a balanced accuracy of 87.5% in discriminating among speech, coughs, breathing, and other miscellaneous sounds. Building on the multiclass sound event detection system, a time- and acoustics-mediated forced-alignment technique is employed to discriminate complex sounds in real time. We envision that the system could be deployed on devices alongside existing information extraction methods for monitoring respiratory diseases such as pneumonia, pertussis, and COVID-19.
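As a rough illustration of the pipeline the abstract describes, the sketch below computes Zero Crossing Rates, Short-Term Energy, and MFCCs per frame, pools them into one vector per clip, and trains a Random Forest classifier. This is a minimal sketch under stated assumptions, not the authors' implementation: the use of librosa and scikit-learn, the frame sizes, n_mfcc=13, the mean/std pooling, and the placeholder file lists and labels are all illustrative choices not reported in the abstract.

```python
# Minimal sketch of a ZCR + Short-Term Energy + MFCC feature pipeline
# feeding a Random Forest. Assumed, not from the paper: librosa and
# scikit-learn, frame settings, pooling statistics, and file/label lists.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

FRAME_LEN, HOP_LEN = 2048, 512  # illustrative analysis frame settings

def extract_features(path):
    """Summarize one clip with ZCR, Short-Term Energy, and MFCC statistics."""
    y, sr = librosa.load(path, sr=None)
    # Zero Crossing Rate per analysis frame
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=FRAME_LEN,
                                             hop_length=HOP_LEN)
    # Short-Term Energy: sum of squared samples within each frame
    frames = librosa.util.frame(y, frame_length=FRAME_LEN, hop_length=HOP_LEN)
    ste = np.sum(frames ** 2, axis=0, keepdims=True)
    # Mel-Frequency Cepstral Coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=FRAME_LEN, hop_length=HOP_LEN)
    # Align frame counts (padding conventions differ) and stack feature rows
    n = min(zcr.shape[1], ste.shape[1], mfcc.shape[1])
    feats = np.vstack([zcr[:, :n], ste[:, :n], mfcc[:, :n]])
    # Pool frame-level features into one fixed-length vector per clip
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

# Hypothetical clip lists and labels: 0 = speech, 1 = cough,
# 2 = breathing, 3 = other; substitute real Coswara recordings.
train_paths, y_train = ["clip_000.wav", "clip_001.wav"], [0, 1]
test_paths, y_test = ["clip_002.wav"], [0]

X_train = np.array([extract_features(p) for p in train_paths])
X_test = np.array([extract_features(p) for p in test_paths])

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
print("Balanced accuracy:",
      balanced_accuracy_score(y_test, clf.predict(X_test)))
```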