Carotta: Revealing Hidden Confounder Markers in Metabolic Breath Profiles.

Anne-Christin Hauschild, Tobias Frisch, Jörg Ingo Baumbach, Jan Baumbach

doi:10.3390/metabo5020344

Abstract

Computational breath analysis is a growing research area that aims to identify volatile organic compounds (VOCs) in human breath to assist next-generation medical diagnostics. Inexpensive and non-invasive bioanalytical technologies for metabolite detection in exhaled air and bacterial/fungal vapor exist, and the first studies on the power of supervised machine learning methods for profiling the resulting data have been conducted. However, we lack methods to extract hidden data features emerging from confounding factors. Here, we present Carotta, a new cluster analysis framework dedicated to uncovering such hidden substructures using unsupervised statistical learning methods. We study the power of transitivity clustering and hierarchical clustering to identify groups of VOCs with similar expression behavior over most patient breath samples and/or groups of patients with a similar VOC intensity pattern. This enables the discovery of dependencies between metabolites. On the one hand, this allows us to eliminate the effect of potential confounding factors that hinder disease classification, such as smoking. On the other hand, we may also identify VOCs associated with disease subtypes or concomitant diseases. Carotta is open-source software with an intuitive graphical user interface that supports data handling, analysis and visualization. The back-end is designed to be modular, allowing for easy extension with plugins in the future, such as new clustering methods and statistics. The software requires little prior knowledge or technical skill to operate. We demonstrate its power and applicability by means of one artificial dataset. We also apply Carotta to a real-world example dataset on chronic obstructive pulmonary disease (COPD). While the artificial data serve as a proof of concept, we demonstrate how Carotta finds candidate markers in our real dataset that are associated with confounders rather than the primary disease (COPD) and bronchial carcinoma (BC). Carotta is publicly available at http://carotta.compbio.sdu.dk [1].

Highlights

In the last decade, the field of breathomics, defined as the metabolomics study of human exhaled air, has grown tremendously.
We identified 120 volatile organic compounds present in at least three of the patients’ measurements.
The flexible back-end design supports easy extension with plugins in the future, such as new clustering methods and statistics. Carotta guides the user through four steps: (1) similarity matrix computation; (2) clustering; (3) clustering evaluation; and (4) results visualization and interpretation.
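
The four steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Carotta's implementation: Pearson correlation stands in for the similarity function, a simple threshold-based grouping (connected components of the similarity graph) stands in for transitivity or hierarchical clustering, and the function names, the threshold of 0.9 and the VOC intensities are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long intensity vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def similarity_matrix(profiles):
    """Step 1: pairwise similarity between VOC intensity profiles."""
    return {(a, b): pearson(profiles[a], profiles[b])
            for a, b in combinations(list(profiles), 2)}


def threshold_clusters(names, sim, threshold):
    """Step 2 (stand-in): group VOCs linked by similarity >= threshold."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def evaluate(clusters, sim):
    """Step 3: mean within-cluster minus mean between-cluster similarity."""
    label = {n: i for i, c in enumerate(clusters) for n in c}
    within = [s for (a, b), s in sim.items() if label[a] == label[b]]
    between = [s for (a, b), s in sim.items() if label[a] != label[b]]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within) - mean(between)


# Made-up intensities of four hypothetical VOC peaks over five breath samples.
profiles = {
    "voc_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "voc_b": [2.1, 4.0, 6.2, 7.9, 10.0],
    "voc_c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "voc_d": [9.8, 8.1, 6.0, 4.2, 2.0],
}

sim = similarity_matrix(profiles)
clusters = threshold_clusters(list(profiles), sim, threshold=0.9)
score = evaluate(clusters, sim)

# Step 4: a plain-text stand-in for visualization and interpretation.
for i, members in enumerate(clusters):
    print(f"cluster {i}: {', '.join(members)}")
print(f"separation score: {score:.3f}")
```

In this toy example, VOCs whose intensities rise together across samples end up in one group and those that fall together in another. In a real study, such co-varying groups would be the candidates to check against known confounders such as smoking.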

Summary

Introduction

The field of breathomics, defined as the metabolomics study of human exhaled air, has grown tremendously. One of its major goals is to non-invasively “sniff” biomarker molecules that are predictive of the biomedical fate of individual patients. These so-called personalized medicine (or precision medicine) approaches hold great promise for moving the therapeutic window to earlier stages of disease progression. Analytical technologies that overcome the obstacles of exhaled air analysis, such as humidity and variability, exist. The computational methods, however, especially those for advanced statistical breathomics analysis, are still in their infancy. To pave the way for this technology towards daily use in medical practice, these computational challenges remain to be addressed.

Methods

Results

Conclusion