Proportion-based normalizations outperform compositional data transformations in machine learning applications

Aaron Yerke, Daisy Fry Brumit, Anthony A Fodor

doi:10.1186/s40168-023-01747-z

Open Access

https://doi.org/10.1186/s40168-023-01747-z

Journal: Microbiome
Publication Date: Mar 5, 2024
Citations: 2
License type: CC BY 4.0

Abstract

Background

Normalization, as a pre-processing step, can significantly affect the resolution of machine learning analysis for microbiome studies. There are countless options for normalization scheme selection. In this study, we examined compositionally aware algorithms including the additive log ratio (alr), the centered log ratio (clr), and a recent evolution of the isometric log ratio (ilr) in the form of balance trees made with the PhILR R package. We also looked at compositionally naïve transformations such as raw counts tables and several transformations that are based on relative abundance, such as proportions, the Hellinger transformation, and a transformation based on the logarithm of proportions (which we call “lognorm”).

Results

In our evaluation, we used 65 metadata variables culled from four publicly available datasets at the amplicon sequence variant (ASV) level with a random forest machine learning algorithm. We found that different common pre-processing steps in the creation of the balance trees made very little difference in overall performance. Overall, we found that the compositionally aware data transformations such as alr, clr, and ilr (PhILR) performed generally slightly worse or only as well as compositionally naïve transformations. However, relative abundance-based transformations outperformed most other transformations by a small but reliably statistically significant margin.

Conclusions

Our results suggest that minimizing the complexity of transformations while correcting for read depth may be a generally preferable strategy in preparing data for machine learning compared to more sophisticated, but more complex, transformations that attempt to better correct for compositionality.
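
For readers unfamiliar with the transformations named in the abstract, the following is a minimal Python sketch of their textbook definitions, applied to a samples-by-taxa count table. It is not the authors' code. The function names, the `pseudocount` of 1 used to avoid taking the logarithm of zero, and the choice of the last column as the alr reference are all assumptions made for illustration. The abstract does not state how zeros were handled or give the exact formula for "lognorm", so that transformation is omitted, as is ilr/PhILR, which requires a phylogenetic tree.

```python
import numpy as np


def proportions(counts):
    """Relative abundance: divide each sample (row) by its total read count.

    Assumes every sample has at least one read, so no row sums to zero.
    """
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def hellinger(counts):
    """Hellinger transformation: square root of the relative abundances."""
    return np.sqrt(proportions(counts))


def clr(counts, pseudocount=1.0):
    """Centered log ratio: log counts minus the per-sample mean log count.

    The pseudocount is an illustrative assumption for handling zeros.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def alr(counts, ref=-1, pseudocount=1.0):
    """Additive log ratio: log of each taxon relative to a reference taxon.

    The reference column (default: last) is dropped from the output, so the
    result has one fewer column than the input. Both the reference choice
    and the pseudocount are illustrative assumptions.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    ratios = logged - logged[:, [ref]]
    return np.delete(ratios, ref, axis=1)


# Hypothetical toy table: 2 samples (rows) by 3 taxa (columns).
example_counts = np.array([[10, 20, 70],
                           [5, 0, 45]])

example_props = proportions(example_counts)  # each row sums to 1
example_clr = clr(example_counts)            # each row sums to ~0
example_alr = alr(example_counts)            # shape (2, 2)
```

Any of these outputs could then be passed as the feature matrix to a random forest implementation. The proportion-based versions correct for read depth with a single division, which is the kind of low-complexity transformation the conclusions favor.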
