Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health.

Paul Fogel, Fajwel Fogel, George Luta, Douglas Hawkins, S Young, Yann Gaston-Mathé

doi:10.3390/ijerph13050509

Abstract

Data can often be represented as a matrix, e.g., with observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data are non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is focused on the large values. If the data are normalized by subtracting the row/column means, they take on mixed signs and the original NMF cannot be used. Our idea is to split the matrix into its positive and negative parts, take the absolute value of the negative elements, and then concatenate the two parts. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Except in situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.
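The split-and-concatenate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `posneg_split` and `nmf_multiplicative`, the choice to center by column means, the column-wise concatenation, and the toy counts are all assumptions made for illustration, and the solver is a generic multiplicative-update NMF swapped in as a stand-in.

```python
import numpy as np


def posneg_split(X):
    """Center X by its column means, then concatenate the positive part and
    the absolute value of the negative part side by side.

    Assumption: column centering and column-wise concatenation; the abstract
    only says "row/column means" and "concatenate".
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0, keepdims=True)
    pos = np.maximum(centered, 0.0)   # entries above the column mean
    neg = np.maximum(-centered, 0.0)  # |entries below the column mean|
    return np.hstack([pos, neg])      # non-negative, twice as many columns


def nmf_multiplicative(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Generic NMF, V ~ W @ H, via multiplicative updates (Frobenius loss).

    A stand-in solver, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy data: 6 communities (rows) x 3 admission causes (columns).
counts = np.array([
    [30, 10, 20],
    [28, 12, 21],
    [31,  9, 19],
    [10, 30, 20],
    [12, 29, 22],
    [ 9, 31, 18],
], dtype=float)

V = posneg_split(counts)
W, H = nmf_multiplicative(V, rank=2)

# Assign each community to the component with the largest loading.
labels = W.argmax(axis=1)
print(labels)
```

Because the positive and negative parts occupy separate columns, an unusually low count contributes a large non-negative entry just as an unusually high count does, which is how the concatenated matrix gives low values the same weight as high ones.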

Highlights

Let us consider the number of emergency hospital admissions in several US communities for specific diseases, e.g., cardiovascular disease (CVD), myocardial infarction (MI), and congestive heart failure (CHF)
We showed that the heatmap of the reordered rows and columns of a matrix, when properly normalized, can add insight to the Singular Value Decomposition (SVD) clustering produced by Correspondence Analysis (CA), in particular with respect to the interpretation of the biplot axes
We showed that PosNegNMF clustering returns more homogeneous clusters, in contrast to affine Non-negative Matrix Factorization (NMF) clustering

Summary

Introduction

Let us consider the number of emergency hospital admissions in several US communities for specific diseases, e.g., cardiovascular disease (CVD), myocardial infarction (MI), and congestive heart failure (CHF). A pattern of admission causes may be characterized by unusually high and/or low counts for some of the possible causes. A specific community may have a high similarity with a particular pattern, e.g., high CVD and low MI, a somewhat lower similarity with the opposite pattern, e.g., low CVD and high MI, and negligible similarities with other patterns. An admission pattern can be thought of as the pattern of an archetypal community, in which all admission causes have average count levels, except for the ones that are unusually high/low and characterize the pattern itself. What if a community's pattern is a mixture of patterns of archetypal communities, rather than being similar to one specific pattern?

Public Health 2016, 13, 509; doi:10.3390/ijerph13050509 www.mdpi.com/journal/ijerph

Methods

Discussion

Conclusion