Abstract
A common approach to analyze a covariate-sample count matrix, an element of which represents how many times a covariate appears in a sample, is to factorize it under the Poisson likelihood. We show its limitation in capturing the tendency for a covariate present in a sample to both repeat itself and excite related ones. To address this limitation, we construct negative binomial factor analysis (NBFA) to factorize the matrix under the negative binomial likelihood, and relate it to a Dirichlet-multinomial distribution based mixed-membership model. To support countably infinite factors, we propose the hierarchical gamma-negative binomial process. By exploiting newly proved connections between discrete distributions, we construct two blocked and a collapsed Gibbs sampler that all adaptively truncate their number of factors, and demonstrate that the blocked Gibbs sampler developed under a compound Poisson representation converges fast and has low computational complexity. Example results show that NBFA has a distinct mechanism in adjusting its number of inferred factors according to the sample lengths, and provides clear advantages in parsimonious representation, predictive power, and computational complexity over previously proposed discrete latent variable models, which either completely ignore burstiness, or model only the burstiness of the covariates but not that of the factors.
Highlights
The need to analyze a covariate-sample count matrix, each of whose elements counts the number of time that a covariate appears in a sample, arises in many different settings, such as text analysis, next-generation sequencing, medical records mining, and consumer behavior studies
As Poisson factor analysis (PFA) is closely related to the canonical mixed-membership model built on the multinomial distribution, we show that negative binomial factor analysis (NBFA) is closely related to a Dirichlet-multinomial mixedmembership (DMMM) model that uses the Dirichlet-categorical (Dirichlet-multinomial) rather than categorical distributions to assign an index to both a covariate and a factor
To support countably infinite factors for NBFA, generalizing the gamma-negative binomial process (GNBP) (Zhou and Carin, 2015; Zhou et al, 2016b), we introduce a new nonparametric Bayesian prior: the hierarchical gamma-negative binomial process, where each of the J samples is assigned with a sample-specific GNBP and a globally shared gamma process is mixed with all the J GNBPs
Summary
The need to analyze a covariate-sample count matrix, each of whose elements counts the number of time that a covariate appears in a sample, arises in many different settings, such as text analysis, next-generation sequencing, medical records mining, and consumer behavior studies. Without capturing the self- and cross-excitation (stimulation) of covariate frequencies or better modeling the overdispersed covariate-sample counts, the ultimate potential of the mixed-membership model and PFA will be limited no matter how the priors on latent parameters are adjusted. It could be a waste of computation if the model tries to increase the model capacity to better capture overdispersions that could be explained with self- and cross-excitations To remove these restrictions, we introduce negative binomial factor analysis (NBFA) to factorize the covariate-sample count matrix, in which we replace the Poisson distributions on which PFA is built, with the negative binomial (NB) distributions. The proofs and Gibbs sampling update equations are provided in the Supplementary Material (Zhou, 2017)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.