Enhanced modeling approaches for count data analysis with focus on substance use outcomes.
The selection of appropriate statistical models is essential for accurately interpreting the analysis of count data, especially in behavioral medicine. Traditionally, Poisson and Negative Binomial models have been commonly employed, but they may not always be optimal, particularly for data with an abundance of zeroes, which can be effectively modeled using zero-inflated and zero-altered (hurdle) models. Additionally, U-shaped distributions, where the data cluster at both ends (low and high counts) with fewer occurrences in the middle, cannot be adequately captured by traditional approaches and further complicate the analysis. This paper critically examines the widespread use of zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models in the context of adolescent substance use data, identifying their potential limitations. Using a dataset from a smoking study of 1263 adolescents who reported smoking behavior across eight waves, we analyzed the sparse count outcome "Days Smoked in the Past Month," with covariates such as sex, age, and GPA recorded at each wave. We evaluated the smoking behavior count outcomes comprehensively, employing model identification via the Kolmogorov-Smirnov (KS) test, validation through confirmation studies, and regression analysis guided by the Akaike Information Criterion (AIC). The range of models covered includes: ZIP, Poisson hurdle (PH), ZINB, negative binomial hurdle (NBH), zero-inflated negative binomial with fixed r (ZINB-r), negative binomial hurdle with fixed r (NBH-r), zero-inflated beta-binomial (ZIBB), beta-binomial hurdle (BBH), zero-inflated beta-binomial with fixed n (ZIBB-n), beta-binomial hurdle with fixed n (BBH-n), zero-inflated beta-binomial with fixed alpha and beta (ZIBB-ab), beta-binomial hurdle with fixed alpha and beta (BBH-ab), zero-inflated beta negative binomial (ZIBNB), and beta negative binomial hurdle (BNBH).
Our study demonstrates the superior model fitting and regression analysis capabilities of the ZIBB and BBH models. Notably, our findings reveal the effectiveness of the ZIBB model in capturing the U-shaped distribution observed in real-world data. This underscores the importance of exploring a wider range of models beyond ZIP and ZINB for count data analysis. This study advocates for the broader application of these more sophisticated models in behavioral medicine, with the goal of enhancing the accuracy and reliability of research outcomes.
- Research Article
14
- 10.1007/s11676-015-0176-z
- Nov 27, 2015
- Journal of Forestry Research
The occurrence of lightning-induced forest fires during a time period is count data featuring over-dispersion (i.e., variance is larger than mean) and a high frequency of zero counts. In this study, we used six generalized linear models to examine the relationship between the occurrence of lightning-induced forest fires and meteorological factors in the Northern Daxing’an Mountains of China. The six models included Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), Poisson hurdle (PH), and negative binomial hurdle (NBH) models. Goodness-of-fit was compared and tested among the six models using Akaike information criterion (AIC), sum of squared errors, likelihood ratio test, and Vuong test. The predictive performance of the models was assessed and compared using independent validation data by the data-splitting method. Based on the model AIC, the ZINB model best fitted the fire occurrence data, followed by (in order of smaller AIC) NBH, ZIP, NB, PH, and Poisson models. The ZINB model was also best for predicting either zero counts or positive counts (≥1). The two Hurdle models (PH and NBH) were better than ZIP, Poisson, and NB models for predicting positive counts, but worse than these three models for predicting zero counts. Thus, the ZINB model was the first choice for modeling the occurrence of lightning-induced forest fires in this study, which implied that the excessive zero counts of lightning-induced fires came from both structure and sampling zeros.
- Research Article
219
- 10.1080/10543400600719384
- Aug 1, 2006
- Journal of Biopharmaceutical Statistics
We compared several modeling strategies for vaccine adverse event count data in which the data are characterized by excess zeroes and heteroskedasticity. Count data are routinely modeled using Poisson and Negative Binomial (NB) regression, but zero-inflated and hurdle models may be advantageous in this setting. Here we compared the fit of the Poisson, Negative Binomial (NB), zero-inflated Poisson (ZIP), zero-inflated Negative Binomial (ZINB), Poisson Hurdle (PH), and Negative Binomial Hurdle (NBH) models. In general, for public health studies, we may conceptualize zero-inflated models as allowing zeroes to arise from at-risk and not-at-risk populations. In contrast, hurdle models may be conceptualized as having zeroes only from an at-risk population. Our results illustrate, for our data, that the ZINB and NBH models are preferred but these models are indistinguishable with respect to fit. Choosing between the zero-inflated and hurdle modeling framework, assuming Poisson and NB models are inadequate because of excess zeroes, should generally be based on the study design and purpose. If the study's purpose is inference, then the modeling framework should be considered. For example, if the study design leads to count endpoints with both structural and sample zeroes, then generally the zero-inflated modeling framework is more appropriate, while in contrast, if the endpoint of interest, by design, only exhibits sample zeroes (e.g., at-risk participants), then the hurdle model framework is generally preferred. Conversely, if the study's primary purpose is to develop a prediction model, then both the zero-inflated and hurdle modeling frameworks should be adequate.
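The distinction drawn in this abstract can be made concrete with the standard Poisson versions of the two frameworks. The sketch below is not taken from the paper; the symbols $\pi$ (probability of a structural zero), $\pi_0$ (probability of not crossing the hurdle), and $\lambda$ (Poisson mean) are notation assumed here for illustration.

```latex
% Zero-inflated Poisson: zeroes come from two sources
% (structural zeroes with probability \pi, plus Poisson sampling zeroes).
\begin{aligned}
P_{\mathrm{ZIP}}(Y = 0) &= \pi + (1 - \pi)\, e^{-\lambda}, \\
P_{\mathrm{ZIP}}(Y = y) &= (1 - \pi)\, \frac{e^{-\lambda} \lambda^{y}}{y!}, \qquad y = 1, 2, \dots
\end{aligned}

% Poisson hurdle: all zeroes come from a single binary process,
% and positive counts follow a zero-truncated Poisson.
\begin{aligned}
P_{\mathrm{PH}}(Y = 0) &= \pi_0, \\
P_{\mathrm{PH}}(Y = y) &= (1 - \pi_0)\, \frac{e^{-\lambda} \lambda^{y}}{y!\,\bigl(1 - e^{-\lambda}\bigr)}, \qquad y = 1, 2, \dots
\end{aligned}
```

In the zero-inflated form the count component can itself produce zeroes, which matches the "structural plus sample zeroes" reading; in the hurdle form the count component is truncated at zero, so every zero is generated by the same process.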
- Research Article
3
- 10.24920/j1001-9294.2017.054
- Jan 1, 2017
- Chinese Medical Sciences Journal
Study of Zero-Inflated Regression Models in a Large-Scale Population Survey of Sub-Health Status and Its Influencing Factors.
- Research Article
33
- 10.1136/ip.2011.031740
- Jun 8, 2011
- Injury Prevention
Objective: To examine the appropriateness of different statistical models in analysing falls count data. Methods: Six count models (Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), zero-inflated NB (ZINB), hurdle Poisson (HP) and hurdle...
- Research Article
11
- 10.4103/0970-9290.74210
- Jan 1, 2010
- Indian Journal of Dental Research
The study aimed to use zero-inflated models to identify the factors associated with dental caries experience, an outcome that contains many zeros. A cross-sectional design was employed, using clinical examination and an interviewer-administered questionnaire. The study was conducted during March-August 2007 in Dharwad, Karnataka, India, and involved a systematic random sample of 1760 individuals aged 18-40 years. The dental caries examination was carried out using the DMFT index (i.e. Decayed (D), Missing (M), Filled (F)). The DMFT index data, which contain many zeros, were analyzed with Zero Inflated Poisson (ZIP) and Zero Inflated Negative Binomial (ZINB) models. The study findings indicated that family size, frequency of brushing, and duration of change of toothbrush were positively associated with dental caries, whereas frequency of sweet consumption was negatively associated with dental caries experience in both the ZIP and ZINB models. The ZIP model fit much better than the standard Poisson model, and the ZINB model fit better than the Negative Binomial model. The ZINB model also fit better than the ZIP model for modeling the DMF count data.
- Conference Article
- 10.65286/icic.v20i2.79693
- Jan 1, 2024
Background: Patients diagnosed with Budd–Chiari syndrome (BCS) who have undergone standardized treatments face a relatively high risk of recurrence when persistent risk factors are present. However, there are limited studies that analyze the factors influencing the frequency of recurrence using count data. Objective: To compare the goodness of fit for count models to identify the optimal model for assessing the recurrence frequency of BCS, and further analyze the factors contributing to BCS recurrence. Methods: The study included a total of 754 patients who were admitted to the Affiliated Hospital of Xuzhou Medical University between January 2015 and July 2022 and met the inclusion criteria. Among them, 243 experienced recurrences during the follow-up period. Using the recurrence frequency of patients with BCS as the dependent variable in the training cohort, we constructed four different count outcome models in R: the Poisson model, the negative binomial (NB) model, the zero-inflated Poisson (ZIP) model, and the zero-inflated negative binomial (ZINB) model. We employed the O test to detect over-dispersion in the data. The models were compared using the Vuong test, log-likelihood ratio test (LR), Akaike information criteria (AIC), corrected AIC (AICc), −2LogLikelihood (−2LL), root mean squared error (RMSE), mean absolute error (MAE), accuracy, precision, and graphical methods to select the model with the best fitting performance for exploring factors associated with BCS recurrence. Results: Of all 754 respondents, 511 patients reported no recurrences. The mean recurrence frequency was 0.64, with a variance of 2.46. The O statistic was 55.08 (p < 0.001), indicating over-dispersion in the data. The plot of predictions revealed that the predicted values of the ZINB model closely matched the actual values. 
The Vuong test revealed that the ZIP model outperformed the Poisson regression model (z = 34.29, p < 0.001), and the ZINB model was superior to the NB model (z = 3.40, p < 0.001). The LR tests indicated that the NB model performed better than the Poisson regression model (χ² = 124.91, p < 0.001), and the ZINB model outperformed the ZIP model (χ² = 34.29, p < 0.001). The ZINB model had the lowest −2LL (1100.26), AIC (1182.26), AICc (1188.40), MAE (0.94), and RMSE (2.02), whereas it achieved the highest accuracy (58.94%) and precision (28.07%) among the four models. In the ZINB model, the analysis of the counting process revealed that the variables significantly associated with recurrence frequency included age (odds ratio [OR] = 0.69; 95% confidence interval [CI]: 0.57–0.84), sex (female: OR = 1.77; 95% CI: 1.24–2.55), anticoagulant use (warfarin vs. new oral anticoagulants [NOACs]: OR = 2.11, 95% CI: 1.34–3.31; not using anticoagulants vs. NOACs: OR = 1.98, 95% CI: 1.20–3.28), absence of cirrhosis (OR = 0.57, 95% CI: 0.40–0.82), and neutrophil count (OR = 1.22, 95% CI: 1.04–1.42). The zero process analysis revealed that sex (female: OR = 22.43, 95% CI: 2.41–208.46), the type of operation (balloon dilatation combined with stent implantation vs. simple balloon dilatation: OR = 17.49, 95% CI: 1.32–231.99), anticoagulant use (warfarin vs. NOACs: OR = 7.10, 95% CI: 1.12–45.15; not using anticoagulants vs. NOACs: OR = 14.51, 95% CI: 2.33–90.24), absence of cirrhosis (OR = 0.15, 95% CI: 0.04–0.62), hospital duration (OR = 0.40, 95% CI: 0.20–0.81), and apolipoprotein A (OR = 0.37, 95% CI: 0.18–0.74) significantly impacted the likelihood of recurrence. Conclusions: The zero-inflated model proves robust in identifying factors influencing BCS recurrence compared to other models, elucidating the influence of gender, surgery, anticoagulation, cirrhosis, hospital duration, APOA, and neutrophil count on recurrence risk and frequency of BCS patients.
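The over-dispersion and excess zeros described in this abstract can be checked roughly from its reported summary statistics alone. The following is a minimal sketch, not the authors' O test: it only computes the variance-to-mean ratio and compares the observed share of zeros with the share a plain Poisson model with the same mean would expect. All variable names are illustrative.

```python
import math

# Summary statistics reported in the abstract above.
n_patients = 754
n_zero = 511        # patients with no recurrence
mean_count = 0.64
var_count = 2.46

# Dispersion index: equals 1 under a Poisson model; values above 1
# point to over-dispersion.
dispersion_index = var_count / mean_count

# Share of zeros observed vs. the share expected from a Poisson
# distribution with the same mean, P(Y = 0) = exp(-mean).
observed_zero_share = n_zero / n_patients
poisson_zero_share = math.exp(-mean_count)

print(f"dispersion index:       {dispersion_index:.2f}")
print(f"observed zero share:    {observed_zero_share:.3f}")
print(f"Poisson-expected zeros: {poisson_zero_share:.3f}")
```

A dispersion index well above 1 together with more zeros than the Poisson expectation is the informal pattern that motivates trying NB, ZIP, and ZINB models; it does not by itself decide between them.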
- Research Article
22
- 10.1080/03610926.2017.1402050
- Dec 6, 2017
- Communications in Statistics - Theory and Methods
The objective of this study is to provide a comparative assessment for researchers dealing with the challenges of analyzing count data, and to examine the factors associated with daily cigarette consumption among young people in Turkey. We fitted Poisson (P), negative binomial (NB), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), Poisson hurdle (PH) and negative binomial hurdle (NBH) regressions to cigarette consumption count data from the 2014 Turkey Health Survey. Our results showed that the ZINB and NBH models should be preferred. We also found that gender, employment and tobacco use at home are more effective factors for smokers and nonsmokers in the 15–24 age group in Turkey.
- Research Article
6
- 10.5539/ijsp.v7n3p22
- Apr 17, 2018
- International Journal of Statistics and Probability
Poisson and negative binomial regression models have been used as a standard for modelling count outcomes; but these methods do not take into account the problems associated with excess zeros. However, zero-inflated and hurdle models have been proposed to model count data with excess zeros. The study therefore compared the performance of Zero-inflated (Zero-inflated Poisson (ZIP) and Zero-inflated negative binomial (ZINB)), and hurdle (Hurdle Poisson (HP) and Hurdle negative binomial (HNB)) models in determining the factors associated with the number of Antenatal Care (ANC) visits in Nigeria. Using the 2013 Nigeria Demographic and Health Survey dataset, a sample of 19 652 women of reproductive age who gave birth five years prior to the survey and provided information about ANC visits was utilised. Data were analysed using descriptive statistics, ZIP, ZINB, HP and HNB models, and information criteria (AIC/BIC) was used to assess model fit. Participants’ mean age was 29.5 ± 7.3 years and median number of ANC visits was 4 (range: 0 - 30). About half (54.9%) of the participants had at least 4 ANC visits while 33.9% had none. The ZINB (AIC = 83 039.4; BIC = 83 470.3) fitted the data better than the ZIP or HP; however, HNB (AIC = 83 041.4; BIC = 83 472.3) competed favorably well with it. The Zero-inflated negative binomial model provided the better fit for the data. We suggest the Zero-inflated negative binomial model for count data with excess zeros of unknown sources such as the number of ANC visits in Nigeria.
- Research Article
57
- 10.1016/j.pedobi.2007.11.003
- Feb 29, 2008
- Pedobiologia
The excess-zero problem in soil animal count data and choice of appropriate models for statistical inference
- Research Article
- 10.30442/ahr.0402-6-17
- Dec 9, 2018
- Annals of Health Research
Background: Estimates of Under-Five mortality (U5M) have taken advantage of indirect methods but U5M risk factors have been identified using fixed statistical models with little considerations for the potentials of mixture models. Mixture models such as Poisson-Mixture models exhibit flexibility tendency, which is an attribute of robustness lacking in fixed models. Objective: To examine the robustness of Poisson-Mixture models in identifying reliable determinants of U5M. Methods: The data on 18,855 women used in this study were obtained from the 2008 Nigeria Demographic and Health Survey (NDHS). Six different Poisson-Mixture models namely: Poisson (PO), Zero-Inflated Poisson (ZIP), Poisson Hurdle (PH), Negative Binomial (NBI), Zero-Inflated Negative Binomial (ZINBI) and Negative Binomial Hurdle (NBIH) were fitted separately to the data. The Akaike Information Criteria (AIC) and diagnostic check for normality were used to select robust models. All tests were conducted at p = 0.05. Results: The models and AIC values for U5M were: 38763.47 (PO), 38654.55 (ZIP), 44270.77 (PH), 38526.26 (NBI), 38513.71 (ZINBI) and 44269.30 (NBIH). The PO, ZIP, PH and NBIH met normality test criteria, and the ZIP model was of best fit. The model identified breastfeeding, paternal education, toilet type, maternal education, place of delivery, birth-order and antenatal-visits as significant determinants of U5M at the national level. Conclusion: The Zero-Inflated Poisson model provided the best robust estimates of Under-five Mortality in Nigeria, while maternal education and birth-order were identified as the most important determinants. The Poisson-mixture models are recommended for modelling Under-five Mortality in Nigeria.
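The reported AIC values show why the selected model is not simply the one with the lowest AIC. The sketch below is one possible reading of the two-step selection described in the abstract (normality check first, then AIC); the abstract does not spell out the procedure, so the filtering logic is an assumption.

```python
# AIC values reported in the abstract above.
aic = {
    "PO": 38763.47,
    "ZIP": 38654.55,
    "PH": 44270.77,
    "NBI": 38526.26,
    "ZINBI": 38513.71,
    "NBIH": 44269.30,
}

# Models the abstract says met the normality test criteria.
passed_normality = {"PO", "ZIP", "PH", "NBIH"}

# Step 1 (for comparison): lowest AIC among all six models.
lowest_overall = min(aic, key=aic.get)

# Step 2 (assumed procedure): lowest AIC among models that passed the check.
eligible = {name: value for name, value in aic.items() if name in passed_normality}
lowest_eligible = min(eligible, key=eligible.get)

print("lowest AIC overall:          ", lowest_overall)
print("lowest AIC passing normality:", lowest_eligible)
```

Under this reading, ZINBI has the lowest AIC overall, while ZIP has the lowest AIC among the models that passed the diagnostic check, which is consistent with the abstract naming ZIP as the best fit.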
- Research Article
51
- 10.1093/ntr/nty072
- Apr 18, 2018
- Nicotine & Tobacco Research
This paper describes different methods for analyzing counts and illustrates their use on cigarette and marijuana smoking data. The Poisson, zero-inflated Poisson (ZIP), hurdle Poisson (HUP), negative binomial (NB), zero-inflated negative binomial (ZINB) and hurdle negative binomial (HUNB) regression models are considered. The different approaches are evaluated in terms of the ability to take into account zero-inflation (extra zeroes) and overdispersion (variance larger than expected) in count outcomes, with emphasis placed on model fit, interpretation, and choosing an appropriate model given the nature of the data. The illustrative data example focuses on cigarette and marijuana smoking reports from a study on smoking habits among youth e-cigarette users with gender, age, and e-cigarette use included as predictors. Of the 69 subjects available for analysis, 36% and 64% reported smoking no cigarettes and no marijuana, respectively, suggesting both outcomes might be zero-inflated. Both outcomes were also overdispersed with large positive skew. The ZINB and HUNB models fit the cigarette counts best. According to goodness-of-fit statistics, the NB, HUNB, and ZINB models fit the marijuana data well, but the ZINB provided better interpretation. In the absence of zero-inflation, the NB model fits smoking data well, which is typically overdispersed. In the presence of zero-inflation, the ZINB or HUNB model is recommended to account for additional heterogeneity. In addition to model fit and interpretability, choosing between a zero-inflated or hurdle model should ultimately depend on the assumptions regarding the zeros, study design, and the research question being asked. Count outcomes are frequent in tobacco research and often have many zeros and exhibit large variance and skew. Analyzing such data based on methods requiring a normally distributed outcome are inappropriate and will likely produce spurious results. 
This study compares and contrasts appropriate methods for analyzing count data, specifically those with an over-abundance of zeros, and illustrates their use on cigarette and marijuana smoking data. Recommendations are provided.
- Research Article
1
- 10.11648/j.ijdsa.20180401.15
- Jan 1, 2018
- International Journal of Data Science and Analysis
Count data arise in a wide range of disciplines. Poisson, negative binomial (NB), zero inflated Poisson (ZIP) and zero inflated negative binomial (ZINB) are some of the regression models proposed to model data with a count response. All of these models are potential candidates for count data, but there is no established way to choose the one that would perform better than the others. This study aimed to assess the count models mentioned above at various degrees of zero inflation. Datasets were simulated from the ZIP distribution under different conditions of zero inflation (0%, 2%, 5%, 10%, 15%, 20%, 30% and 40%). Poisson and NB were observed to predict regression coefficients well when the proportion of zeros was below 15%. The two zero-inflated models (ZIM) performed well at higher degrees of zero inflation: beyond 15% for ZIP and 20% for ZINB. Exploratory examination of the caries data revealed a zero inflation of 3.23%, that is, below 15%. Analysis of early childhood caries (ECC) data among 3-6 year old children who visited Lady Northey Dental Clinic was then performed with Poisson and NB. The Akaike information criterion (AIC) was used to compare all the competing models both under simulation and with real data. Poisson yielded lower AIC values at lower zero inflation rates than the other three models. ZIP had the lowest AIC value at the 10%, 15%, 20%, 30% and 40% levels of zero inflation. The NB model had the lowest AIC value when the real data were analyzed. Father's education level (primary school completed), chewing gum several times a week, eating jam several times a day, drinking juice every day, drinking soda every day, and eating sweets several times a week were found to be significant factors associated with ECC.
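The simulation design in this abstract (ZIP data at increasing levels of zero inflation) can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the Poisson mean `lam`, the sample size, and the seed are assumptions, and the listed percentages are interpreted here as the structural-zero probability `pi`.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_zip(pi, lam, size, rng):
    """Draw ZIP counts: a structural zero with probability pi, else Poisson(lam)."""
    return [0 if rng.random() < pi else sample_poisson(lam, rng) for _ in range(size)]


rng = random.Random(0)  # fixed seed so the run is reproducible
lam = 3.0               # assumed Poisson mean; not stated in the abstract
zero_share = {}
for pi in (0.0, 0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40):
    counts = sample_zip(pi, lam, 20000, rng)
    zero_share[pi] = counts.count(0) / len(counts)
    # Theoretical ZIP zero probability: pi + (1 - pi) * exp(-lam).
    expected = pi + (1 - pi) * math.exp(-lam)
    print(f"pi={pi:.2f}  observed zeros={zero_share[pi]:.3f}  expected={expected:.3f}")
```

Each simulated dataset could then be passed to Poisson, NB, ZIP, and ZINB fitting routines and compared by AIC, which is the step the abstract reports but this sketch leaves out.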
- Research Article
- 10.1093/jas/skad281.626
- Nov 6, 2023
- Journal of Animal Science
Fecal egg count (FEC) is used as an indicator of parasite infection level in sheep. The distribution of FEC is non-Gaussian and typically overdispersed, often with an excess in zero counts. Quantifying the extent of inflation of zero counts can be difficult. Our objective was to assess the potential zero-inflation problem in FEC resulting from variation in infection with gastrointestinal nematodes by using a generalized linear model approach. The zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models are useful techniques to analyze count data with excess zeros; ZINB also handles overdispersion. The ZINB model has the potential to delineate ‘true’ zeros, in this case, animals resistant to parasitism and thereby with zero FEC, from ‘false’ zeros, animals never or minimally exposed to a parasite challenge. By distinguishing false zeros, those animals expressing parasite resistance may be more clearly identified. Two datasets on Katahdin sheep, a hair breed known to express resistance to gastrointestinal nematodes, were investigated; a smaller set (n = 3,048) with FEC and FAMACHA (FAM) scores, a subjective measure of anemia indicative of parasitism by Haemonchus contortus, a blood sucking helminth; and a larger set (n = 14,405) with FEC and a contemporary group (CG) designation, assigned based on the flock, birth year, management group, and FEC recording date of the animal. Among animals with FAM recorded, 14% had scores indicative of at least border line anemia. Amongst the 410 CG, 22% had mean FEC more than 500 egg/g, a threshold value routinely used to indicate a substantial infection level with H. contortus. For each dataset, the Poisson, Negative Binomial, ZIP, and ZINB models were fit and compared using R (pscl package) and SAS software, with the ZINB providing the best fit. In the models considered, FEC was the response variable and either FAM or CG was the explanatory variable, depending on the dataset. 
Despite evidence of parasite challenge, the true and false zeros could not be delineated in both data sets using these models. The estimated proportion of false zeros due to inflation did not differ from the proportion of zeros observed in the data set. Either all zeros coincided with no infection, which seems unlikely in Katahdins, or neither FAM nor CG provided sufficient information to distinguish resistant from uninfected individuals. Alternative or additional explanatory variables, such as packed cell volume or immunoglobulin concentrations indicative of parasitic infection, may be necessary to separate true from false zero FEC in sheep challenged with gastrointestinal nematodes using the ZINB model.
- Research Article
42
- 10.1186/s12874-022-01685-8
- Aug 4, 2022
- BMC Medical Research Methodology
Background: Hospital length of stay (LOS) is a key indicator of hospital care management efficiency, cost of care, and hospital planning. Hospital LOS is often used as a measure of a post-medical procedure outcome, as a guide to the benefit of a treatment of interest, or as an important risk factor for adverse events. Therefore, understanding hospital LOS variability is always an important healthcare focus. Hospital LOS data can be treated as count data, with discrete and non-negative values, typically right skewed, and often exhibiting excessive zeros. In this study, we compared the performance of the Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) regression models using simulated and empirical data. Methods: Data were generated under different simulation scenarios with varying sample sizes, proportions of zeros, and levels of overdispersion. Analysis of hospital LOS was conducted using empirical data from the Medical Information Mart for Intensive Care database. Results: Results showed that Poisson and ZIP models performed poorly in overdispersed data. ZIP outperformed the rest of the regression models when the overdispersion is due to zero-inflation only. NB and ZINB regression models faced substantial convergence issues when incorrectly used to model equidispersed data. The NB model provided the best fit in overdispersed data and outperformed the ZINB model in many simulation scenarios with combinations of zero-inflation and overdispersion, regardless of the sample size. In the empirical data analysis, we demonstrated that fitting incorrect models to overdispersed data led to incorrect regression coefficient estimates and overstated significance of some of the predictors. Conclusions: Based on this study, we recommend that researchers consider ZIP models for count data with zero-inflation only and NB models for overdispersed data or data with combinations of zero-inflation and overdispersion. 
If the researcher believes there are two different data generating mechanisms producing zeros, then the ZINB regression model may provide greater flexibility when modeling the zero-inflation and overdispersion.
- Research Article
- 10.20527/epsilon.v15i1.3676
- Jul 16, 2021
- EPSILON: JURNAL MATEMATIKA MURNI DAN TERAPAN
Data that record the number of events in a certain period of time are called count data. Poisson regression is a generalized linear model (GLM) that can be used to model count data. Poisson regression assumes that the mean and variance of the response variable are equal (equidispersion). Several models are able to handle overdispersion due to excess zeros, among them the Zero Inflated model and the Hurdle model. This study examines the characteristics of parameter estimation when modeling count data that are overdispersed due to excess zeros, using the Zero Inflated Poisson (ZIP), Zero Inflated Negative Binomial (ZINB), Hurdle Poisson (HP), and Hurdle Negative Binomial (HNB) models on cases of diphtheria in West Sumatra in 2018. Based on individual parameter testing and AIC values, the HP model performs better than the ZIP, ZINB, and HNB models. The Hurdle Poisson model is therefore the better choice in this case.