Abstract

Background: Incomplete categorical variables with more than two categories are common in public health data. However, most existing missing-data methods do not use the information from nonresponse (missingness) probabilities.

Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring the distance between each missing value and the non-missing values. The distance function is calculated from a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate the contributions from the two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented.

Results: The simulation study suggests that the proposed method performs well under some misspecifications of the working models when missingness probabilities are not extreme. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method.

Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
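The donor-selection step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the predictive scores from the working models have already been computed and are passed in as one score vector per subject; the function names (`standardize`, `nnmi_proportions`), the standardization of scores, and the weighted Euclidean form of the distance are all assumptions made for illustration. Any further steps the full method may involve, such as fitting the working models, are not shown.

```python
import math
import random
from collections import Counter


def standardize(column):
    """Center and scale one score column to mean 0 and SD 1 (assumed preprocessing)."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    sd = math.sqrt(var) or 1.0  # guard against a constant column
    return [(v - mean) / sd for v in column]


def nnmi_proportions(y, scores, weights, nn=3, n_imputations=10, seed=0):
    """Estimate category proportions by nearest-neighbour multiple imputation.

    y       : list of category labels, None where the outcome is missing
    scores  : one score vector per subject (hypothetical, precomputed from
              the outcome and missingness working models)
    weights : one non-negative weight per score
    """
    rng = random.Random(seed)
    # Put every score on a common scale so the weights are comparable.
    columns = [standardize([row[k] for row in scores]) for k in range(len(weights))]
    z = list(zip(*columns))

    observed = [i for i, v in enumerate(y) if v is not None]
    missing = [i for i, v in enumerate(y) if v is None]
    categories = sorted(set(y[i] for i in observed))

    def distance(i, j):
        # Weighted Euclidean distance between two subjects' score vectors (assumed form).
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, z[i], z[j])))

    # Donor set: the nn observed subjects closest to each subject with a missing outcome.
    donors = {i: sorted(observed, key=lambda j: distance(i, j))[:nn] for i in missing}

    totals = Counter()
    for _ in range(n_imputations):
        completed = list(y)
        for i in missing:
            # Impute by drawing one donor at random and copying its observed outcome.
            completed[i] = y[rng.choice(donors[i])]
        counts = Counter(completed)
        for c in categories:
            totals[c] += counts[c] / len(y)
    # Average the per-imputation proportions over all imputed data sets.
    return {c: totals[c] / n_imputations for c in categories}


# Toy data: the subject with a missing outcome sits next to the two "a" subjects.
toy_y = ["a", "a", "b", "b", None]
toy_scores = [[0.0], [0.1], [5.0], [5.1], [0.05]]
toy_result = nnmi_proportions(toy_y, toy_scores, weights=[1.0], nn=2)
```

In the toy call, both donors of the missing subject belong to category "a", so every imputation fills in "a". With several scores per subject, changing `weights` shifts how much each working model influences which donors are chosen.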

Highlights

  • Incomplete categorical variables with more than two categories are common in public health data

  • The methods compared include: fully observed (FO) analysis, which is treated as the gold standard because the analysis is applied before some of the Y values are removed; complete-case (CC) analysis, which excludes cases with missing Y; the calibration estimator (CE); a parametric multiple imputation (MI) approach (PMI), which imputes the missing values using predicted values from a multinomial logistic regression model; and the proposed nearest neighbor-based MI (NNMI) approach

  • The method using multinomial logistic regressions for the outcome model is denoted as NNMI_MLR(NN, ω1, ..., ωM), and that using cumulative logistic regressions is denoted as NNMI_CLR(NN, ω1, ..., ωM)
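To make the two simplest comparators above concrete, here is a small Python sketch of the fully observed (FO) and complete-case (CC) proportion estimates. The helper names and the toy data are hypothetical and are not taken from the study. The toy data assume that outcomes are removed more often among subjects who reported "Poor".

```python
from collections import Counter


def proportions(values):
    """Return the share of each category among the given values."""
    counts = Counter(values)
    return {c: counts[c] / len(values) for c in sorted(counts)}


def fully_observed(y_full):
    """FO analysis: estimate proportions before any outcome is removed."""
    return proportions(y_full)


def complete_case(y_with_missing):
    """CC analysis: drop subjects whose outcome is missing (None), then estimate."""
    return proportions([v for v in y_with_missing if v is not None])


# Hypothetical data: two of the four "Poor" outcomes are set to missing.
y_full = ["Good"] * 6 + ["Poor"] * 4
y_observed = ["Good"] * 6 + ["Poor"] * 2 + [None] * 2

fo = fully_observed(y_full)      # {'Good': 0.6, 'Poor': 0.4}
cc = complete_case(y_observed)   # {'Good': 0.75, 'Poor': 0.25}
```

In this toy case the CC estimate overstates the share of "Good" relative to FO, the kind of distortion that imputation-based methods aim to reduce.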

Summary

Introduction

Incomplete categorical variables with more than two categories are common in public health data. In population studies of public health, the health status of participants is a research outcome of interest and is commonly reported using ordinal categories such as “Excellent”, “Good”, “Fair”, and “Poor”. These variables are typically subject to missing data.

Zhou et al. BMC Medical Research Methodology (2017) 17:87

[...] problems. It consists of iterative expectation and maximization steps for estimating the parameter [4]. This approach relies only on the information from a working model predicting the missing values/outcome, and ignores the information embedded in missingness probabilities, that is, the probabilities of being missing (or nonresponse probabilities). The corresponding estimates might not be robust to certain misspecifications of the outcome model.
