Abstract

Background
The UAE Healthy Future Study (UAEHFS) is one of the first large prospective cohort studies in the Gulf region to examine causes and risk factors for chronic diseases among adult UAE nationals. Missing values are often unavoidable in empirical research and can, in many cases, lead to bias. The aim of this study is to estimate the percentage of depression in the UAEHFS pilot data from the eight-item Patient Health Questionnaire (PHQ-8) variables, using different statistical methods of handling missing values.

Methods
Five common statistical machine learning methods of handling missing values were included in this analysis: mode imputation, k-nearest neighbor (KNN) imputation, classification and regression trees (CART) imputation, random forest (RF) imputation, and random sampling from observed values (Sample). One hundred multiple imputations were used.

Results
A total of 487 (94.2%) eligible participants were included in the analysis, and 231 (44.7%) were included in the complete case analysis. The median age was 30 years (interquartile range: 23-38). More males (67.8%) than females were included in the analysis. The estimated percentage of depression was 8.4%, 8.9%, 9.9%, 12.5%, 15.4%, and 17.9% by mode, complete case, Sample, RF, CART, and KNN, respectively. In additional analyses, the estimated proportions of depression were 11.5% by complete case, 11.9% by KNN, 13.2% by K-means clustering, and 13.2% by random forest.

Conclusions
The estimated percentage of depression in the UAEHFS pilot data varies between the applied methods of handling missing values. This shows that the problem of missing values in these variables is not negligible. Further research is needed using multiple imputations in the main UAEHFS dataset after recruitment is complete.

Key messages
• For missing values in the depression variables, we recommend using multiple imputations, not to generate data but to prevent the exclusion of observed data.
• To obtain a better estimate of the percentage of depression, we recommend using different machine learning methods.
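
The comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the study's analysis code. It covers only three of the approaches named above (complete case, mode imputation, and random sampling from observed values), and it pools the repeated Sample imputations by averaging the per-imputation estimates. Several things are assumptions made for illustration: the function names, the toy records (which are invented and are not UAEHFS data), the use of `None` to mark a missing item, and the PHQ-8 cutoff of 10, which is a commonly used threshold that the abstract itself does not state.

```python
import random
from collections import Counter

# Assumption: a total PHQ-8 score of 10 or more is classified as depression.
# The abstract does not state which cutoff the study used.
PHQ8_CUTOFF = 10


def is_depressed(items):
    """Classify one complete record of eight item scores (each 0-3)."""
    return sum(items) >= PHQ8_CUTOFF


def percent_depressed(records):
    """Percentage of complete records classified as depressed."""
    return 100.0 * sum(is_depressed(r) for r in records) / len(records)


def complete_case(records):
    """Keep only records with no missing item (None marks a missing value)."""
    return [r for r in records if None not in r]


def impute_mode(records):
    """Fill each missing item with the most frequent observed value of that item."""
    n_items = len(records[0])
    modes = []
    for j in range(n_items):
        observed = [r[j] for r in records if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if v is None else v for j, v in enumerate(r)] for r in records]


def impute_sample(records, rng):
    """Fill each missing item with a random draw from that item's observed values."""
    n_items = len(records[0])
    pools = [[r[j] for r in records if r[j] is not None] for j in range(n_items)]
    return [[rng.choice(pools[j]) if v is None else v for j, v in enumerate(r)]
            for r in records]


def pooled_sample_estimate(records, n_imputations=100, seed=0):
    """Average the estimate over repeated Sample imputations."""
    rng = random.Random(seed)
    estimates = [percent_depressed(impute_sample(records, rng))
                 for _ in range(n_imputations)]
    return sum(estimates) / len(estimates)


# Invented toy records for illustration only; these are not study data.
toy = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [3, 3, 2, 2, 1, 1, 0, 0],
    [1, 0, None, 0, 0, 1, 0, 0],
    [2, 2, 2, None, 2, 1, 1, 1],
    [0, 1, 0, 0, None, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, None],
]

print("complete case:", percent_depressed(complete_case(toy)))
print("mode:", percent_depressed(impute_mode(toy)))
print("sample (pooled):", pooled_sample_estimate(toy))
```

In this toy example the complete case analysis keeps only two of the six records, so its estimate differs from the imputed ones even though no new information was created. That is the sense in which imputation is used here to avoid excluding observed data.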