New Trends in Evidence-based Statistics: Data Imputation Problems

N V Kovtun, A.-N Ya Fataliieva

doi:10.31767/su.4(87)2019.04.01

Abstract

The main reasons for omissions are: (1) exclusion of a subject from the study for non-compliance with study requirements; (2) the occurrence of an adverse event; (3) a missing result; (4) lack of registration; (5) researchers’ acts of omission and/or commission. The following thresholds for data gaps can be defined: (1) omissions of less than 5% are insignificant and do not affect the research results; (2) data losses of 20% or more call the integrity of the research results into question. The higher the share of missing data, the less reliable the conclusions and the harder it is to demonstrate treatment efficacy. Missing data are therefore a potential source of bias in the analysis. Excluding subjects can affect the comparability of groups and subgroups, which biases the estimates.

There are different ways to deal with missing data. The simplest is to exclude the subject from the calculations. The consequences of this approach are a reduced sample size, weaker grounds for statistical inference, and a changed confidence interval (e.g. narrowing caused by underestimated variances). It is therefore important to identify the nature of the omissions, which can be missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). The nature of the omissions determines the appropriate method for processing data with missing values: exclusion, filling, weighting or modeling. These methods give different results for different volumes and types of omissions.

We attempted to evaluate the results of different imputation methods on a sample in which different proportions of missing data were simulated. With 10% MCAR omissions, the parameter estimates and p-values for the two factors obtained with the first group of methods were close to the results from the complete data. The mean squared errors calculated with the unconditional mean method and with the method of filling gaps by successive selection were the closest to those of the original model; all other methods overestimated this value. The coefficient of determination was almost the same as for the original data when gaps were filled by successive selection.

With 25% MCAR omissions, the treatment group factor became insignificant when gaps were filled with unconditional or conditional means. The lowest estimate of the coefficient of determination was obtained with unconditional mean filling, and the overestimation was smallest with filling by successive selection. The changes were minimal with the other approaches. The parameter estimates and p-values from available-case analysis were the closest to those from the regression on the complete data.

With 50% MCAR omissions, pre-treatment weight became insignificant under complete-case analysis, and the treatment group factor became insignificant under filling by successive selection. The most accurate estimate for the pre-treatment weight variable came from the conditional mean method. The unconditional mean method nevertheless stands out: its results were the closest to the original data.

For imputation with 10% and 50% MAR omissions, every method changed the parameter estimates for the intercept and the two factors only minimally. The mean squared error and the coefficient of determination were closest to the complete-data results when multiple imputation methods were applied.

This study identifies the strengths and weaknesses of different data imputation methods and shows how their relative effectiveness changes with the share of missing information. The results indicate that imputation cannot follow a “one-size-fits-all” approach: the problem should be solved case by case through analysis of the existing database, taking into account not only the characteristics of the data and the volume of omissions but also the expected contribution of the particular study.
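The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or data: the synthetic trial, the coefficients, the function names and the use of group means as the "conditional mean" are all assumptions made for illustration. It simulates MCAR gaps in pre-treatment weight at 10%, 25% and 50%, then refits the same linear regression under complete-case analysis, unconditional mean filling and conditional (group) mean filling.

```python
import numpy as np

def simulate_trial(n=200, seed=0):
    """Synthetic data (hypothetical): treatment group, pre- and post-treatment weight."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n).astype(float)
    pre = rng.normal(80.0, 10.0, size=n)
    post = 5.0 + 0.9 * pre - 3.0 * group + rng.normal(0.0, 4.0, size=n)
    return group, pre, post

def make_mcar(x, rate, seed=1):
    """Blank out a fixed share of values, chosen independently of the data (MCAR)."""
    rng = np.random.default_rng(seed)
    out = x.copy()
    k = int(round(rate * x.size))
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = np.nan
    return out

def fit_ols(group, pre, post):
    """Least-squares fit of post ~ intercept + group + pre; returns (beta, R^2)."""
    X = np.column_stack([np.ones_like(pre), group, pre])
    beta, *_ = np.linalg.lstsq(X, post, rcond=None)
    resid = post - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((post - post.mean()) ** 2)
    return beta, r2

def complete_case(group, pre_miss, post):
    """Drop every subject whose pre-treatment weight is missing."""
    keep = ~np.isnan(pre_miss)
    return fit_ols(group[keep], pre_miss[keep], post[keep])

def unconditional_mean(group, pre_miss, post):
    """Fill every gap with the overall mean of the observed values."""
    filled = np.where(np.isnan(pre_miss), np.nanmean(pre_miss), pre_miss)
    return fit_ols(group, filled, post)

def conditional_mean(group, pre_miss, post):
    """Fill each gap with the observed mean of the subject's own treatment group
    (an assumed, simple reading of "conditional mean")."""
    filled = pre_miss.copy()
    for g in (0.0, 1.0):
        in_g = group == g
        filled[in_g & np.isnan(pre_miss)] = np.nanmean(pre_miss[in_g])
    return fit_ols(group, filled, post)

group, pre, post = simulate_trial()
beta_full, r2_full = fit_ols(group, pre, post)
print(f"complete data: beta={np.round(beta_full, 3)}, R2={r2_full:.3f}")

for rate in (0.10, 0.25, 0.50):
    pre_miss = make_mcar(pre, rate)
    for name, method in (("complete-case", complete_case),
                         ("unconditional mean", unconditional_mean),
                         ("conditional mean", conditional_mean)):
        beta, r2 = method(group, pre_miss, post)
        print(f"{rate:.0%} MCAR, {name}: beta={np.round(beta, 3)}, R2={r2:.3f}")
```

Filling gaps with a single mean leaves the sum of squared deviations unchanged while increasing the number of observations, so the sample variance of the filled variable shrinks. This is one mechanism behind the narrowed confidence intervals mentioned above.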

Highlights

A comparative analysis of the results of different imputation methods was carried out on a sample for which different variants of missing data were simulated
With 25% MCAR omissions, the lowest estimate of the coefficient of determination was obtained with unconditional mean filling, and the overestimation was smallest with the method of filling gaps by successive selection
The mean squared errors calculated with the unconditional mean method and with the method of filling gaps by successive selection were the closest to the original model; all other methods overestimated this value

Summary

THEORY AND METHODOLOGY OF STATISTICS

A comparative analysis of the results of different imputation methods was carried out on a sample for which different variants of missing data were simulated. With 10% MCAR omissions, the parameter estimates and p-values for the two factors obtained with the first group of methods were close to the results obtained on the complete data. The parameter estimates and p-values obtained with available-case analysis were closer to the values obtained from the regression on the complete data. The unconditional mean filling method can also be singled out: its results were the closest to the original data. It was with multiple imputation that the mean squared error and the coefficient of determination were closest to the results based on the complete data. We attempted to evaluate the results of different imputation methods on a sample for which different variants of missing data were simulated (10%, 25%, 50%). To analyze how the treatment group and pre-study weight determine post-study weight, a linear regression model was built for the original data (Tables 1 and 2; here and below, the authors' calculations): Table 1. Parameters of the correlation-regression model of the effect of treatment type and pre-treatment weight on post-treatment weight

[Table values not recovered. Surviving labels: Sum of squared deviations; Method; Complete data; Maximum likelihood method; Weighting method; Multiple imputation]


Journal: Statistics of Ukraine
Publication Date: Mar 12, 2020
Citations: 1
License type: cc-by


Similar Papers

Software Implementation of Missing Data Recovery: Comparative Analysis
N V Kovtun ... A.-N Ya Fataliieva
Statistics of Ukraine | VOL. 91
16 Dec 2020

A comparison of imputation methods for categorical data
Shaheen Mz Memon ... Ignace H Kabano
Informatics in Medicine Unlocked | VOL. 42
01 Jan 2023

Identifying reprioritization response shift in a stroke caregiver population: a comparison of missing data methods.
Tolulope T Sajobi ... Nancy E Mayo
Quality of Life Research | VOL. 24
26 Oct 2014

Comparison of Single and MICE Imputation Methods for Missing Values: A Simulation Study
Nurul Azifah Mohd Pauzi ... Yap Bee Wah
Pertanika Journal of Science and Technology | VOL. 29
30 Apr 2021
