A comparative study of different imputation methods for daily rainfall data in east-coast Peninsular Malaysia

Siti Mariana Che Mat Nor, Mou Leong Tan, Shuhaida Ismail, Nurul Hila Zainuddin, Shazlyn Milleana Shaharudin

doi:10.11591/eei.v9i2.2090

Abstract

Rainfall data are among the most significant variables in hydrological and climatological modelling. However, such datasets are prone to missing values for various reasons. This study imputes missing rainfall values using several imputation methods: Replace by Mean, Nearest Neighbor, Random Forest (RF), Nonlinear Iterative Partial Least Squares (NIPALS) and Markov Chain Monte Carlo (MCMC). Daily rainfall datasets from 48 rainfall stations across east-coast Peninsular Malaysia were used in this study. The datasets were then fed into a Multiple Linear Regression (MLR) model. The performance of these methods was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and the Nash-Sutcliffe Efficiency Coefficient (CE). The experimental results showed that RF coupled with MLR (RF-MLR) was the most suitable approach for imputing the missing data in east-coast Peninsular Malaysia.
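The three evaluation metrics named in the abstract can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions of RMSE, MAE and the Nash-Sutcliffe coefficient, not the authors' implementation; the function names and the sample rainfall values are assumptions made up for the example.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe Efficiency Coefficient (CE).

    CE = 1 - SSE / SST, where SST is the squared deviation of the
    observations from their own mean. CE = 1 is a perfect fit; CE <= 0
    means the model is no better than predicting the observed mean.
    """
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical daily rainfall (mm): observed values vs. model output.
observed = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(f"RMSE = {rmse(observed, predicted):.4f}")           # RMSE = 0.7071
print(f"MAE  = {mae(observed, predicted):.4f}")            # MAE  = 0.5000
print(f"CE   = {nash_sutcliffe(observed, predicted):.4f}") # CE   = 0.9000
```

Lower RMSE and MAE and a CE closer to 1 indicate a better fit, which is the sense in which one imputation method can be ranked above another.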

Highlights

In climatology and hydrological modeling, daily rainfall data are among the significant variables. Water resources management requires comprehensive datasets of hydrological variables, including volumes, temperature and water level.
Three types of missing data were taken into account: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR).
The imputation methods were applied to 48 stations in east-coast Peninsular Malaysia.

Summary

Introduction

In climatology and hydrological modeling, daily rainfall data are among the significant variables. Water resources management requires comprehensive datasets of hydrological variables, including volumes, temperature and water level. Missing values in hydrological data, particularly in rainfall datasets, are classified as Missing Completely At Random (MCAR), because data from that area or any other area do not affect the occurrence of missing values in an area's rainfall dataset [1, 2]. It was reported in [3] that, for the imputation of univariate hydrological time series, the missing data were classified as MCAR and Missing At Random (MAR). MCAR describes data where the chance that a particular value is missing is independent of any variable in the dataset [2]. The most convenient way to handle missing data is to delete every observation that contains a missing value and analyze the remaining complete data.
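The MCAR mechanism and the deletion approach described above can be illustrated with a short sketch. It assumes a toy table of daily rainfall with one row per day and one column per station; the function names `mask_mcar` and `listwise_delete`, the station values and the 30% missing rate are all hypothetical and are not taken from the study.

```python
import random


def mask_mcar(values, missing_rate, seed=0):
    """Blank out each value with the same fixed probability.

    The decision to drop a value ignores the value itself and every
    other variable, which is what makes the mechanism MCAR.
    """
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else v for v in values]


def listwise_delete(rows):
    """Keep only the rows (days) in which every station has a value."""
    return [row for row in rows if all(v is not None for v in row)]


# Hypothetical daily rainfall (mm) at two stations over six days.
station_a = [0.0, 12.5, 3.2, 0.0, 40.1, 7.8]
station_b = [1.1, 10.0, 0.0, 0.0, 35.6, 9.3]

# Simulate MCAR gaps in station A only (assumed 30% missing rate).
station_a_missing = mask_mcar(station_a, missing_rate=0.3, seed=42)

rows = list(zip(station_a_missing, station_b))
complete_rows = listwise_delete(rows)

print(f"days before deletion: {len(rows)}")
print(f"days after deletion:  {len(complete_rows)}")
```

Each gap in one station discards the whole day, including the valid reading from the other station, which is the cost of deletion that imputation is meant to avoid.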

Objectives

Results

Conclusion