Abstract

Time series data often suffer from a high percentage of missing values, a consequence of both the nature of the data and the way they are collected. This paper proposes a technique for imputing missing values in time series data. A fuzzy Gaussian membership function and a fuzzy triangular membership function are used within a data imputation algorithm to identify the best imputation for each missing value: the membership functions calculate weights for the data values of the nearest neighbors before those values are used in the imputation process. The evaluation results show that the proposed technique outperforms traditional data imputation techniques, and that the triangular fuzzy membership function achieves higher accuracy than the Gaussian membership function.
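The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the formulas, so the sketch assumes that membership is evaluated on the time-index distance between a missing point and each of its nearest observed neighbors, and that the imputed value is the membership-weighted mean of those neighbors. The function names and the parameters `sigma`, `width`, and `k` are hypothetical.

```python
import math


def gaussian_membership(distance, sigma=1.0):
    """Gaussian fuzzy membership of a neighbor at the given distance (assumed form)."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def triangular_membership(distance, width=3.0):
    """Triangular fuzzy membership: 1 at distance 0, falling linearly to 0 at `width`."""
    return max(0.0, 1.0 - abs(distance) / width)


def impute(series, membership, k=2):
    """Fill None entries with a membership-weighted mean of the k nearest observed values.

    Assumption: "nearest" means nearest in time index, not nearest in value.
    """
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    result = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        # k observed points closest in time to the missing position
        neighbours = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
        if not neighbours:
            continue  # nothing observed, leave the gap as-is
        weights = [membership(abs(j - i)) for j, _ in neighbours]
        total = sum(weights)
        if total == 0:
            # all neighbors fall outside the membership support: use a plain mean
            result[i] = sum(v for _, v in neighbours) / len(neighbours)
        else:
            result[i] = sum(w * v for w, (_, v) in zip(weights, neighbours)) / total
    return result


if __name__ == "__main__":
    data = [10.0, None, None, 40.0]
    print(impute(data, triangular_membership))
    print(impute(data, gaussian_membership))
```

In this sketch the two membership functions differ only in how quickly a neighbor's influence decays with distance: the triangular function drops to exactly zero beyond `width`, while the Gaussian function never reaches zero.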

Highlights

  • In the computer science field, the data quality problem began to rise in the 1990s with the rise of data warehouse systems, where the failure of a database project was attributed to its poor data quality. [1] There are many definitions of the term “data quality”, but as mentioned in [2], a well-known definition used by many researchers is “fitness for use”.

  • These data quality dimensions consist of timeliness, to ensure that the value is up to date; consistency, to ensure that the representation of the data is the same in all cases; completeness, to ensure that the data are complete, with no missing values; and accuracy, to ensure that the recorded value is identical to the actual value. [1]

  • The paper introduces two proposed techniques, based on fuzzy logic, for imputing missing values in time series data.


Summary

INTRODUCTION

The data quality problem began to rise in the 1990s with the rise of data warehouse systems, where the failure of a database project was attributed to its poor data quality. [1] There are many definitions of the term “data quality”, but as mentioned in [2], a well-known definition used by many researchers is “fitness for use”. Missing at random (MAR): a variable is missing at random when the probability of missingness depends only on available information. This type can also be described as conditionally missing, meaning missing under a condition; for example, respondents whose gender is male will leave the survey questions related to women empty. This paper aims to ensure the data quality of time series data; specifically, it aims to ensure the completeness dimension of time series data that suffer from missing values. Towards this aim, two novel techniques for imputing missing values in time series data are proposed and compared with traditional techniques. The evaluation results show that the two proposed techniques have higher accuracy than the traditional data imputation techniques.
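The missing-at-random case from the introduction can be made concrete with a small sketch. It only illustrates the survey example given there; the field names `gender` and `pregnancies` and the function `apply_mar_mask` are hypothetical and do not come from the paper.

```python
def apply_mar_mask(records):
    """Blank out a women-only question for male respondents.

    Missingness here depends only on an observed variable (gender),
    which is the condition the introduction gives for MAR.
    """
    masked = []
    for record in records:
        record = dict(record)  # copy so the input is not modified
        if record["gender"] == "male":
            record["pregnancies"] = None  # question left empty
        masked.append(record)
    return masked


if __name__ == "__main__":
    survey = [
        {"gender": "male", "pregnancies": 0},
        {"gender": "female", "pregnancies": 2},
    ]
    for row in apply_mar_mask(survey):
        print(row)
```

Because the missing entries can be predicted from a column that is fully observed, an imputation method is free to condition on that column when filling them in.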

RELATED WORK
Deletion and Ignoring Methods
Imputation Methods
Model-Based Methods
PROPOSED TECHNIQUE
PERFORMANCE EVALUATION AND DISCUSSION
Findings
CONCLUSION
Full Text
Published version (Free)
