Quantitative Evaluation of Imputation Methods Using Bounds Estimation of the Coefficient of Determination for Data-Driven Models with an Application to Drilling Logs

Jie Cao, Øystein Arild, Andrzej T Tunkiel, Dan Sui

doi:10.2118/214323-pa

Abstract

With the constantly increasing quantity of data recorded in the oil and gas industry, data analytics and data-driven algorithms are gaining popularity. These algorithms, however, are highly sensitive to the data management methods employed. Using drilling data as an example, methods for data quality improvement play an ever-increasing role in the data preparation phase. One of the most common data issues, especially in real-time sensing systems, is missing data (gaps between measured points). Various data imputation approaches (forward filling, interpolation, regression modeling, etc.) have been used to fill gaps and complete data sets as standard data processing procedures. Metrics such as the coefficient of determination (R²), mean absolute error, root mean square error, and mean absolute percentage error are common ways to evaluate imputation approaches, assuming that the ground truth data are at hand. In reality, the ground truth of missing data is not available for quantitative method evaluation. Especially when data are received only occasionally, the ground truth is impossible to estimate or can be estimated only inaccurately, which raises practical questions: how can infilling methods be evaluated quantitatively, and how can their behaviors be compared? To the best of the authors' knowledge, few methods exist so far to quantitatively estimate the accuracy of data imputation without the ground truth. To some extent, one may therefore lack proven confidence in using imputed data for subsequent data analysis and data-driven modeling. In this study, a novel approach has been developed to quantitatively evaluate data imputation quality. The presented method is built on an analytical estimation of the R² bounds in the context of linear interpolation. The reasons for choosing linear regression as a benchmarking method are twofold. First, the bounds of R² can be determined analytically for data imputation in the context of linear regression. Second, linear regression is a relatively simple imputation method, and we assume that more sophisticated methods, such as random forest (RF), k-nearest neighbors (KNN), and other iterative imputation techniques, can achieve higher accuracy with proper parameter tuning. The evaluation method is demonstrated by examples using real-time drilling data logs. Moreover, the application to drilling logs also enables informed decision-making in relation to gap filling and data resampling workflows.
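
The conventional metrics named in the summary can only be computed when the true values in the gaps are known. As a minimal sketch of that baseline situation (not the bounds-estimation method of this paper), the Python snippet below hides points in a small synthetic series, fills them by forward filling and by linear interpolation, and scores both against the held-out values. The series, the hidden indices, and all function names are hypothetical and chosen only for illustration; a uniform sampling grid is assumed.

```python
import math

def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def linear_interpolate(values):
    """Fill None gaps with straight lines between the nearest observed
    neighbors, assuming uniformly spaced samples."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled

def scores(truth, estimate):
    """Return R2, MAE, RMSE, and MAPE (in percent) of estimate against truth."""
    n = len(truth)
    errors = [e - t for t, e in zip(truth, estimate)]
    mean_truth = sum(truth) / n
    ss_res = sum(err ** 2 for err in errors)
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(err) for err in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 * sum(abs(err / t) for t, err in zip(truth, errors)) / n,
    }

# Synthetic series (hypothetical values, not drilling data from the paper).
truth = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0, 37.0]
hidden = [1, 2, 4, 5]  # indices treated as missing
observed = [None if i in hidden else v for i, v in enumerate(truth)]

# Score each method only on the points that were hidden.
held_out = [truth[i] for i in hidden]
ffill_scores = scores(held_out, [forward_fill(observed)[i] for i in hidden])
interp_scores = scores(held_out, [linear_interpolate(observed)[i] for i in hidden])

print("forward fill:", ffill_scores)
print("linear interpolation:", interp_scores)
```

On real logs the held-out values do not exist for genuine gaps, which is the situation the study addresses.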

Full Text