Natural Language Processing Based Two-Stage Machine Learning Model for Automatic Mapping of Activity Codes Using Drilling Descriptions

Sharma Yogesh, Ahuja Bhawna, Gandikota Gurunath, Verma Shashwat

doi:10.2118/214522-ms

Abstract

The daily drilling report (DDR) contains information on daily activities and parameters from well operations. The entries are classified using activity codes so that field performance can be evaluated and decision-making improved. The activity code sets are hierarchical, with main codes and subcodes at successive coding levels. However, assigning them requires knowledge of a substantial number of codes and subcodes, so accurate and consistent identification of codes for operational activities is challenging and time-consuming. This work proposes a novel approach to automatically suggest the activity code for drilling activities in a well information management system (IMS), with the aim of facilitating the digitization of well operations. We propose a natural language processing (NLP) based two-stage machine learning (ML) model that predicts activity codes from drilling activity descriptions. The methodology begins with data analysis to identify the factors critical to developing the ML model. A sampling approach is adopted to handle the diversity of the large dataset, and augmentation via contextual embeddings is explored for minority classes. Term frequency-inverse document frequency (TF-IDF) is used to extract features from the text. The classifier is first trained to predict the main activity codes. The main codes predicted in the first stage are then added to the feature space for second-stage training, which enhances accuracy. To improve accuracy further, related subcodes are grouped according to the confusion matrix, model performance, and expert advice. The ML model is then integrated with the IMS. The method was implemented on a large dataset of 3000+ wells with 1M+ rows. With 70% of the dataset used for training, the subcode prediction accuracies achieved are 66% for the conventional model, 83% for grouped subcode prediction, and 92% for the proposed two-stage grouped subcode prediction. Hence, the proposed model significantly outperforms the conventional model.
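The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and SciPy are available, uses a random forest as a stand-in for the tree-based classifier, and every description and code label (`DRL`, `TRP-IN`, and so on) is hypothetical toy data. The helper name `suggest_codes` is likewise an assumption.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: (description, main code, subcode). Not from the paper.
rows = [
    ("Drilled 12.25 in hole from 1500 m to 1620 m rotating", "DRL", "DRL-ROT"),
    ("Rotary drilled ahead to 1750 m", "DRL", "DRL-ROT"),
    ("Slide drilled from 1750 m to 1765 m to build angle", "DRL", "DRL-SLD"),
    ("Sliding with motor to correct azimuth", "DRL", "DRL-SLD"),
    ("Tripped in hole with BHA to 1500 m", "TRP", "TRP-IN"),
    ("Ran in hole with drill pipe to bottom", "TRP", "TRP-IN"),
    ("Pulled out of hole to surface", "TRP", "TRP-OUT"),
    ("Tripped out of hole and laid down BHA", "TRP", "TRP-OUT"),
    ("Ran 9.625 in casing to 1760 m", "CSG", "CSG-RUN"),
    ("Ran casing joints and filled every 10 joints", "CSG", "CSG-RUN"),
    ("Cemented casing and displaced with mud", "CSG", "CSG-CMT"),
    ("Pumped cement slurry and bumped plug", "CSG", "CSG-CMT"),
]
descriptions = [r[0] for r in rows]
main_codes = [r[1] for r in rows]
sub_codes = [r[2] for r in rows]

# Feature extraction: TF-IDF over the free-text descriptions.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(descriptions)

# Stage 1: predict the main activity code from the text features.
stage1 = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)
stage1.fit(X_text, main_codes)

# Stage 2: append the predicted main code (one-hot encoded) to the text
# features and train the subcode classifier on the enlarged feature space.
# In-sample predictions are used here for brevity; out-of-fold predictions
# would be a more careful choice on real data.
encoder = OneHotEncoder(handle_unknown="ignore")
main_pred = np.asarray(stage1.predict(X_text)).reshape(-1, 1)
X_stage2 = hstack([X_text, encoder.fit_transform(main_pred)]).tocsr()

stage2 = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)
stage2.fit(X_stage2, sub_codes)


def suggest_codes(text):
    """Return (main code, subcode) suggested for one description."""
    x = vectorizer.transform([text])
    main = np.asarray(stage1.predict(x))
    x2 = hstack([x, encoder.transform(main.reshape(-1, 1))]).tocsr()
    return main[0], stage2.predict(x2)[0]


print(suggest_codes("Ran casing to 1800 m"))
```

Grouping related subcodes, as the abstract describes, would amount to remapping the `sub_codes` labels to group labels before fitting the second stage.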
It is observed that the number of codes and subcodes affects the accuracy. During microservice development, memory requirement and latency are also examined. Beyond a certain point, increasing the tree depth of the ML model offers no significant accuracy improvement, yet it increases both memory requirement and latency. Compression reduces the memory requirement significantly, but at the cost of increased latency. Hence, an optimal trade-off between accuracy, latency, and memory requirement may be attained by selecting model features. It is therefore established that the proposed workflow can assist the digitalization of activity code mapping, with the potential benefits of improved performance and efficiency and reduced manual effort in database information systems. The novelty of this approach lies in the two-stage prediction, in which the hierarchical nature of the codes is exploited to enhance accuracy with the help of NLP and ML. Grouping related codes on the basis of expert knowledge and model performance also provides a realistic way to reduce manual effort.
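The compression trade-off mentioned above can be measured with the standard library alone. The sketch below is an assumption-laden illustration: `make_fake_model` builds a nested list that merely stands in for a serialized tree ensemble, and `pickle` plus `zlib` are one possible serialization and compression choice, not necessarily what the authors used. Absolute timings depend on the machine.

```python
import pickle
import time
import zlib


def make_fake_model(n_trees, depth):
    """Hypothetical stand-in for a tree ensemble: one list of
    (feature index, threshold) nodes per tree, 2**depth - 1 nodes each."""
    return [[(i % 50, 0.5) for i in range(2 ** depth - 1)] for _ in range(n_trees)]


def measure(model, compress):
    """Serialize the model, optionally compress it, and time the load path.

    Returns (stored size in bytes, load time in seconds, restored model).
    """
    blob = pickle.dumps(model)
    if compress:
        blob = zlib.compress(blob, 9)  # 9 = strongest zlib compression level
    start = time.perf_counter()
    raw = zlib.decompress(blob) if compress else blob
    restored = pickle.loads(raw)
    elapsed = time.perf_counter() - start
    return len(blob), elapsed, restored


model = make_fake_model(n_trees=20, depth=10)
plain_size, plain_time, plain_model = measure(model, compress=False)
packed_size, packed_time, packed_model = measure(model, compress=True)

print(f"uncompressed: {plain_size} bytes, load {plain_time * 1e3:.2f} ms")
print(f"compressed:   {packed_size} bytes, load {packed_time * 1e3:.2f} ms")
```

Raising `depth` in this stand-in roughly doubles the node count per level, which mirrors why deeper trees inflate the stored size; the compressed variant stores fewer bytes but adds a decompression step to every load.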

Full Text