Labeled Training Set Research Articles

This article, written by Special Publications Editor Adam Wilson, contains highlights of paper SPE 181015, “Natural-Language-Processing Techniques on Oil and Gas Drilling Data,” by M. Antoniak, J. Dalgliesh, SPE, and M. Verkruyse, Maana, and J. Lo, Chevron, prepared for the 2016 SPE Intelligent Energy International Conference and Exhibition, Aberdeen, 6–8 September. The paper has not been peer reviewed.

Recent advances in search, machine learning, and natural-language processing have made it possible to extract structured information from free text, providing a new and largely untapped source of insight for well and reservoir planning. However, major challenges are involved in applying these techniques to data that are messy or that lack a labeled training set. This paper presents a method to compare the distribution of hypothesized and realized risks to oil wells described in two data sets that contain free-text descriptions of risks.

Introduction

In the oil and gas industry, risk identification and risk assessment are critical. This holds particularly true during the drilling stages, which cannot begin before a risk assessment is conducted. While these risk assessments are typically conducted in a group setting, the project drilling engineer usually has a predetermined list of risks and likelihood scores that are the focus of the conversation. One problem with this approach is that drilling engineers are inherently biased by personal experience, which can affect their view of how likely an event is to happen. For example, if a project drilling engineer recently encountered well-control issues, the engineer will likely overestimate the chance of future well-control issues. On the other hand, if the engineer has never encountered a well-control issue, it may be unintentionally omitted from the risk assessment altogether.

Using historical data as a barometer could help the drilling engineer overcome these issues, though doing so requires a unified view of both prior risk assessments and prior issues encountered. Chevron maintains both data sets in disparate systems. The Risk Assessment database contains descriptions of risks from historical risk assessments, and the Well Operations database contains descriptions of unexpected events and associated unexpected-event codes, which categorize the unexpected events. A system has been created that draws on both databases and allows a project drilling engineer to enter a risk in natural language, retrieve drilling codes related to that risk, see statistics showing how often these types of events have happened in the past, and predict the likelihood of the problem occurring in certain fields.
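The lookup described above (free-text risk in, related event codes and their historical frequency out) can be illustrated with a minimal sketch. This is not the method used in the paper, which is not detailed here. It assumes a simple token-overlap (Jaccard) match, and the event table, the codes, the function names, and the similarity threshold are all invented for illustration.

```python
import re
from collections import Counter

# Invented toy records standing in for an unexpected-events table:
# (event code, free-text description). Not real data.
EVENTS = [
    ("WC", "well control kick detected while drilling"),
    ("WC", "influx and well control event during trip"),
    ("SP", "stuck pipe while pulling out of hole"),
    ("LC", "lost circulation in fractured zone"),
    ("SP", "pipe stuck after connection"),
]

def tokens(text):
    """Lowercase a description and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def code_frequencies(risk_text, events, min_sim=0.2):
    """Map each event code related to the query to its share of all events.

    A code counts as related when at least one of its descriptions reaches
    the (assumed) similarity threshold against the query text.
    """
    query = tokens(risk_text)
    matched = {code for code, desc in events
               if jaccard(query, tokens(desc)) >= min_sim}
    counts = Counter(code for code, _ in events)
    return {code: counts[code] / len(events) for code in matched}

if __name__ == "__main__":
    print(code_frequencies("possible stuck pipe", EVENTS))
```

A production system would likely need a more robust text representation than raw token overlap, since drilling reports are messy and use inconsistent vocabulary.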

Parathyroidectomy offers the only cure for primary hyperparathyroidism, but today only 50% of primary hyperparathyroidism patients are referred for operation, in large part because the condition is widely under-recognized. The diagnosis of primary hyperparathyroidism can be especially challenging with mild biochemical indices. Machine learning is a collection of methods in which computers build predictive algorithms based on labeled examples. With the aim of facilitating diagnosis, we tested the ability of machine learning to distinguish primary hyperparathyroidism from normal physiology using clinical and laboratory data.

This retrospective cohort study used a labeled training set and 10-fold cross-validation to evaluate accuracy of the algorithm. Measures of accuracy included area under the receiver operating characteristic curve, precision (sensitivity), and positive and negative predictive value. Several different algorithms and ensembles of algorithms were tested using the Weka platform. Among 11,830 patients managed operatively at 3 high-volume endocrine surgery programs from March 2001 to August 2013, 6,777 underwent parathyroidectomy for confirmed primary hyperparathyroidism, and 5,053 control patients without primary hyperparathyroidism underwent thyroidectomy. Test-set accuracies for machine learning models were determined using 10-fold cross-validation. Age, sex, and serum levels of preoperative calcium, phosphate, parathyroid hormone, vitamin D, and creatinine were defined as potential predictors of primary hyperparathyroidism. Mild primary hyperparathyroidism was defined as primary hyperparathyroidism with normal preoperative calcium or parathyroid hormone levels.

After testing a variety of machine learning algorithms, Bayesian network models proved most accurate, correctly classifying 95.2% of all primary hyperparathyroidism patients (area under receiver operating characteristic=0.989). Omitting parathyroid hormone from the model did not decrease the accuracy significantly (area under receiver operating characteristic=0.985). In mild disease cases, however, the Bayesian network model correctly classified 71.1% of patients with normal calcium and 92.1% with normal parathyroid hormone levels preoperatively. Combining Bayesian networking with AdaBoost improved accuracy to 97.2% of all primary hyperparathyroidism patients (area under receiver operating characteristic=0.994) and 91.9% of primary hyperparathyroidism patients with mild disease. This was a significant improvement relative to Bayesian networking alone (P<.0001).

Machine learning can accurately diagnose primary hyperparathyroidism without human input, even in mild disease. Incorporation of this tool into electronic medical record systems may aid in recognition of this under-diagnosed disorder.
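The evaluation procedure named in the abstract (10-fold cross-validation on a labeled set, scored by area under the receiver operating characteristic curve, sensitivity, and positive and negative predictive value) can be sketched as follows. This is an illustration only: the study used the Weka platform with Bayesian network and AdaBoost models, whereas the sketch below substitutes a trivial one-feature threshold classifier and synthetic data, and all function names are assumptions.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def ratio(num, den):
    """Safe division that returns NaN when the denominator is zero."""
    return num / den if den else float("nan")

def confusion_metrics(preds, labels):
    """Sensitivity, positive and negative predictive value from 0/1 labels."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return {"sensitivity": ratio(tp, tp + fn),
            "ppv": ratio(tp, tp + fp),
            "npv": ratio(tn, tn + fn)}

def fit_threshold(xs, ys):
    """Toy stand-in model: midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validate(xs, ys, k=10, seed=0):
    """Score every sample with a model fitted on the other k-1 folds."""
    scores = [0.0] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        thr = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            scores[i] = xs[i] - thr  # positive score means predicted positive
    preds = [1 if s > 0 else 0 for s in scores]
    return {"auc": auc(scores, ys), **confusion_metrics(preds, ys)}

def make_synthetic(n=400, seed=1):
    """Synthetic one-feature data: class means at +1 and -1, unit variance."""
    rng = random.Random(seed)
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(1.0 if y else -1.0, 1.0) for y in ys]
    return xs, ys

if __name__ == "__main__":
    xs, ys = make_synthetic()
    print(cross_validate(xs, ys))
```

Because each sample is scored only by a model that never saw it during fitting, the pooled metrics estimate performance on unseen patients rather than fit to the training data.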

Articles published on Labeled Training Set

Text sentiment analysis based on CBOW model and deep learning in big data environment

The first annotated set of scanning electron microscopy images for nanoscience

Learning Multiple Kernel Metrics for Iterative Person Re-Identification

Conversation Modeling with Neural Network

Learning to classify from impure samples with high-dimensional data

Ranking-Preserving Low-Rank Factorization for Image Annotation With Missing Labels

Rademacher Complexity Bounds for a Penalized Multi-class Semi-supervised Algorithm

An unsupervised classification approach for hyperspectral images based adaptive spatial and spectral neighborhood selection and graph clustering

Identifying genotype-phenotype relationships in biomedical text

Robust discriminative tracking via structured prior regularization

Neural Network for Nanoscience Scanning Electron Microscope Image Recognition

Natural-Language-Processing Techniques for Oil and Gas Drilling Data

Harbor Water Area Extraction From Pan-Sharpened Remotely Sensed Images Based on the Definition Circle Model

Clinical Research Informatics: Contributions from 2016

Sparse Representation-Based Semi-Supervised Regression for People Counting

Deep Label Distribution Learning With Label Ambiguity

Improving diagnostic recognition of primary hyperparathyroidism with machine learning

A Robust GNSS LOS/NLOS Signal Classifier

Improving semi-supervised self-training with embedded manifold transduction
