Data Quality Considerations for Petrophysical Machine-Learning Models

Decades of subsurface exploration and characterization have led to the collation and storage of large volumes of well-related data. The amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine-learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data are of poor quality, the impact on precision and accuracy of the prediction can be significant. Consequently, this can impact key decisions about the future of a well or a field. This study focuses on well-log data, which can be highly multidimensional, diverse, and stored in a variety of file formats. Well-log data exhibit key characteristics of big data: volume, variety, velocity, veracity, and value. Well data can include numeric values, text values, waveform data, image arrays, maps, and volumes, all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking them prior to carrying out petrophysical interpretations and applying machine-learning models. Well-log data can be affected by numerous issues causing a degradation in data quality. These include missing data ranging from single data points to entire curves, noisy data from tool-related issues, borehole washout, processing issues, incorrect environmental corrections, and mislabeled data. Having vast quantities of data does not mean they can all be passed into a machine-learning algorithm with the expectation that the resultant prediction is fit for purpose. It is essential that the most important and relevant data are passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, but it also reduces computational time and can provide a better understanding of how the models reach their conclusion.
This paper reviews data quality issues typically faced by petrophysicists when working with well-log data and deploying machine-learning models. This is achieved by first providing an overview of machine learning and big data within the petrophysical domain, followed by a review of the common well-log data issues, their impact on machine-learning algorithms, and methods for mitigating their influence.
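
The missing-data problem described above (gaps ranging from single samples to entire curves) can be illustrated with a small audit step. The sketch below is not taken from the paper: the function names, the 30% threshold, and the sample values are hypothetical, and the null sentinel -999.25 is an assumption based on the value commonly used in LAS files. It only screens curves by completeness and is not a substitute for a proper feature selection technique.

```python
import math

# Assumed null sentinel; -999.25 is a common convention in LAS files,
# but the actual value should be read from the file header.
NULL_VALUE = -999.25


def missing_fraction(curve, null_value=NULL_VALUE):
    """Return the fraction of samples that are None, NaN, or the null sentinel."""
    if not curve:
        return 1.0  # an empty curve is treated as entirely missing
    missing = sum(
        1 for v in curve
        if v is None or math.isnan(v) or v == null_value
    )
    return missing / len(curve)


def audit_curves(curves, max_missing=0.3, null_value=NULL_VALUE):
    """Report missing fractions per curve and list curves complete enough to keep.

    `max_missing` is a hypothetical cut-off chosen for illustration only.
    """
    report = {name: missing_fraction(vals, null_value) for name, vals in curves.items()}
    keep = [name for name, frac in report.items() if frac <= max_missing]
    return report, keep


# Hypothetical curves: gamma ray, bulk density, and sonic travel time.
logs = {
    "GR": [45.2, 50.1, NULL_VALUE, 61.3],
    "RHOB": [2.45, 2.50, 2.48, 2.51],
    "DT": [NULL_VALUE, NULL_VALUE, NULL_VALUE, 88.0],
}
report, keep = audit_curves(logs)
print(report)
print(keep)
```

In this toy input the mostly empty DT curve is excluded, while GR, with a single missing sample, is retained for later imputation or interpretation.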

DATA QUALITY CONSIDERATIONS FOR PETROPHYSICAL MACHINE LEARNING MODELS

10.30632/spwla-2021-0036 ◽

2021 ◽

Author(s):

Andrew McDonald ◽

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Quality ◽

Input Data ◽

Well Log ◽

Learning Models ◽

Log Data ◽

Quality Issues ◽

Machine Learning Models

Decades of subsurface exploration and characterisation have led to the collation and storage of large volumes of well-related data. The amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data is of poor quality, the impact on precision and accuracy of the prediction can be significant. Consequently, this can impact key decisions about the future of a well or a field. This study focuses on well log data, which can be highly multi-dimensional, diverse and stored in a variety of file formats. Well log data exhibits key characteristics of Big Data: Volume, Variety, Velocity, Veracity and Value. Well data can include numeric values, text values, waveform data, image arrays, maps and volumes, all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking it prior to carrying out petrophysical interpretations and applying machine learning models. Well log data can be affected by numerous issues causing a degradation in data quality. These include missing data, ranging from single data points to entire curves; noisy data from tool-related issues; borehole washout; processing issues; incorrect environmental corrections; and mislabelled data. Having vast quantities of data does not mean it can all be passed into a machine learning algorithm with the expectation that the resultant prediction is fit for purpose. It is essential that the most important and relevant data is passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, it also reduces computational time and can provide a better understanding of how the models reach their conclusion.
This paper reviews data quality issues typically faced by petrophysicists when working with well log data and deploying machine learning models. First, an overview of machine learning and Big Data is covered in relation to petrophysical applications. Secondly, data quality issues commonly faced with well log data are discussed. Thirdly, methods are suggested on how to deal with data issues prior to modelling. Finally, multiple case studies are discussed covering the impacts of data quality on predictive capability.

Environmental assessment based surface water quality prediction using hyper-parameter optimized machine learning models based on consistent big data

Process Safety and Environmental Protection ◽

10.1016/j.psep.2021.05.026 ◽

2021 ◽

Author(s):

Muhammad Izhar Shah ◽

Muhammad Faisal Javed ◽

Abdulaziz Alqahtani ◽

Ali Aldrees

Keyword(s):

Machine Learning ◽

Water Quality ◽

Big Data ◽

Surface Water ◽

Environmental Assessment ◽

Surface Water Quality ◽

Quality Prediction ◽

Learning Models ◽

Water Quality Prediction ◽

Machine Learning Models

Application of Bioactivity Profile Based Fingerprints for Building Machine Learning Models

10.26434/chemrxiv.6969584 ◽

2018 ◽

Cited By ~ 1

Author(s):

Noé Sturm ◽

Jiangming Sun ◽

Yves Vandriessche ◽

Andreas Mayr ◽

Günter Klambauer ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

High Throughput ◽

Scaffold Hopping ◽

Learning Models ◽

Industrial Data ◽

Structural Descriptors ◽

Bioactivity Profile ◽

Machine Learning Models

This article describes an application of high-throughput fingerprints (HTSFP) built upon industrial data accumulated over the years. The fingerprint was used to build machine learning models (multi-task deep learning + SVM) for compound activity predictions towards a panel of 131 targets. Quality of the predictions and the scaffold hopping potential of the HTSFP were systematically compared to traditional structural descriptors ECFP.

Automated Retraining of Machine Learning Models

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3322.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 445-452

Keyword(s):

Machine Learning ◽

Input Data ◽

Research Work ◽

Learning Models ◽

Machine Learning Methods ◽

Machine Learning Model ◽

Crucial Component ◽

Conventional Machine ◽

Over Time ◽

Machine Learning Models

Data is the most crucial component of a successful ML system. Once a machine learning model is developed, it becomes obsolete over time as new input data are generated every second. To keep our predictions accurate, we need a way to keep our models up to date. Our research work involves finding a mechanism that can retrain the model with new data automatically. This research also involves exploring the possibilities of automating machine learning processes. We started this project by training and testing our model using conventional machine learning methods. The outcome was then compared with the outcome of experiments conducted using AutoML methods such as TPOT. This helped us find an efficient technique to retrain our models. These techniques can be used in areas where people do not deal with the actual working of an ML model but only require the outputs of ML processes.

Improving Real-Time Drilling Data Quality Using Artificial Intelligence and Machine Learning Techniques

10.2118/204658-ms ◽

2021 ◽

Author(s):

S. H. Al Gharbi ◽

A. A. Al-Majed ◽

A. Abdulraheem ◽

S. Patil ◽

S. M. Elkatatny

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Data Quality ◽

Real Time ◽

Input Data ◽

Support Vector ◽

The Real ◽

Drilling Data ◽

Drilling Operations

Due to high demand for energy, oil and gas companies started to drill wells in remote areas and unconventional environments. This raised the complexity of drilling operations, which were already challenging and complex. To adapt, drilling companies expanded their use of the real-time operation center (RTOC) concept, in which real-time drilling data are transmitted from remote sites to companies’ headquarters. In RTOC, groups of subject matter experts monitor the drilling live and provide real-time advice to improve operations. With the increase in drilling operations, processing the volume of generated data is beyond a human's capability, limiting the RTOC impact on certain components of drilling operations. To overcome this limitation, artificial intelligence and machine learning (AI/ML) technologies were introduced to monitor and analyze the real-time drilling data, discover hidden patterns, and provide fast decision-support responses. AI/ML technologies are data-driven technologies, and their quality relies on the quality of the input data: if the quality of the input data is good, the generated output will be good; if not, the generated output will be bad. Unfortunately, due to the harsh environments of drilling sites and the transmission setups, not all of the drilling data are good, which negatively affects the AI/ML results. The objective of this paper is to utilize AI/ML technologies to improve the quality of real-time drilling data. The paper fed a large real-time drilling dataset, consisting of over 150,000 raw data points, into Artificial Neural Network (ANN), Support Vector Machine (SVM) and Decision Tree (DT) models. The models were trained on the valid and not-valid data points. The confusion matrix was used to evaluate the different AI/ML models, including different internal architectures. Despite the slowness of ANN, it achieved the best result with an accuracy of 78%, compared to 73% and 41% for DT and SVM, respectively.
The paper concludes by presenting a process for using AI technology to improve real-time drilling data quality. To the author's knowledge based on literature in the public domain, this paper is one of the first to compare the use of multiple AI/ML techniques for quality improvement of real-time drilling data. The paper provides a guide for improving the quality of real-time drilling data.
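
The abstract above evaluates classifiers of valid versus not-valid data points with a confusion matrix and reports accuracy. As a minimal sketch of how that metric follows from the matrix, the example below counts the four cells for a binary labelling and derives accuracy from them. The function names, label strings, and sample labels are hypothetical and are not taken from the paper, and the data are invented purely for illustration.

```python
def confusion_counts(y_true, y_pred, positive="valid"):
    """Count true/false positives and negatives for a binary labelling."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Accuracy = correctly classified points / all points."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (counts["tp"] + counts["tn"]) / total


# Hypothetical labels for five drilling data points.
y_true = ["valid", "valid", "invalid", "invalid", "valid"]
y_pred = ["valid", "invalid", "invalid", "valid", "valid"]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(counts))
```

Accuracy alone can hide an imbalance between valid and not-valid points, which is one reason the individual cells of the matrix are worth inspecting alongside it.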

Machine Learning Models and Algorithms for Big Data Classification

10.1007/978-1-4899-7641-3 ◽

2016 ◽

Cited By ~ 41

Author(s):

Shan Suthaharan

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Classification ◽

Learning Models ◽

Big Data Classification ◽

Machine Learning Models

Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data

Water Research ◽

10.1016/j.watres.2019.115454 ◽

2020 ◽

Vol 171 ◽

pp. 115454 ◽

Cited By ~ 9

Author(s):

Kangyang Chen ◽

Hexia Chen ◽

Chuanlong Zhou ◽

Yichao Huang ◽

Xiangyang Qi ◽

...

Keyword(s):

Machine Learning ◽

Water Quality ◽

Big Data ◽

Surface Water Quality ◽

Prediction Performance ◽

Quality Prediction ◽

Learning Models ◽

Water Parameters ◽

Water Quality Prediction ◽

Machine Learning Models

Forecasting residential gas consumption with machine learning algorithms on weather data

E3S Web of Conferences ◽

10.1051/e3sconf/201911105019 ◽

2019 ◽

Vol 111 ◽

pp. 05019

Author(s):

Brian de Keijzer ◽

Pol de Visser ◽

Víctor García Romillo ◽

Víctor Gómez Muñoz ◽

Daan Boesten ◽

...

Keyword(s):

Machine Learning ◽

Energy Consumption ◽

Energy Use ◽

Machine Learning Algorithms ◽

Weather Data ◽

Computational Time ◽

Percentage Error ◽

Learning Models ◽

Gas Consumption ◽

Machine Learning Models

Machine learning models have proven to be reliable methods for forecasting energy use in commercial and office buildings. However, little research has been done on energy forecasting in dwellings, mainly due to the difficulty of obtaining household-level data while keeping the privacy of inhabitants in mind. Gaining insight into near-future energy consumption can help in balancing the grid and can reveal ways to reduce energy consumption. In collaboration with OPSCHALER, a measurement campaign on the influence of housing characteristics on energy costs and comfort, several machine learning models were compared on forecasting performance and the computational time needed. Nine months of data containing the mean gas consumption of 52 dwellings at a one-hour resolution were used for this research. The first 6 months were used for training, whereas the last 3 months were used to evaluate the models. The results showed that the Deep Neural Network (DNN) performed best with a 50.1% Mean Absolute Percentage Error (MAPE) at a one-hour resolution. When comparing daily and weekly resolutions, the Multivariate Linear Regression (MVLR) outperformed other models, with a 20.1% and 17.0% MAPE, respectively. The models were programmed in Python.
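
The evaluation described above combines a chronological train/test split (the first 6 of 9 months for training) with the Mean Absolute Percentage Error. The sketch below shows both steps under stated assumptions: the function names and sample values are hypothetical, zero-valued actuals are skipped to avoid division by zero (a choice made here, not one reported in the abstract), and the toy series is far shorter than the study's hourly data.

```python
def chronological_split(series, train_fraction):
    """Split a time-ordered series without shuffling, so the test set lies in the future."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent, skipping zero-valued actuals."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Hypothetical series of nine monthly values; 6/9 go to training.
series = [10, 12, 11, 13, 14, 15, 16, 18, 17]
train, test = chronological_split(series, 6 / 9)
print(train, test)

# Hypothetical forecast compared against hypothetical measurements.
error = mape([100.0, 200.0], [110.0, 180.0])
print(round(error, 6))
```

Because MAPE divides by the measured value, periods of very low consumption inflate the percentage error, which may contribute to hourly errors being larger than daily or weekly ones.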

Cyber-Physical LPG Debutanizer Distillation Columns: Machine Learning-Based Soft Sensors for Product Quality Monitoring

10.20944/preprints202110.0364.v1 ◽

2021 ◽

Author(s):

Jože M. Rožanec ◽

Elena Trajkova ◽

Jinzhi Lu ◽

Nikolaos Sarantinoudis ◽

Georgios Arampatzis ◽

...

Keyword(s):

Machine Learning ◽

Product Quality ◽

Learning Models ◽

State Monitoring ◽

Soft Sensors ◽

Distillation Columns ◽

Operational Conditions ◽

Equipment State ◽

Machine Learning Models

Refineries execute a series of interlinked processes, where the product of one unit serves as the input to another process. Potential failures within these processes affect the quality of the end products, operational efficiency, and revenue of the entire refinery. In this context, implementation of a real-time cognitive module, referring to predictive machine learning models, enables the provision of equipment state monitoring services and the generation of decision-making for equipment operations. In this paper, we propose two machine learning models: 1) to forecast the amount of pentane (C5) content in the final product mixture; 2) to identify if C5 content exceeds the specification thresholds for the final product quality. We validate our approach using a use case from a real-world refinery. In addition, we develop a visualization to assess which features are considered most important during feature selection, and later by the machine learning models. Finally, we provide insights on the sensor values in the dataset, which help to identify the operational conditions for using such machine learning models.

Cyber-Physical LPG Debutanizer Distillation Columns: Machine-Learning-Based Soft Sensors for Product Quality Monitoring

Applied Sciences ◽

10.3390/app112411790 ◽

2021 ◽

Vol 11 (24) ◽

pp. 11790

Author(s):

Jože Martin Rožanec ◽

Elena Trajkova ◽

Jinzhi Lu ◽

Nikolaos Sarantinoudis ◽

George Arampatzis ◽

...

Keyword(s):

Machine Learning ◽

Product Quality ◽

Learning Models ◽

State Monitoring ◽

Soft Sensors ◽

Distillation Columns ◽

Operational Conditions ◽

Equipment State ◽

Machine Learning Models

Refineries execute a series of interlinked processes, where the product of one unit serves as the input to another process. Potential failures within these processes affect the quality of the end products, operational efficiency, and revenue of the entire refinery. In this context, implementation of a real-time cognitive module, referring to predictive machine learning models, enables the provision of equipment state monitoring services and the generation of decision-making for equipment operations. In this paper, we propose two machine learning models: (1) to forecast the amount of pentane (C5) content in the final product mixture; (2) to identify if C5 content exceeds the specification thresholds for the final product quality. We validate our approach using a use case from a real-world refinery. In addition, we develop a visualization to assess which features are considered most important during feature selection, and later by the machine learning models. Finally, we provide insights on the sensor values in the dataset, which help to identify the operational conditions for using such machine learning models.
