A Critical Literature Review on Rock Petrophysical Properties Estimation from Images Based on Direct Simulation and Machine Learning Techniques

Abstract Estimation of petrophysical properties is essential for accurate reservoir predictions. In recent years, extensive work has been dedicated into training different machine-learning (ML) models to predict petrophysical properties of digital rock using dry rock images along with data from single-phase direct simulations, such as lattice Boltzmann method (LBM) and finite volume method (FVM). The objective of this paper is to present a comprehensive literature review on petrophysical properties estimation from dry rock images using different ML workflows and direct simulation methods. The review provides detailed comparison between different ML algorithms that have been used in the literature to estimate porosity, permeability, tortuosity, and effective diffusivity. In this paper, various ML workflows from the literature are screened and compared in terms of the training data set, the testing data set, the extracted features, the algorithms employed as well as their accuracy. A thorough description of the most commonly used algorithms is also provided to better understand the functionality of these algorithms to encode the relationship between the rock images and their respective petrophysical properties. The review of various ML workflows for estimating rock petrophysical properties from dry images shows that models trained using features extracted from the image (physics-informed models) outperformed models trained on the dry images directly. In addition, certain tree-based ML algorithms, such as random forest, gradient boosting, and extreme gradient boosting can produce accurate predictions that are comparable to deep learning algorithms such as deep neural networks (DNNs) and convolutional neural networks (CNNs). To the best of our knowledge, this is the first work dedicated to exploring and comparing between different ML frameworks that have recently been used to accurately and efficiently estimate rock petrophysical properties from images. This work will enable other researchers to have a broad understanding about the topic and help in developing new ML workflows or further modifying exiting ones in order to improve the characterization of rock properties. Also, this comparison represents a guide to understand the performance and applicability of different ML algorithms. Moreover, the review helps the researchers in this area to cope with digital innovations in porous media characterization in this fourth industrial age – oil and gas 4.0.

Download Full-text

Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports

JCO Clinical Cancer Informatics ◽

10.1200/cci.20.00028 ◽

2020 ◽

pp. 865-874

Author(s):

Enrico Santus ◽

Tal Schuster ◽

Amir M. Tahmasebi ◽

Clara Li ◽

Adam Yala ◽

...

Keyword(s):

Machine Learning ◽

Hybrid Systems ◽

High Performance ◽

Feature Model ◽

Training Data ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Extreme Gradient Boosting ◽

Pathology Reports

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.

Download Full-text

Applying Deep Neural Networks and Ensemble Machine Learning Methods to Forecast Airborne Ambrosia Pollen

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph16111992 ◽

2019 ◽

Vol 16 (11) ◽

pp. 1992 ◽

Cited By ~ 6

Author(s):

Gebreab K. Zewdie ◽

David J. Lary ◽

Estelle Levetin ◽

Gemechu F. Garuma

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Land Surface ◽

Deep Neural Networks ◽

Airborne Pollen ◽

Training Data ◽

Gradient Boosting ◽

Learning Approaches ◽

Ambrosia Pollen ◽

Extreme Gradient Boosting

Allergies to airborne pollen are a significant issue affecting millions of Americans. Consequently, accurately predicting the daily concentration of airborne pollen is of significant public benefit in providing timely alerts. This study presents a method for the robust estimation of the concentration of airborne Ambrosia pollen using a suite of machine learning approaches including deep learning and ensemble learners. Each of these machine learning approaches utilize data from the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric weather and land surface reanalysis. The machine learning approaches used for developing a suite of empirical models are deep neural networks, extreme gradient boosting, random forests and Bayesian ridge regression methods for developing our predictive model. The training data included twenty-four years of daily pollen concentration measurements together with ECMWF weather and land surface reanalysis data from 1987 to 2011 is used to develop the machine learning predictive models. The last six years of the dataset from 2012 to 2017 is used to independently test the performance of the machine learning models. The correlation coefficients between the estimated and actual pollen abundance for the independent validation datasets for the deep neural networks, random forest, extreme gradient boosting and Bayesian ridge were 0.82, 0.81, 0.81 and 0.75 respectively, showing that machine learning can be used to effectively forecast the concentrations of airborne pollen.

Download Full-text

Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks

Proceedings of the International conference “InterCarto/InterGIS” ◽

10.35595/2414-9179-2020-3-26-53-61 ◽

2020 ◽

Vol 26 (3) ◽

pp. 53-61

Author(s):

Pavel Kikin ◽

Alexey Kolesnikov ◽

Alexey Portnov ◽

Denis Grischenko

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Mathematical Models ◽

Optimal Algorithm ◽

The State ◽

Gradient Boosting ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods ◽

Spatio Temporal

The state of ecological systems, along with their general characteristics, is almost always described by indicators that vary in space and time, which leads to a significant complication of constructing mathematical models for predicting the state of such systems. One of the ways to simplify and automate the construction of mathematical models for predicting the state of such systems is the use of machine learning methods. The article provides a comparison of traditional and based on neural networks, algorithms and machine learning methods for predicting spatio-temporal series representing ecosystem data. Analysis and comparison were carried out among the following algorithms and methods: logistic regression, random forest, gradient boosting on decision trees, SARIMAX, neural networks of long-term short-term memory (LSTM) and controlled recurrent blocks (GRU). To conduct the study, data sets were selected that have both spatial and temporal components: the values of the number of mosquitoes, the number of dengue infections, the physical condition of tropical grove trees, and the water level in the river. The article discusses the necessary steps for preliminary data processing, depending on the algorithm used. Also, Kolmogorov complexity was calculated as one of the parameters that can help formalize the choice of the most optimal algorithm when constructing mathematical models of spatio-temporal data for the sets used. Based on the results of the analysis, recommendations are given on the application of certain methods and specific technical solutions, depending on the characteristics of the data set that describes a particular ecosystem

Download Full-text

Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis (Preprint)

10.2196/preprints.27344 ◽

2021 ◽

Author(s):

Sang Min Nam ◽

Thomas A Peterson ◽

Kyoung Yul Seo ◽

Hyun Wook Han ◽

Jee In Kang

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Network Analysis ◽

Survey Data ◽

Associated Factors ◽

Statistical Tests ◽

Epidemiological Studies ◽

Gradient Boosting ◽

Data Set ◽

Extreme Gradient Boosting

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (<i>P</i><.05) and indirect (<i>P</i>≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.

Download Full-text

A Machine Learning Study to Improve Surgical Case Duration Prediction

10.21203/rs.3.rs-40927/v1 ◽

2020 ◽

Author(s):

Ching-Chieh Huang ◽

Jesyin Lai ◽

Der-Yang Cho ◽

Jiaxin Yu

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Healthcare Management ◽

Gradient Boosting ◽

External Evaluation ◽

Data Set ◽

Surgical Case ◽

Case Duration ◽

Extreme Gradient Boosting ◽

Duration Prediction

Abstract Since the emergence of COVID-19, many hospitals have encountered challenges in performing efficient scheduling and good resource management to ensure the quality of healthcare provided to patients is not compromised. Operating room (OR) scheduling is one of the issues that has gained our attention because it is related to workflow efficiency and critical care of hospitals. Automatic scheduling and high predictive accuracy of surgical case duration have a critical role in improving OR utilization. To estimate surgical case duration, many hospitals rely on historic averages based on a specific surgeon or a specific procedure type obtained from electronic medical record (EMR) scheduling systems. However, the low predictive accuracy with EMR data leads to negative impacts on patients and hospitals, such as rescheduling of surgeries and cancellation. In this study, we aim to improve the prediction of surgical case duration with advanced machine learning (ML) algorithms. We obtained a large data set containing 170,748 surgical cases (from Jan 2017 to Dec 2019) from a hospital. The data covered a broad variety of details on patients, surgeries, specialties and surgical teams. In addition, a more recent data set with 8,672 cases (from Mar to Apr 2020) was available to be used for external evaluation. We computed historic averages from the EMR data for surgeon- or procedure-specific cases, and they were used as baseline models for comparison. Subsequently, we developed our models using linear regression, random forest and extreme gradient boosting (XGB) algorithms. All models were evaluated with R-square (R2), mean absolute error (MAE), and percentage overage (actual duration longer than prediction), underage (shorter than prediction) and within (within prediction). The XGB model was superior to the other models, achieving a higher R2 (85 %) and percentage within (48 %) as well as a lower MAE (30.2 min). The total prediction errors computed for all models showed that the XGB model had the lowest inaccurate percentage (23.7 %). Overall, this study applied ML techniques in the field of OR scheduling to reduce the medical and financial burden for healthcare management. The results revealed the importance of surgery and surgeon factors in surgical case duration prediction. This study also demonstrated the importance of performing an external evaluation to better validate the performance of ML models.

Download Full-text

Application of Machine Learning to Interpret Steady State Drainage Relative Permeability Experiments

10.2118/207877-ms ◽

2021 ◽

Author(s):

Eric Sonny Mathew ◽

Moussa Tembely ◽

Waleed AlAmeri ◽

Emad W. Al-Shalabi ◽

Abdul Ravoof Shaik

Keyword(s):

Neural Network ◽

Machine Learning ◽

Experimental Data ◽

Steady State ◽

Relative Permeability ◽

Learning Model ◽

Gradient Boosting ◽

Data Set ◽

Machine Learning Model ◽

Extreme Gradient Boosting

Abstract A meticulous interpretation of steady-state or unsteady-state relative permeability (Kr) experimental data is required to determine a complete set of Kr curves. In this work, three different machine learning models was developed to assist in a faster estimation of these curves from steady-state drainage coreflooding experimental runs. The three different models that were tested and compared were extreme gradient boosting (XGB), deep neural network (DNN) and recurrent neural network (RNN) algorithms. Based on existing mathematical models, a leading edge framework was developed where a large database of Kr and Pc curves were generated. This database was used to perform thousands of coreflood simulation runs representing oil-water drainage steady-state experiments. The results obtained from these simulation runs, mainly pressure drop along with other conventional core analysis data, were utilized to estimate Kr curves based on Darcy's law. These analytically estimated Kr curves along with the previously generated Pc curves were fed as features into the machine learning model. The entire data set was split into 80% for training and 20% for testing. K-fold cross validation technique was applied to increase the model accuracy by splitting the 80% of the training data into 10 folds. In this manner, for each of the 10 experiments, 9 folds were used for training and the remaining one was used for model validation. Once the model is trained and validated, it was subjected to blind testing on the remaining 20% of the data set. The machine learning model learns to capture fluid flow behavior inside the core from the training dataset. The trained/tested model was thereby employed to estimate Kr curves based on available experimental results. The performance of the developed model was assessed using the values of the coefficient of determination (R2) along with the loss calculated during training/validation of the model. The respective cross plots along with comparisons of ground-truth versus AI predicted curves indicate that the model is capable of making accurate predictions with error percentage between 0.2 and 0.6% on history matching experimental data for all the three tested ML techniques (XGB, DNN, and RNN). This implies that the AI-based model exhibits better efficiency and reliability in determining Kr curves when compared to conventional methods. The results also include a comparison between classical machine learning approaches, shallow and deep neural networks in terms of accuracy in predicting the final Kr curves. The various models discussed in this research work currently focusses on the prediction of Kr curves for drainage steady-state experiments; however, the work can be extended to capture the imbibition cycle as well.

Download Full-text

Performance of Machine Learning Algorithms and Diversity in Data

MATEC Web of Conferences ◽

10.1051/matecconf/201821004019 ◽

2018 ◽

Vol 210 ◽

pp. 04019 ◽

Cited By ~ 1

Author(s):

Hyontai SUG

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Real World ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data ◽

Real World Data ◽

Random Data ◽

Data Set ◽

World Data

Recent world events in go games between human and artificial intelligence called AlphaGo showed the big advancement in machine learning technologies. While AlphaGo was trained using real world data, AlphaGo Zero was trained using massive random data, and the fact that AlphaGo Zero won AlphaGo completely revealed that diversity and size in training data is important for better performance for the machine learning algorithms, especially in deep learning algorithms of neural networks. On the other hand, artificial neural networks and decision trees are widely accepted machine learning algorithms because of their robustness in errors and comprehensibility respectively. In this paper in order to prove that diversity and size in data are important factors for better performance of machine learning algorithms empirically, the two representative algorithms are used for experiment. A real world data set called breast tissue was chosen, because the data set consists of real numbers that is very good property for artificial random data generation. The result of the experiment proved the fact that the diversity and size of data are very important factors for better performance.

Download Full-text

Prediction of Water Saturation from Well Log Data by Machine Learning Algorithms: Boosting and Super Learner

Journal of Marine Science and Engineering ◽

10.3390/jmse9060666 ◽

2021 ◽

Vol 9 (6) ◽

pp. 666

Author(s):

Fahimeh Hadavimoghaddam ◽

Mehdi Ostadhassan ◽

Mohammad Ali Sadri ◽

Tatiana Bondarenko ◽

Igor Chebyshev ◽

...

Keyword(s):

Machine Learning ◽

Water Saturation ◽

Machine Learning Algorithms ◽

Rock Properties ◽

Gradient Boosting ◽

Data Set ◽

Log Data ◽

Gamma Density ◽

Super Learner ◽

Resistivity Log

Intelligent predictive methods have the power to reliably estimate water saturation (Sw) compared to conventional experimental methods commonly performed by petrphysicists. However, due to nonlinearity and uncertainty in the data set, the prediction might not be accurate. There exist new machine learning (ML) algorithms such as gradient boosting techniques that have shown significant success in other disciplines yet have not been examined for Sw prediction or other reservoir or rock properties in the petroleum industry. To bridge the literature gap, in this study, for the first time, a total of five ML code programs that belong to the family of Super Learner along with boosting algorithms: XGBoost, LightGBM, CatBoost, AdaBoost, are developed to predict water saturation without relying on the resistivity log data. This is important since conventional methods of water saturation prediction that rely on resistivity log can become problematic in particular formations such as shale or tight carbonates. Thus, to do so, two datasets were constructed by collecting several types of well logs (Gamma, density, neutron, sonic, PEF, and without PEF) to evaluate the robustness and accuracy of the models by comparing the results with laboratory-measured data. It was found that Super Learner and XGBoost produced the highest accurate output (R2: 0.999 and 0.993, respectively), and with considerable distance, Catboost and LightGBM were ranked third and fourth, respectively. Ultimately, both XGBoost and Super Learner produced negligible errors but the latest is considered as the best amongst all.

Download Full-text

I NTRODUCING A NEW T ECHNICAL I NDICATOR BASED ON OCTAV O NICESCU I NFORMATIONAL E NERGY AND COMPARE IT WITH B OLLINGER BANDS FOR S&P 500 M OVEMENT P REDICTIONS

10.31219/osf.io/m478b ◽

2019 ◽

Author(s):

Daia Alexandru

Keyword(s):

Machine Learning ◽

Kinetic Energy ◽

Stock Market ◽

Historical Data ◽

Gradient Boosting ◽

Data Set ◽

Extreme Gradient Boosting ◽

The Future ◽

Tangible Evidence ◽

Market Trends

This research paper demonstrates the invention of the kinetic bands, based on Romanian mathematician and statistician Octav Onicescu’s kinetic energy, also known as “informational energy”, where we use historical data of foreign exchange currencies or indexes to predict the trend displayed by a stock or an index and whether it will go up or down in the future. Here, we explore the imperfections of the Bollinger Bands to determine a more sophisticated triplet of indicators that predict the future movement of prices in the Stock Market. An Extreme Gradient Boosting Modelling was conducted in Python using historical data set from Kaggle, the historical data set spanning all current 500 companies listed. An invariable importance feature was plotted. The results displayed that Kinetic Bands, derived from (KE) are very influential as features or technical indicators of stock market trends. Furthermore, experiments done through this invention provide tangible evidence of the empirical aspects of it. The machine learning code has low chances of error if all the proper procedures and coding are in play. The experiment samples are attached to this study for future references or scrutiny.

Download Full-text

Petrofacies classification using machine learning algorithms

Geophysics ◽

10.1190/geo2019-0439.1 ◽

2020 ◽

Vol 85 (4) ◽

pp. WA101-WA113 ◽

Cited By ~ 3

Author(s):

Adrielle A. Silva ◽

Mônica W. Tavares ◽

Abel Carrasquilla ◽

Roseane Misságia ◽

Marco Ceia

Keyword(s):

Machine Learning ◽

Oil And Gas ◽

Water Saturation ◽

Carbonate Reservoir ◽

Machine Learning Algorithms ◽

Carbonate Reservoirs ◽

Training Data ◽

Southeastern Brazil ◽

Gradient Boosting ◽

Data Set

Carbonate reservoirs represent a large portion of the world’s oil and gas reserves, exhibiting specific characteristics that pose complex challenges to the reservoirs’ characterization, production, and management. Therefore, the evaluation of the relationships between the key parameters, such as porosity, permeability, water saturation, and pore size distribution, is a complex task considering only well-log data, due to the geologic heterogeneity. Hence, the petrophysical parameters are the key to assess the original composition and postsedimentological aspects of the carbonate reservoirs. The concept of reservoir petrofacies was proposed as a tool for the characterization and prediction of the reservoir quality as it combines primary textural analysis with laboratory measurements of porosity, permeability, capillary pressure, photomicrograph descriptions, and other techniques, which contributes to understanding the postdiagenetic events. We have adopted a workflow to petrofacies classification of a carbonate reservoir from the Campos Basin in southeastern Brazil, using the following machine learning methods: decision tree, random forest, gradient boosting, K-nearest neighbors, and naïve Bayes. The data set comprised 1477 wireline data from two wells (A3 and A10) that had petrofacies classes already assigned based on core descriptions. It was divided into two subsets, one for training and one for testing the capability of the trained models to assign petrofacies. The supervised-learning models have used labeled training data to learn the relationships between the input measurements and the petrofacies to be assigned. Additionally, we have developed a comparison of the models’ performance using the testing set according to accuracy, precision, recall, and F1-score evaluation metrics. Our approach has proved to be a valuable ally in petrofacies classification, especially for analyzing a well-logging database with no prior petrophysical information.

Download Full-text