The Implementation of Machine Learning in Lithofacies Classification using Multi Well Logs Data

Sudarmaji Saroji; Ekrar Winata; Putra Pratama Wahyu Hidayat; Suryo Prakoso; Firman Herdiansyah

doi:10.13170/aijst.10.1.18749

Aceh International Journal of Science and Technology ◽

10.13170/aijst.10.1.18749 ◽

2021 ◽

Vol 10 (1) ◽

pp. 9-17

Author(s):

Sudarmaji Saroji ◽

Ekrar Winata ◽

Putra Pratama Wahyu Hidayat ◽

Suryo Prakoso ◽

Firman Herdiansyah

Keyword(s):

Machine Learning ◽

Gamma Ray ◽

Distribution Patterns ◽

Effective Porosity ◽

Depositional Environments ◽

Beach Sand ◽

Support Vector ◽

Data Sets ◽

Neutron Porosity ◽

Lithofacies Classification

Lithofacies classification is the process of identifying rock lithology from indirect measurements. The classification is usually done manually by an experienced geoscientist. This research presents an automated lithofacies classification that uses machine learning to shorten the time the classification process takes. The support vector machine (SVM) algorithm has been applied successfully to the Damar field, Indonesia. The inputs to the machine learning model are various well-log data sets, e.g., gamma-ray, density, resistivity, neutron porosity, and effective porosity. The model can classify seven lithofacies and depositional environments, including channel, bar sand, beach sand, carbonate, volcanic, and shale. The classification accuracy in the verification phase, using lithofacies class data seen during training, exceeded 90%, while the accuracy in the validation phase, using data outside the training set, reached 65%. The classified lithofacies can then be used as input for describing lateral and vertical rock distribution patterns.
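
To make the SVM idea concrete, here is a minimal sketch of a linear, two-class SVM trained by subgradient descent on the hinge loss. It is not the authors' implementation: the abstract does not state the kernel, the multi-class scheme, or any hyperparameters. The function names, the two features (z-scored gamma-ray and neutron porosity), the shale/sand labels, and every numeric value are hypothetical and chosen only for illustration.

```python
# Minimal linear SVM (hinge loss + L2 penalty) trained by full-batch
# subgradient descent. All feature values below are hypothetical z-scores,
# not data from the Damar field.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Return (w, b) for labels y in {+1, -1}."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin: hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Map the sign of the decision function to a (hypothetical) facies label."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return "shale" if score >= 0 else "sand"

# Toy data: [gamma-ray z-score, neutron-porosity z-score].
# Shale samples read high on both logs; sand samples mirror them.
shale = [[1.0, 0.8], [1.4, 1.2], [0.9, 1.5], [1.6, 0.7]]
sand = [[-v for v in row] for row in shale]
X = shale + sand
y = [1] * len(shale) + [-1] * len(sand)

w, b = train_linear_svm(X, y)
train_accuracy = sum(
    predict(w, b, xi) == ("shale" if yi == 1 else "sand") for xi, yi in zip(X, y)
) / len(X)
```

Accuracy on the training samples corresponds to what the abstract calls the verification phase. The validation figure would instead come from applying `predict` to wells that were held out of training, which is where the reported accuracy drops to 65%.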

A One-shot Learning Approach to Image Classification using Genetic Programming

10.26686/wgtn.13150934.v1 ◽

2020 ◽

Author(s):

Harith Al-Sahaf ◽

Mengjie Zhang ◽

M Johnston

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Genetic Programming ◽

Image Classification ◽

Local Binary Patterns ◽

Support Vector ◽

Learning Approach ◽

Data Sets ◽

Domain Specific ◽

International Publishing

In machine learning, it is common to require a large number of instances to train a model for classification. In many cases, it is hard or expensive to acquire a large number of instances. In this paper, we propose a novel genetic programming (GP) based method for automatic image classification that adopts a one-shot learning approach. The proposed method relies on the combination of GP and Local Binary Patterns (LBP) techniques to detect a predefined number of informative regions that aim at maximising the between-class scatter and minimising the within-class scatter. Moreover, the proposed method uses only two instances of each class to evolve a classifier. To test the effectiveness of the proposed method, four different texture data sets are used and the performance is compared against two other GP-based methods, namely Conventional GP and Two-tier GP. The experiments revealed that the proposed method outperforms these two methods on all the data sets. Moreover, Naïve Bayes, Support Vector Machine, and Decision Trees (J48) methods achieve better performance with features extracted by the proposed method than with domain-specific or Two-tier GP extracted features. © Springer International Publishing 2013.
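
As background for the LBP component, the sketch below computes the basic 8-neighbour LBP code of a single 3x3 grey-level patch. It illustrates the general technique only, not the paper's GP-evolved region detector. The function name, the clockwise neighbour ordering, and the bit assignment are assumptions; other orderings are in common use and yield different code values for the same patch.

```python
def lbp_code(patch):
    """Basic Local Binary Pattern code of a 3x3 patch (list of 3 rows).

    Each neighbour is compared with the centre pixel; a neighbour that is
    greater than or equal to the centre sets one bit of an 8-bit code.
    """
    center = patch[1][1]
    # Neighbours visited clockwise from the top-left pixel (assumed ordering);
    # the first neighbour maps to the least significant bit.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code

example_patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
example_code = lbp_code(example_patch)  # bits 0, 4, 5, 6, 7 set -> 241
```

A histogram of such codes over an image region is the usual texture descriptor. In the method described above, GP selects which regions to summarise.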

Identification of DNA-binding proteins via Hypergraph based Laplacian Support Vector Machine

Current Bioinformatics ◽

10.2174/1574893616666210806091922 ◽

2021 ◽

Vol 16 ◽

Author(s):

Yuqing Qian ◽

Hao Meng ◽

Weizhong Lu ◽

Zhijun Liao ◽

Yijie Ding ◽

...

Keyword(s):

Machine Learning ◽

Dna Binding ◽

Large Scale ◽

Binding Proteins ◽

Predictive Accuracy ◽

Dna Binding Proteins ◽

Research Field ◽

Support Vector ◽

Data Sets ◽

Independent Test

Background: The identification of DNA-binding proteins (DBP) is an important research field. Experiment-based methods for detecting DBP are time-consuming and labor-intensive. Objective: Several machine learning methods have been proposed to solve the problem of large-scale DBP identification. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBP. Methods: In our study, we extract six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) is employed to train the predictive model. Our method is tested on the PDB186, PDB1075, PDB2272 and PDB14189 data sets. Result: Compared with other methods, our model achieves the best results on benchmark data sets. Conclusion: Accuracies of 87.1% and 74.2% are achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.

Parallel Tuning of Support Vector Machine Learning Parameters for Large and Unbalanced Data Sets

Lecture Notes in Computer Science - Computational Life Sciences ◽

10.1007/11560500_23 ◽

2005 ◽

pp. 253-264 ◽

Cited By ~ 5

Author(s):

Tatjana Eitrich ◽

Bruno Lang

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Support Vector ◽

Unbalanced Data ◽

Data Sets

Hamming Distance based Clustering Algorithm

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2012010102 ◽

2012 ◽

Vol 2 (1) ◽

pp. 11-20 ◽

Cited By ~ 3

Author(s):

Ritu Vijay ◽

Prerna Mahajan ◽

Rekha Kandwal

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Hamming Distance ◽

Promising Result ◽

Clustering Algorithms ◽

Distribution Patterns ◽

Mixed Data ◽

Binary Representation ◽

Data Sets ◽

Performance Study

Cluster analysis has been extensively used in machine learning and data mining to discover distribution patterns in the data. Clustering algorithms are generally based on a distance metric in order to partition the data into small groups such that data instances in the same group are more similar than the instances belonging to different groups. In this paper the authors have extended the concept of Hamming distance to categorical data. As a data processing step they have transformed the data into a binary representation. The authors have used the proposed algorithm to group data points into clusters. The experiments are carried out on data sets from the UCI machine learning repository to analyze the algorithm's performance. They conclude by stating that the proposed algorithm shows promising results and can be extended to handle numeric as well as mixed data.
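
The two ingredients named above, a binary representation of categorical records and the Hamming distance between them, can be sketched as follows. The abstract does not say which binary encoding the authors use, so one-hot encoding is an assumption here, and the function names and sample records are hypothetical.

```python
def one_hot_encode(records):
    """Encode categorical records as bit lists, one bit per (attribute, category).

    One-hot encoding is an assumed choice; the paper's exact binary
    representation is not given in the abstract.
    """
    n_attrs = len(records[0])
    categories = [sorted({r[j] for r in records}) for j in range(n_attrs)]
    encoded = []
    for r in records:
        bits = []
        for j, value in enumerate(r):
            bits.extend(1 if value == c else 0 for c in categories[j])
        encoded.append(bits)
    return encoded

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

records = [("red", "small"), ("red", "large"), ("blue", "large")]
encoded = one_hot_encode(records)
d01 = hamming(encoded[0], encoded[1])  # differ in size only  -> 2 bits
d02 = hamming(encoded[0], encoded[2])  # differ in both attrs -> 4 bits
```

With one-hot encoding, each attribute on which two records disagree contributes exactly two differing bits, so the distance is twice the number of mismatched attributes. A clustering step would then group records whose pairwise distances are small.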

Comparison of the Validity and Generalizability of Machine Learning Algorithms for the Prediction of Energy Expenditure: Validation Study

JMIR mhealth and uhealth ◽

10.2196/23938 ◽

2021 ◽

Vol 9 (8) ◽

pp. e23938

Author(s):

Ruairi O'Driscoll ◽

Jake Turicchi ◽

Mark Hopkins ◽

Cristiana Duarte ◽

Graham W Horgan ◽

...

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Energy Expenditure ◽

Superior Performance ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Out Of Sample

Background: Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise in research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective: This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods: Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age 31.9 years, weight 70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure using indirect calorimetry. Three regression algorithms were used to predict metabolic equivalents (METs; ie, random forest, gradient boosting, and neural networks), and five classification algorithms (ie, k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used for physical activity intensity classification as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results: The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boost applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification. Errors tended to increase in out-of-sample validations with the SenseWear neural network achieving RMSE values of 1.22 METs in the regression tasks and the SenseWear gradient boost and random forest achieving an accuracy of 80% in classification tasks. Conclusions: Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.
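
Leave-one-subject-out cross-validation holds out all samples of one participant per fold, so that no person appears in both the training and the test data. The sketch below shows that split together with an RMSE computation. To stay self-contained it scores a trivial mean-value predictor rather than the study's random forest, gradient boosting, or neural network models. The function names, subject IDs, and MET values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical samples: two measurements (in METs) for each of three subjects.
subjects = ["a", "a", "b", "b", "c", "c"]
mets = [1.0, 2.0, 3.0, 5.0, 2.0, 4.0]

fold_rmse = {}
for held_out, train_idx, test_idx in leave_one_subject_out(subjects):
    # Stand-in model: predict the mean of the training subjects' METs.
    baseline = sum(mets[i] for i in train_idx) / len(train_idx)
    fold_rmse[held_out] = rmse([mets[i] for i in test_idx], [baseline] * len(test_idx))
```

The out-of-sample validation described in the abstract is stricter still: a model is trained on one laboratory study and tested on the other, which is where the reported errors increase.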

Estimating the Total Organic Carbon for Unconventional Shale Resources During the Drilling Process: A Machine Learning Approach

Journal of Energy Resources Technology ◽

10.1115/1.4051737 ◽

2021 ◽

pp. 1-26

Author(s):

Ahmed Mahmoud ◽

Hany Gamal ◽

Salaheldin Elkatatny ◽

Ahmed Alsaihati

Keyword(s):

Machine Learning ◽

Organic Carbon ◽

Total Organic Carbon ◽

Gamma Ray ◽

Fuzzy Inference ◽

Support Vector ◽

Drilling Process ◽

Validation Data ◽

Inference System ◽

Data Points

Total organic carbon (TOC) is an essential parameter that indicates the quality of unconventional reservoirs. In this study, four machine learning (ML) algorithms, namely the adaptive neuro-fuzzy inference system (ANFIS), support vector regression (SVR), functional neural networks (FNN), and random forests (RF), were optimized to evaluate the TOC. The novelty of this work is that the optimized models predict the TOC from only the bulk gamma-ray (GR) log and the spectral GR logs of uranium, thorium, and potassium. The ML algorithms were trained on 749 datasets from Well-1, tested on 226 datasets from Well-2, and validated on 73 data points from Well-3. The predictability of the optimized algorithms was also compared with the available equations. The results of this study indicated that the optimized ANFIS, SVR, and RF models outperformed the available empirical equations in predicting the TOC. For the validation data of Well-3, the optimized ANFIS, SVR, and RF algorithms predicted the TOC with average absolute percentage errors (AAPEs) of 10.6%, 12.0%, and 8.9%, respectively, compared with an AAPE of 21.1% when the FNN model was used. For the same data, the TOC was assessed with AAPEs of 48.6%, 24.6%, 20.2%, and 17.8% when the Schmoker model, the ΔlogR method, the Zhao et al. correlation, and the Mahmoud et al. correlation were used, respectively. The optimized models could be applied to estimate the TOC during the drilling process if the drillstring is provided with GR and spectral GR logging tools.
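
The AAPE figures quoted above follow the usual definition: the mean of |actual - predicted| / |actual|, expressed as a percentage. A small sketch, with a hypothetical function name and made-up TOC values, shows the arithmetic.

```python
def aape(actual, predicted):
    """Average absolute percentage error, in percent.

    Assumes every actual value is non-zero, since each error is divided
    by the corresponding actual value.
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

# Hypothetical TOC values in wt%; not data from the study.
toc_measured = [2.0, 4.0]
toc_predicted = [1.0, 5.0]
example_aape = aape(toc_measured, toc_predicted)  # (50% + 25%) / 2 = 37.5
```

Because each error is scaled by the measured value, AAPE weights misses on low-TOC samples more heavily than an absolute error metric would.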

Selecting optimum log measurements for hydraulic fracturing

Interpretation ◽

10.1190/int-2015-0129.1 ◽

2016 ◽

Vol 4 (2) ◽

pp. SF125-SF135

Author(s):

Mehdi E. Far ◽

John A. Quirein ◽

Natasa Mekic

Keyword(s):

Hydraulic Fracturing ◽

Gamma Ray ◽

Correlation Coefficients ◽

Effective Porosity ◽

Eagle Ford ◽

Industry Standard ◽

Neutron Porosity ◽

Using Data ◽

Mean Square Errors ◽

Selection Of

We have developed a statistical method for investigating the importance of different log measurements for picking the best zones for hydraulic fracturing. We have determined the method’s applicability using data from unconventional reservoirs (Eagle Ford, Haynesville, Barnett, and a reservoir from the Middle East). The analysis began with single log measurements (e.g., gamma ray [GR], compressional and shear sonic [DTC and DTS], and spectral gamma ray [SGR], which could measure the radioactivity of uranium [U], potassium [K], and thorium [Th]). Other types of measurements, including density (RhoB), neutron porosity (NPHI), and resistivity, were added to obtain more complex logging suites. These log measurements were the inputs for this analysis. Each input combination was referred to as a “scenario.” Parameters such as effective porosity (PhiE), brittleness, total organic carbon (TOC), production index (PI), and fracture index (FI) were referred to as the outputs for the analysis. We have investigated linear and nonlinear combinations of the inputs to predict the outputs. Various scenarios, beginning with the simplest cases and ending with the most complete combination, were tested. The selection of log combinations was either based on the importance of individual logs or on industry-standard combinations (such as triple and quad combos). For each scenario, we computed correlation coefficients and root-mean-square errors of predicting the output parameters. The prediction accuracies generally increased as a result of increasing the number of input logs. Our analysis clearly found the importance of using SGR (for PI and FI prediction) and resistivity (for TOC prediction) logs. Based on comparison of the reconstruction results, actual values, and correlation coefficients/errors, we ranked the log combinations for predicting/modeling a specific parameter. The most challenging properties to model included TOC, PhiE, PI, and FI; the easiest properties to predict were brittleness and Young’s modulus.
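
The ranking step can be illustrated by scoring each scenario's predictions against the actual values with a Pearson correlation coefficient and sorting the scenarios by that score. The scenario names, the numbers, and the use of Pearson's r specifically are assumptions made for this sketch; the abstract says only that correlation coefficients and root-mean-square errors were computed.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical actual values of one output and two scenarios' predictions.
actual = [1.0, 2.0, 3.0, 4.0]
scenario_predictions = {
    "GR only": [2.0, 1.0, 4.0, 3.0],
    "GR + RhoB + NPHI": [1.1, 2.1, 2.9, 3.9],
}

# Rank scenarios from highest to lowest correlation with the actual values.
ranking = sorted(
    ((name, pearson_r(actual, pred)) for name, pred in scenario_predictions.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

In this toy case the richer logging suite ranks first, mirroring the abstract's observation that accuracy generally improved as more input logs were added.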

Assessment of machine-learning techniques on large pathology data sets to address assay redundancy in routine liver function test profiles

Diagnosis ◽

10.1515/dx-2014-0063 ◽

2015 ◽

Vol 2 (1) ◽

pp. 41-51 ◽

Cited By ~ 6

Author(s):

Brett A. Lidbury ◽

Alice M. Richardson ◽

Tony Badrick

Keyword(s):

Machine Learning ◽

Liver Function ◽

Decision Trees ◽

Clinical Chemistry ◽

Laboratory Data ◽

Support Vector ◽

Specific Reference ◽

Data Sets ◽

Chemistry Data ◽

Glutamyl Transferase

Routine liver function tests (LFTs) are central to serum testing profiles, particularly in community medicine. However, there is concern about the redundancy of information provided to requesting clinicians. Large quantities of clinical laboratory data and advances in computational knowledge discovery methods provide opportunities to re-examine the value of individual routine laboratory results that combine for LFT profiles. The machine learning methods recursive partitioning (decision trees) and support vector machines (SVMs) were applied to aggregate clinical chemistry data that included elevated LFT profiles. Response categories for γ-glutamyl transferase (GGT) were established based on whether the patient results were within or above the sex-specific reference interval. Single decision tree and SVMs were applied to test the accuracy of GGT prediction by the highest ranked predictors of GGT response, alkaline phosphatase (ALP) and alanine amino-transaminase (ALT). Through interrogating more than 20,000 individual cases comprising both sexes and all ages, decision trees predicted GGT category at 90% accuracy using only ALP and ALT, with a SVM prediction accuracy of 82.6% after 10-fold training and testing. Bilirubin, lactate dehydrogenase (LD) and albumin did not enhance prediction, or reduced accuracy. Comparison of abnormal (elevated) GGT categories also supported the primacy of ALP and ALT as screening markers, with serum urate and cholesterol also useful. Machine-learning interrogation of massive clinical chemistry data sets demonstrated a strategy to address redundancy in routine LFT screening by identifying ALT and ALP in tandem as able to accurately predict GGT elevation, suggesting that GGT can be removed from routine LFT screening.
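
Recursive partitioning builds a tree by repeatedly choosing the predictor threshold that best separates the response categories. The sketch below performs a single such split (a decision stump) on one predictor. It is a simplified illustration, not the study's tree: the function name, the ALP values, and the labels are hypothetical, and plain classification accuracy is used as the split criterion for brevity, whereas tree software typically uses an impurity measure.

```python
def fit_stump(values, labels):
    """Find the threshold t maximising accuracy of the rule: value > t -> 1.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (best_threshold, accuracy_at_that_threshold).
    """
    candidates = sorted(set(values))
    best_t, best_correct = None, -1
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        correct = sum((v > t) == bool(label) for v, label in zip(values, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / len(values)

# Hypothetical ALP results (U/L) and GGT category (1 = above reference interval).
alp = [60, 75, 90, 130, 150, 200]
ggt_elevated = [0, 0, 0, 1, 1, 1]
threshold, accuracy = fit_stump(alp, ggt_elevated)
```

A full tree would apply the same search recursively within each resulting subgroup, for example splitting next on ALT.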

Prediction of venous thromboembolism with machine learning techniques in young-middle-aged inpatients

Scientific Reports ◽

10.1038/s41598-021-92287-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hua Liu ◽

Hua Yuan ◽

Yongmei Wang ◽

Weiwei Huang ◽

Hui Xue ◽

...

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Support Vector Machine Model ◽

Adverse Outcomes ◽

Machine Learning Techniques ◽

Support Vector ◽

Data Sets ◽

Middle Aged ◽

Machine Model ◽

Learning Techniques

Accumulating studies appear to suggest that the risk factors for venous thromboembolism (VTE) among young-middle-aged inpatients are different from those among elderly people. Therefore, the current prediction models for VTE are not applicable to young-middle-aged inpatients. The aim of this study was to develop and externally validate a new prediction model for young-middle-aged people using machine learning methods. The clinical data sets linked with 167 inpatients with deep venous thrombosis (DVT) and/or pulmonary embolism (PE) and 406 patients without DVT or PE were compared and analysed with machine learning techniques. Five algorithms, including logistic regression, decision tree, feed-forward neural network, support vector machine, and random forest, were used for training and preparing the models. The support vector machine model had the best performance, with AUC values of 0.806–0.944 for 95% CI, 59% sensitivity and 99% specificity, and an accuracy of 87%. Although different top predictors of adverse outcomes appeared in the different models, life-threatening illness, fibrinogen, RBCs, and PT appeared to be more consistently featured by the different models as top predictors of adverse outcomes. Clinical data sets of young and middle-aged inpatients can be used to accurately predict the risk of VTE with a support vector machine model.
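
Sensitivity, specificity, and accuracy are tied together by the class balance: accuracy is the prevalence-weighted average of the two rates. The sketch below checks the reported figures against each other. It assumes the evaluation set has the same 167:406 case mix as the full cohort, which the abstract does not state. The function name is hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by sensitivity, specificity, and class counts."""
    true_positives = sensitivity * n_pos   # expected correctly flagged cases
    true_negatives = specificity * n_neg   # expected correctly cleared cases
    return (true_positives + true_negatives) / (n_pos + n_neg)

# Reported rates, with class counts assumed equal to the full cohort's.
implied_accuracy = accuracy_from_rates(0.59, 0.99, n_pos=167, n_neg=406)
```

Under that assumption the implied accuracy is about 0.87, consistent with the reported 87%. It also shows that the high accuracy is driven largely by the majority non-VTE class, while roughly four in ten VTE cases would be missed at 59% sensitivity.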

THE USE OF MACHINE LEARNING METHODS FOR BINARY CLASSIFICATION OF THE WORKING CONDITION OF BEARINGS USING THE SIGNALS OF VIBRATION ACCELERATION

Bulletin of National Technical University KhPI Series System Analysis Control and Information Technologies ◽

10.20998/2079-0023.2021.02.03 ◽

2021 ◽

pp. 15-22

Author(s):

Ruslan Babudzhan ◽

Konstantyn Isaienkov ◽

Danilo Krasiy ◽

Oleksii Vodka ◽

Ivan Zadorozhny ◽

...

Keyword(s):

Machine Learning ◽

Binary Classification ◽

Fractal Dimensions ◽

Feature Space ◽

Training Data ◽

Supervised Machine Learning ◽

Support Vector ◽

Data Sets ◽

Vibration Acceleration ◽

K Nearest Neighbors

The paper investigates the relationship between the vibration acceleration of bearings and their operational state. To determine these dependencies, a testbench was built and 112 experiments were carried out with different bearings: 100 bearings that developed an internal defect during operation and 12 bearings without a defect. From the obtained records, a dataset was formed, which was used to build classifiers. The dataset is freely available. A method for classifying new and used bearings was proposed, which consists in searching for dependencies and regularities of the signal using descriptive functions: statistical, entropy, fractal dimensions and others. In addition to processing the signal itself, the frequency domain of the bearing operation signal was also used to complement the feature space. The paper considered the possibility of generalizing the classification for its application to signals that were not obtained in the course of laboratory experiments. An extraneous dataset was found in the public domain. This dataset was used to determine how accurate a classifier was when it was trained and tested on significantly different signals. Training and validation were carried out using the bootstrapping method to reduce the effect of randomness, given the small amount of training data available. To estimate the quality of the classifiers, the F1-measure was used as the main metric due to the imbalance of the data sets. The following supervised machine learning methods were chosen as classifier models: logistic regression, support vector machine, random forest, and K nearest neighbors. The results are presented in the form of plots of density distribution and diagrams.
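
The choice of the F1-measure over accuracy can be illustrated with the class counts given above (100 defective, 12 healthy). The sketch below scores a degenerate classifier that always predicts "defect" and also shows one common form of bootstrap split, in which the samples never drawn form the test set. The function names, the 0/1 label coding, the choice of the healthy class as the positive class, and the out-of-bag test set are all assumptions, since the abstract does not specify them.

```python
import random

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_split(n, seed):
    """Draw n training indices with replacement; unused indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

# Assumed label coding: 1 = developed a defect, 0 = healthy.
labels = [1] * 100 + [0] * 12
always_defect = [1] * len(labels)

baseline_accuracy = sum(t == p for t, p in zip(labels, always_defect)) / len(labels)
f1_healthy = f1_score(labels, always_defect, positive=0)
train_idx, test_idx = bootstrap_split(len(labels), seed=0)
```

The always-defect baseline reaches about 89% accuracy yet scores an F1 of zero on the healthy class, which is why accuracy alone would be misleading on data this imbalanced. Repeating `bootstrap_split` with different seeds yields a distribution of scores, consistent with the density plots the paper reports.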
