A Data-Analytics Tutorial: Building Predictive Models for Oil Production in an Unconventional Shale Reservoir

SPE Journal ◽  
2018 ◽  
Vol 23 (04) ◽  
pp. 1075-1089 ◽  
Author(s):  
Jared Schuetter ◽  
Srikanta Mishra ◽  
Ming Zhong ◽  
Randy LaFollette (ret.)

Summary Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is to provide some clarity to this issue, focusing on how to build robust predictive models and how to develop decision rules that help identify factors separating good wells from poor performers. The data for this study come from wells completed in the Wolfcamp Shale Formation in the Permian Basin. Data categories used in the study included well location and assorted metrics capturing various aspects of well architecture, well completion, stimulation, and production. Predictive models for the production metric of interest are built using simple regression and other advanced methods such as random forests (RFs), support-vector regression (SVR), gradient-boosting machine (GBM), and multidimensional Kriging. The data-fitting process involves splitting the data into a training set and a test set, building a regression model on the training set and validating it with the test set. Repeated application of a “cross-validation” procedure yields valuable information regarding the robustness of each regression-modeling approach. Furthermore, decision rules that can identify extreme behavior in production wells (i.e., top x% of the wells vs. bottom x%, as ranked by the production metric) are generated using the classification and regression-tree algorithm. The resulting decision tree (DT) provides useful insights regarding what variables (or combinations of variables) can drive production performance into such extreme categories. The main contributions of this paper are to provide guidelines on how to build robust predictive models, and to demonstrate the utility of DTs for identifying factors responsible for good vs. poor wells.
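For readers who want to reproduce the train/test-split plus repeated cross-validation pattern this abstract describes, the sketch below is a minimal illustration using scikit-learn. The synthetic regression data and model settings are hypothetical stand-ins; the paper's Wolfcamp well attributes and tuning are not reproduced here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the well dataset (features ~ location, architecture,
# completion, stimulation; response ~ the production metric of interest).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "SVR": SVR(C=10.0, epsilon=0.1),
    "GBM": GradientBoostingRegressor(random_state=0),
}

# Repeated K-fold CV yields a distribution of scores per model; the spread
# of that distribution is what reveals the robustness of each approach.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```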

2021 ◽  
Vol 27 (4) ◽  
pp. 279-286
Author(s):  
Atakan Başkor ◽  
Yağmur Pirinçci Tok ◽  
Burcu Mesut ◽  
Yıldız Özsoy ◽  
Tamer Uçar

Objectives: Orally disintegrating tablets (ODTs) can be taken without any drinking water; this feature makes ODTs easy to use and suitable for specific groups of patients. Oral administration of drugs is the most commonly used route, and tablets constitute the most preferable pharmaceutical dosage form. However, the preparation of ODTs is costly and requires long trials, which creates obstacles for dosage trials. The aim of this study was to identify the most appropriate formulation of ODT dexketoprofen using machine learning (ML) models, with the goal of providing a cost-effective and time-reducing solution. Methods: This research utilized nonlinear regression models, including the k-nearest neighbors (k-NN), support vector regression (SVR), classification and regression tree (CART), bootstrap aggregating (bagging), random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost) methods, as well as the t-test, to predict the quantity of various components in the dexketoprofen formulation within fixed criteria. Results: All the models were developed with Python libraries. The performance of the ML models was evaluated with R2 values and the root mean square error (RMSE). The GBM algorithm gave the best results, with R2 and RMSE values of 0.99 and 2.88 for hardness, 0.92 and 0.02 for friability, and 0.97 and 10.09 for disintegration time. Conclusions: In this study, we developed a computational approach to estimate the optimal pharmaceutical formulation of dexketoprofen. The results were evaluated by an expert and found to comply with Food and Drug Administration criteria.
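A minimal sketch of the evaluation pattern reported above, a GBM regressor scored with R2 and RMSE on a held-out test set, assuming scikit-learn. The synthetic data stand in for the dexketoprofen formulation components and one response (e.g., hardness); the real study predicts several responses this way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic placeholders for formulation inputs and a single response.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

gbm = GradientBoostingRegressor(random_state=1).fit(X_tr, y_tr)
pred = gbm.predict(X_te)
print(f"R2   = {r2_score(y_te, pred):.2f}")
print(f"RMSE = {mean_squared_error(y_te, pred) ** 0.5:.2f}")
```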


2021 ◽  
Author(s):  
Sohrat Baki ◽  
Cenk Temizel ◽  
Serkan Dursun

Abstract Unconventional reservoirs, mainly shale oil and natural gas, will continue to play a significant role in meeting the ever-growing energy demands of global markets. Being complex in nature and having ultra-tight producing zones, unconventional reservoirs depend on effective well completion and stimulation treatments to be successful and economical. Within the last decade, thousands of unconventional wells have been drilled, completed, and produced in North America. The scope of this work is to explore the primary impact of completion parameters such as lateral length, frac type, number of stages, and proppant and fluid volumes on the production performance of wells in unconventional fields. The key completion, stimulation, and production attributes of the wells were fed into a machine learning workflow for building predictive models. Predictive models based on neural networks, support vector machines, or decision-tree-based ensemble models serve as mapping functions from completion parameters to production for each well in the field. The completion parameters were analyzed in the workflow with respect to feature engineering and interpretation, and this analysis yielded key performance indicators for the region. The optimum values of the best-performing completions were then identified for each well. The predictive models in the workflow were compared for accuracy, and the best model was used to understand the impact of completion parameters on production rates. This study outlines an overall machine learning workflow, from feature engineering to interpretation of the machine learning models, to quantify the effects of completion parameters on the production rates of wells in unconventional fields.
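The feature-interpretation step described above can be illustrated with a tree-ensemble importance ranking. This is a hedged sketch: the completion-parameter names and the synthetic production response are illustrative, not the authors' field data or their exact interpretation method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["lateral_length", "n_stages", "proppant_volume", "fluid_volume"]
X = pd.DataFrame(rng.uniform(0, 1, size=(500, len(features))), columns=features)
# Synthetic production response with a known dependence on two parameters.
y = 3 * X["lateral_length"] + 2 * X["proppant_volume"] + rng.normal(0, 0.1, 500)

# Impurity-based importances rank the completion parameters by influence.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:16s} {imp:.3f}")
```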


2019 ◽  
Vol 11 (1) ◽  
pp. 100 ◽  
Author(s):  
Dyah R. Panuju ◽  
David J. Paull ◽  
Bambang H. Trisasongko

This research aims to detect subtle changes by combining binary change analysis, using Iteratively Reweighted Multivariate Alteration Detection (IRMAD) over dual-polarimetric Advanced Land Observing Satellite (ALOS) backscatter, with augmented data for post-classification change analysis. The accuracy of change detection was evaluated iteratively over thresholds composed of the mean and a constant multiple of the standard deviation. Four datasets were examined for post-classification change analysis: the dual-polarimetric backscatter as the benchmark, and its augmentations with indices, entropy-alpha decomposition, and selected texture features. Variable importance was then evaluated to build a best-subset model employing seven classifiers: Bagged Classification and Regression Tree (CAB), Extreme Learning Machine Neural Network (ENN), Bagged Multivariate Adaptive Regression Spline (MAB), Regularised Random Forest (RFG), Original Random Forest (RFO), Support Vector Machine (SVM), and Extreme Gradient Boosting Tree (XGB). The best accuracy was 98.8%, obtained by thresholding MAD variate-2 with a constant of 1.7. The greatest improvement in classification accuracy was obtained by adding grey-level co-occurrence matrix (GLCM) texture features. The assessment of variable importance (VI) confirmed that the selected GLCM textures (mean and variance of HH or HV) were equally superior, while the contributions of the indices and the decomposition were negligible. The best model produced similar classification accuracy, about 90%, for both 2007 and 2010. Tree-based algorithms, including RFO, RFG, and XGB, were more robust than SVM and ENN. Subtle changes indicated by the binary change analysis were somewhat hidden in the post-classification analysis. Reclassification combining all important variables, with five classes added to capture subtle changes identified with the aid of Google Earth, yielded an accuracy of 82%.
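The thresholding rule reported above, flagging change where a MAD variate departs from its mean by more than a constant times the standard deviation, reduces to a few lines of NumPy. The simulated variate below is a placeholder for real IRMAD output.

```python
import numpy as np

rng = np.random.default_rng(42)
mad_variate = rng.normal(0.0, 1.0, size=(100, 100))  # stand-in for MAD variate-2

k = 1.7  # the constant reported as best in the study
mu, sigma = mad_variate.mean(), mad_variate.std()
# Pixels outside mean +/- k*std are flagged as change.
change_mask = np.abs(mad_variate - mu) > k * sigma

print(f"flagged {change_mask.mean():.1%} of pixels as change")
```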


2019 ◽  
Vol 147 ◽  
Author(s):  
Phani Krishna Kondeti ◽  
Kumar Ravi ◽  
Srinivasa Rao Mutheneni ◽  
Madhusudhan Rao Kadiri ◽  
Sriram Kumaraswamy ◽  
...  

Abstract Filariasis is one of the major public health concerns in India. Approximately 600 million people spread across 250 districts of India are at risk of filariasis. To predict this disease, a pilot-scale study was carried out in 30 villages of the Karimnagar district of Telangana from 2004 to 2007 to collect epidemiological and socio-economic data. The collected data were analysed by employing various machine learning techniques such as Naïve Bayes (NB), logistic model tree, probabilistic neural network, J48 (C4.5), classification and regression tree, JRip, and gradient boosting machine. The performances of these algorithms are reported using sensitivity, specificity, accuracy, and area under the ROC curve (AUC). Among all the employed classification methods, NB yielded the best AUC of 64%, and its performance was statistically comparable to that of the rest of the classifiers. In addition, the J48 algorithm generated 23 decision rules that can help in developing an early warning system to implement better prevention and control efforts in the management of filariasis.
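As an illustration of the evaluation pattern above, the sketch below fits Naïve Bayes and reads off the AUC, then extracts human-readable rules from a scikit-learn decision tree as a stand-in for J48/C4.5 (which has no mainstream Python implementation). Data are synthetic placeholders for the epidemiological and socio-economic predictors.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
print("NB AUC:", round(roc_auc_score(y_te, nb.predict_proba(X_te)[:, 1]), 3))

# A shallow tree yields readable if/then rules, analogous to the 23 J48 rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(export_text(tree))
```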


2021 ◽  
pp. 1-33
Author(s):  
Stéphane Loisel ◽  
Pierrick Piette ◽  
Cheng-Hsien Jason Tsai

Abstract Modeling policyholders’ lapse behaviors is important to a life insurer, since lapses affect pricing, reserving, profitability, liquidity, risk management, and the solvency of the insurer. In this paper, we apply two machine learning methods to lapse modeling. We then evaluate the performance of these two methods, along with two popular statistical methods, by means of statistical accuracy and a profitability measure. Moreover, we adopt an innovative point of view on the lapse prediction problem that comes from churn management: we transform the classification problem into a regression problem and then perform optimization, which is new to lapse risk management. We apply the aforementioned four methods to a large real-world insurance dataset. The results show that Extreme Gradient Boosting (XGBoost) and support vector machine outperform logistic regression (LR) and classification and regression tree with respect to statistical accuracy, while LR performs as well as XGBoost in terms of retention gains. This highlights the importance of a proper validation metric when comparing different methods. The optimization performed after the transformation brings significant and consistent increases in economic gains. The insurer should therefore optimize its economic objective to achieve optimal lapse management.
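The classification-to-regression transformation is described only at a high level above, so the following is one hedged reading of it, not the authors' exact formulation: regress on an economic target (lapse indicator times policy value) and rank policyholders by predicted loss for retention targeting. It assumes the xgboost package; all data are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor  # assumes the xgboost package is installed

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                    # policyholder features (synthetic)
lapse = (X[:, 0] + rng.normal(0, 1, n) > 1).astype(float)
value = np.exp(rng.normal(8, 0.5, n))          # policy value (synthetic)
target = lapse * value                         # economic loss if the policy lapses

X_tr, X_te, t_tr, t_te = train_test_split(X, target, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=3).fit(X_tr, t_tr)

# A retention action targets the top decile of predicted losses.
pred = model.predict(X_te)
top = np.argsort(pred)[-len(pred) // 10:]
print(f"share of at-risk value captured in top decile: {t_te[top].sum() / t_te.sum():.1%}")
```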


2019 ◽  
Vol 11 (8) ◽  
pp. 2306 ◽  
Author(s):  
Jian Han ◽  
Miaodan Fang ◽  
Shenglu Ye ◽  
Chuansheng Chen ◽  
Qun Wan ◽  
...  

Response rate has long been a major concern in survey research, which is commonly used in many fields such as marketing, psychology, sociology, and public policy. Based on 244 published survey studies on consumer satisfaction, loyalty, and trust, this study aimed to identify factors that predict response rates. Results showed that response rates were associated with the mode of data collection (face-to-face > mail/telephone > online), the type of survey sponsor (government agencies > universities/research institutions > commercial entities), confidentiality (confidential > non-confidential), direct invitation (yes > no), and cultural orientation (individualism > collectivism). A decision tree regression analysis (using the Classification and Regression Tree (C&RT) algorithm, with 80% of the studies as the training set and 20% as the test set) revealed that a model with all of the above-mentioned factors attained a linear correlation coefficient of 0.578 between predicted and actual values, higher than the corresponding coefficient of the traditional linear regression model (0.423). A decision tree analysis (using the C5.0 algorithm with the same 80/20 split) revealed that a model with all of the above-mentioned factors attained an overall accuracy of 78.26% in predicting whether a survey had a high (>50%) or low (<50%) response rate. Direct invitation was the most important factor in all three models and had a consistent trend in predicting response rate.
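The two tree analyses above can be sketched with scikit-learn trees standing in for C&RT and C5.0 (the latter is not available in Python's mainstream ML stack). The 80/20 split and the two evaluation metrics, predicted-versus-actual correlation and high/low classification accuracy, follow the abstract; the coded survey factors below are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(244, 5)).astype(float)     # 244 studies, 5 coded factors
rate = 0.3 + 0.1 * X[:, 0] + rng.normal(0, 0.05, 244)   # synthetic response rate

X_tr, X_te, y_tr, y_te = train_test_split(X, rate, test_size=0.2, random_state=0)

# Regression tree: correlate predicted with actual rates on the test set.
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
r = np.corrcoef(reg.predict(X_te), y_te)[0, 1]
print(f"predicted-vs-actual correlation: {r:.3f}")

# Classification tree: predict whether the rate is high (>50%) or low.
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr > 0.5)
print(f"high/low accuracy: {clf.score(X_te, y_te > 0.5):.2%}")
```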


2017 ◽  
Vol 08 (04) ◽  
pp. 1022-1030 ◽  
Author(s):  
Richard Boyce ◽  
Jeremy Jao ◽  
Taylor Miller ◽  
Sandra Kane-Gill

Objective To demonstrate the value of text mining for automatically identifying suspected bleeding adverse drug events (ADEs) in the emergency department (ED). Methods A corpus of ED admission notes was manually annotated for bleeding ADEs. The notes were taken from patients ≥ 65 years of age who had an ICD-9 code for bleeding, a hemoglobin value ≤ 8 g/dL, or a transfusion of > 2 units of packed red blood cells. This training corpus was used to develop bleeding-ADE algorithms using Random Forest and Classification and Regression Tree (CART). A completely separate set of notes was annotated and used to test the classification performance of the final models using the area under the ROC curve (AUROC). Results The best-performing CART achieved an AUROC of 0.882 on the training set and 0.827 on the test set. At a sensitivity of 0.679, the model had a specificity of 0.908 and a positive predictive value (PPV) of 0.814. It had a relatively simple and intuitive structure consisting of 13 decision nodes and 14 leaf nodes. Decision-path probabilities ranged from 0.041 to 1.0. The best-performing Random Forest achieved an AUROC of 0.917 on the training set and 0.859 on the test set. At a sensitivity of 0.274, the model had a specificity of 0.986 and a PPV of 0.92. Conclusion Both models accurately identify bleeding ADEs using the presence or absence of certain clinical concepts in ED admission notes for older adult patients. The CART model is particularly noteworthy because it does not require significant technical overhead to implement. Future work should seek to replicate these results on a larger test set drawn from another institution.
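A hedged sketch of the classification setup above: binary indicators for the presence or absence of clinical concepts in a note feed a CART model, and sensitivity, specificity, and PPV are read off alongside the AUROC. The concepts and labels below are synthetic, not the annotated ED corpus.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 20))  # concept present/absent per note
# Synthetic label: a bleeding ADE is likelier when two concepts co-occur.
y = ((X[:, 0] + X[:, 1] + rng.random(600)) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
cart = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

prob = cart.predict_proba(X_te)[:, 1]
print("AUROC:", round(roc_auc_score(y_te, prob), 3))
tn, fp, fn, tp = confusion_matrix(y_te, prob >= 0.5).ravel()
print(f"sensitivity={tp/(tp+fn):.3f} specificity={tn/(tn+fp):.3f} PPV={tp/(tp+fp):.3f}")
```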


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large dataset using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors were collected for each compound, and an efficient descriptor-selection method was used to find the appropriate subset. Four different methods, multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT), and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model was the best among these methods, with R2 = 0.84 and MSE = 0.55 on the training set and R2 = 0.83 and MSE = 0.56 on the test set. Several validation methods, such as the y-randomization test and applicability-domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
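The pipeline above, descriptor selection followed by an SVM regressor and a y-randomization check, can be sketched as follows with scikit-learn. The random descriptors and the univariate selector are assumptions for illustration; the paper's own selection method is not specified here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Random "descriptors"; only 20 of the 200 carry signal.
X, y = make_regression(n_samples=500, n_features=200, n_informative=20,
                       noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()  # scale the target so SVR's loss is well-conditioned
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Descriptor selection (univariate, as an assumed stand-in) -> scaling -> SVM.
model = make_pipeline(SelectKBest(f_regression, k=30), StandardScaler(), SVR(C=10.0))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"test R2={r2_score(y_te, pred):.2f}  MSE={mean_squared_error(y_te, pred):.2f}")

# y-randomization: a sound model should collapse when targets are shuffled.
rng = np.random.default_rng(0)
model.fit(X_tr, rng.permutation(y_tr))
print(f"shuffled-target R2={r2_score(y_te, model.predict(X_te)):.2f}")
```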


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Hengrui Chen ◽  
Hong Chen ◽  
Ruiyu Zhou ◽  
Zhizhen Liu ◽  
Xiaoke Sun

The safety issue has become a critical obstacle that cannot be ignored in the marketization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the causal relationships among multiple factors and thereby the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze crash severity. In addition, we apply Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both the XGBoost and Apriori algorithms provided meaningful insights into the characteristics of AV-involved crashes and their relationships. Among all the features, vehicle damage, weather conditions, accident location, and driving mode are the most critical. We found that most rear-end crashes involve conventional vehicles striking the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light, and should be careful when driving near intersections, especially in autonomous driving mode.
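The severity model plus SHAP attribution can be illustrated as below, assuming the xgboost and shap packages. The six synthetic features stand in for attributes such as vehicle damage, weather, location, and driving mode.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Six synthetic features; imbalanced classes mimic rare severe crashes.
X, y = make_classification(n_samples=500, n_features=6, weights=[0.8], random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# The mean absolute SHAP value per feature approximates global importance.
importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(importance)[::-1]:
    print(f"feature_{i}: {importance[i]:.3f}")
```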


2021 ◽  
Vol 35 (3) ◽  
pp. 209-215
Author(s):  
Pratibha Verma ◽  
Vineet Kumar Awasthi ◽  
Sanat Kumar Sahu

Data mining techniques are combined with ensemble learning and deep learning for classification. The methods used for classification are the Single C5.0 Tree (C5.0), Classification and Regression Tree (CART), kernel-based Support Vector Machine (SVM) with a linear kernel, an ensemble (CART, SVM, C5.0), a single-hidden-layer neural network (NN), Neural Networks with Principal Component Analysis (PCA-NN), the deep learning-based H2OBinomialModel-Deeplearning (HBM-DNN), and an Enhanced H2OBinomialModel-Deeplearning (EHBM-DNN). In this study, experiments were conducted on pre-processed datasets using R programming and the 10-fold cross-validation technique. The findings show that the ensemble model (CART, SVM, and C5.0) and EHBM-DNN are more accurate for classification than the other methods.
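The study above used R; for consistency with the other sketches in this listing, here is an equivalent scikit-learn illustration of an ensemble classifier evaluated with 10-fold cross-validation. The two base learners and the placeholder data are illustrative, not the authors' setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Soft voting averages the base learners' predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("cart", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(kernel="linear", probability=True, random_state=0)),
    ],
    voting="soft",
)
scores = cross_val_score(ensemble, X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```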

