Case Study ROP Modeling Using Random Forest Regression and Gradient Boosting in the Hanover Region in Germany

Abstract Reduction of drilling costs in the oil and gas industry and the geothermal energy sector is the main driver for major investments in drilling optimization research. The best way to reduce drilling costs is to minimize the overall time needed for drilling a well. This can be accomplished by optimizing the non-productive time during an operation, and through increasing the rate of penetration (ROP) while actively drilling. ROP has already been modeled in the past using empirical correlations. However, nowadays, methods from data science can be applied to the large data sets obtained during drilling operations, both for real-time prediction of drilling performance and for analysis of historical data sets during the evaluation of previous drilling activities. In the current study, data from a geothermal well in the Hanover region in Lower Saxony (Germany) were used to train machine learning models using Random Forest™ regression and Gradient Boosting. Both techniques showed promising results for modeling ROP.

Download Full-text

Boosting Algorithm Choice in Predictive Machine Learning Models for Fracturing Applications

10.2118/205642-ms ◽

2021 ◽

Author(s):

Abdul Muqtadir Khan

Keyword(s):

Machine Learning ◽

Data Science ◽

Oil And Gas ◽

Oil And Gas Industry ◽

Injection Rate ◽

Model Construction ◽

Gradient Boosting ◽

Light Gradient ◽

Fracture Damage ◽

Boosting Technique

Abstract With the advancement in machine learning (ML) applications, some recent research has been conducted to optimize fracturing treatments. There are a variety of models available using various objective functions for optimization and different mathematical techniques. There is a need to extend the ML techniques to optimize the choice of algorithm. For fracturing treatment design, the literature for comparative algorithm performance is sparse. The research predominantly shows that compared to the most commonly used regressors and classifiers, some sort of boosting technique consistently outperforms on model testing and prediction accuracy. A database was constructed for a heterogeneous reservoir. Four widely used boosting algorithms were used on the database to predict the design only from the output of a short injection/falloff test. Feature importance analysis was done on eight output parameters from the falloff analysis, and six were finalized for the model construction. The outputs selected for prediction were fracturing fluid efficiency, proppant mass, maximum proppant concentration, and injection rate. Extreme gradient boost (XGBoost), categorical boost (CatBoost), adaptive boost (AdaBoost), and light gradient boosting machine (LGBM) were the algorithms finalized for the comparative study. The sensitivity was done for a different number of classes (four, five, and six) to establish a balance between accuracy and prediction granularity. The results showed that the best algorithm choice was between XGBoost and CatBoost for the predicted parameters under certain model construction conditions. The accuracy for all outputs for the holdout sets varied between 80 and 92%, showing robust significance for a wider utilization of these models. Data science has contributed to various oil and gas industry domains and has tremendous applications in the stimulation domain. The research and review conducted in this paper add a valuable resource for the user to build digital databases and use the appropriate algorithm without much trial and error. Implementing this model reduced the complexity of the proppant fracturing treatment redesign process, enhanced operational efficiency, and reduced fracture damage by eliminating minifrac steps with crosslinked gel.

Download Full-text

Challenges and Opportunities of Building Fast GBDT Systems

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/632 ◽

2021 ◽

Author(s):

Zeyi Wen ◽

Qinbin Li ◽

Bingsheng He ◽

Bin Cui

Keyword(s):

Data Science ◽

Cluster Computing ◽

Online Advertising ◽

Large Data ◽

Large Data Sets ◽

Gradient Boosting ◽

Data Sets ◽

Survey Paper ◽

Advantages And Disadvantages ◽

Challenges And Opportunities

In the last few years, Gradient Boosting Decision Trees (GBDTs) have been widely used in various applications such as online advertising and spam filtering. However, GBDT training is often a key performance bottleneck for such data science pipelines, especially for training a large number of deep trees on large data sets. Thus, many parallel and distributed GBDT systems have been researched and developed to accelerate the training process. In this survey paper, we review the recent GBDT systems with respect to accelerations with emerging hardware as well as cluster computing, and compare the advantages and disadvantages of the existing implementations. Finally, we present the research opportunities and challenges in designing fast next generation GBDT systems.

Download Full-text

Data Science Applied to Pedagogical Methodologies Focused on Changing the Negative Perception of the Oil and Gas Industry in Colombia

10.2118/201582-ms ◽

2020 ◽

Author(s):

Israel Guevara ◽

David Ardila ◽

Kevin Daza ◽

Oscar Ovalle ◽

Paola Pastor ◽

...

Keyword(s):

Data Science ◽

Oil And Gas ◽

Oil And Gas Industry ◽

Gas Industry ◽

Negative Perception

Download Full-text

Machine Learning and Data Science in the Oil and Gas Industry

10.1016/c2019-0-02033-x ◽

2021 ◽

Keyword(s):

Machine Learning ◽

Data Science ◽

Oil And Gas ◽

Oil And Gas Industry ◽

Gas Industry

Download Full-text

A Detailed Study on Classification Algorithms in Big Data

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch002 ◽

2020 ◽

pp. 30-46

Author(s):

Saranya N. ◽

Saravana Selvam

Keyword(s):

Big Data ◽

Random Forest ◽

Linear Regression ◽

Comprehensive Evaluation ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Classification Methods ◽

Computing Science ◽

Data Collections

After an era of managing data collection difficulties, these days the issue has turned into the problem of how to process these vast amounts of information. Scientists, as well as researchers, think that today, probably the most essential topic in computing science is Big Data. Big Data is used to clarify the huge volume of data that could exist in any structure. This makes it difficult for standard controlling approaches for mining the best possible data through such large data sets. Classification in Big Data is a procedure of summing up data sets dependent on various examples. There are distinctive classification frameworks which help us to classify data collections. A few methods that discussed in the chapter are Multi-Layer Perception Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of classification methods that are in effect commonly utilized.

Download Full-text

Data Science and Designing for Privacy

Techné: Research in Philosophy and Technology ◽

10.5840/techne201632446 ◽

2016 ◽

Vol 20 (1) ◽

pp. 51-68

Author(s):

Michael Falgoust ◽

Keyword(s):

Data Science ◽

Personal Information ◽

Information Age ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Informational Privacy ◽

Design Constraint ◽

Classical Models ◽

Privacy Problem

Unprecedented advances in the ability to store, analyze, and retrieve data is the hallmark of the information age. Along with enhanced capability to identify meaningful patterns in large data sets, contemporary data science renders many classical models of privacy protection ineffective. Addressing these issues through privacy-sensitive design is insufficient because advanced data science is mutually exclusive with preserving privacy. The special privacy problem posed by data analysis has so far escaped even leading accounts of informational privacy. Here, I argue that accounts of privacy must include norms about information processing in addition to norms about information flow. Ultimately, users need the resources to control how and when personal information is processed and the knowledge to make information decisions about that control. While privacy is an insufficient design constraint, value-sensitive design around control and transparency can support privacy in the information age.

Download Full-text

Adding value to acoustic data sets in the offshore oil and gas industry by innovative processing and data integration techniques

2015 IEEE/OES Acoustics in Underwater Geosciences Symposium (RIO Acoustics) ◽

10.1109/rioacoustics.2015.7473612 ◽

2015 ◽

Author(s):

Miguel Redusino ◽

Fabio Mayo Belligotti ◽

Heitor Augusto Tozzi ◽

Graeme Jaques

Keyword(s):

Data Integration ◽

Oil And Gas ◽

Oil And Gas Industry ◽

Data Sets ◽

Gas Industry ◽

Acoustic Data ◽

Integration Techniques ◽

Offshore Oil And Gas ◽

Offshore Oil

Download Full-text

Visualizing Denmark's Cultural Heritage

10.31219/osf.io/bt4yj ◽

2020 ◽

Author(s):

Stefan Jänicke

Keyword(s):

Cultural Heritage ◽

Data Science ◽

Large Data ◽

Large Data Sets ◽

Original Version ◽

Visual Design ◽

Data Sets

Visualization as a method to reveal patterns in large data sets is a powerful tool to build bridges between data science and other research disciplines. The value of visual design is documented with a showcase on the Dansk biografisk Lexikon. The original version of this article was published in the 2020 November issue of Aktuel Naturvidenskab.

Download Full-text

Comparison of the Validity and Generalizability of Machine Learning Algorithms for the Prediction of Energy Expenditure: Validation Study

JMIR mhealth and uhealth ◽

10.2196/23938 ◽

2021 ◽

Vol 9 (8) ◽

pp. e23938

Author(s):

Ruairi O'Driscoll ◽

Jake Turicchi ◽

Mark Hopkins ◽

Cristiana Duarte ◽

Graham W Horgan ◽

...

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Energy Expenditure ◽

Superior Performance ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Out Of Sample

Background Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise in research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age=31.9 years, weight=70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure using indirect calorimetry. Three regression algorithms were used to predict metabolic equivalents (METs; ie, random forest, gradient boosting, and neural networks), and five classification algorithms (ie, k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used for physical activity intensity classification as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boost applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification. Errors tended to increase in out-of-sample validations with the SenseWear neural network achieving RMSE values of 1.22 METs in the regression tasks and the SenseWear gradient boost and random forest achieving an accuracy of 80% in classification tasks. Conclusions Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.

Download Full-text

A Comparative Analysis of Machine/Deep Learning Models for Parking Space Availability Prediction

Sensors ◽

10.3390/s20010322 ◽

2020 ◽

Vol 20 (1) ◽

pp. 322 ◽

Cited By ~ 9

Author(s):

Faraz Malik Awan ◽

Yasir Saleem ◽

Roberto Minerva ◽

Noel Crespi

Keyword(s):

Deep Learning ◽

Comparative Analysis ◽

Random Forest ◽

Decision Tree ◽

Multilayer Perceptron ◽

Large Data ◽

Data Sets ◽

Application Domain ◽

Parking Space ◽

Data Set

Machine/Deep Learning (ML/DL) techniques have been applied to large data sets in order to extract relevant information and for making predictions. The performance and the outcomes of different ML/DL algorithms may vary depending upon the data sets being used, as well as on the suitability of algorithms to the data and the application domain under consideration. Hence, determining which ML/DL algorithm is most suitable for a specific application domain and its related data sets would be a key advantage. To respond to this need, a comparative analysis of well-known ML/DL techniques, including Multilayer Perceptron, K-Nearest Neighbors, Decision Tree, Random Forest, and Voting Classifier (or the Ensemble Learning Approach) for the prediction of parking space availability has been conducted. This comparison utilized Santander’s parking data set, initiated while working on the H2020 WISE-IoT project. The data set was used in order to evaluate the considered algorithms and to determine the one offering the best prediction. The results of this analysis show that, regardless of the data set size, the less complex algorithms like Decision Tree, Random Forest, and KNN outperform complex algorithms such as Multilayer Perceptron, in terms of higher prediction accuracy, while providing comparable information for the prediction of parking space availability. In addition, in this paper, we are providing Top-K parking space recommendations on the basis of distance between current position of vehicles and free parking spots.

Download Full-text