Automatic Identification of Rock Formation Type While Drilling Using Machine Learning Based Data-Driven Models

Abstract: The objective of this study is to present a novel rock formation identification model using a data-driven modeling approach. This study explores the use of real-time drilling data to train and validate a classification model to improve the efficiency of the drilling process by reducing Mechanical Specific Energy (MSE). In this study, we demonstrate the feasibility of a layer-based determination and change detection of the properties of the rock formation currently being drilled, as accurately and quickly as possible. Data for this study was collected from a custom-built lab-scale drilling rig equipped with multiple sensors. The experiment was conducted by drilling through an arrangement of different rock formations of varying rock strength properties. Data was recorded and stored at a frequency of 2 kHz, then filtered, processed, and downsampled to extract relevant features. This dataset was used to train an Artificial Neural Network and other machine learning classification algorithms. Two feature sets were used: the first with the ten most notable features found by Random Forest, and the second with derived measurements and down-sampled dynamic features from the sensors. The classification analysis was divided into two steps: extraction of the best predictors/features and classification model building. The models were trained using multiple classification algorithms, namely logistic regression, linear discriminant analysis (LDA), Support Vector Machines (SVM), Random Forest (RF), and Artificial Neural Networks (ANN). It was found that random forest and ANN performed the best, with prediction accuracies of 99.48% and 99.58%, respectively, for the data set with the ten most prominent features. The high prediction accuracy for the most prominent predictors suggests that if the high-frequency data can be processed in real-time, the formation being drilled can be predicted in near real-time.
This can lead to significant savings for drilling companies, as optimal drilling parameters can be computed and, in turn, optimized Mechanical Specific Energy can be obtained in real-time. Since rock formation identification is time-consuming, we also describe here an alternative approach using slightly less accurate but equally powerful dynamic predictors. In this case, we show that our dynamic predictor models with RF and ANN yielded prediction accuracies of 96.30% and 95.61%, respectively. Both the prominent feature and dynamic predictor approaches are described in detail in this paper. Our results suggest that accurately predicting rock formation type in real-time while drilling is feasible with lower computational cost and complexity. This study provides the building blocks for the development of a completely autonomous downhole device and Electronic Device Recorders (EDR) that reduce the need for highly sophisticated sensors or data transmission processes downhole.
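
The preprocessing step mentioned in the abstract (recording at 2 kHz, then downsampling to extract features) can be illustrated with a minimal sketch. The output rate of 10 Hz, the choice of per-window mean and standard deviation as features, the synthetic signal, and the function name `downsample_features` are all assumptions made for illustration; the abstract does not state the authors' actual filters or features.

```python
import math
from statistics import fmean, pstdev

def downsample_features(signal, sample_rate_hz=2000, out_rate_hz=10):
    """Split a signal into non-overlapping windows and return (mean, std) per window."""
    window = sample_rate_hz // out_rate_hz  # samples per output point (200 here)
    features = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        features.append((fmean(chunk), pstdev(chunk)))
    return features

# Hypothetical input: one second of a 50 Hz vibration riding on a constant offset of 5.0
signal = [5.0 + math.sin(2 * math.pi * 50 * n / 2000) for n in range(2000)]
feats = downsample_features(signal)
```

Each 2000-sample second is reduced to ten feature pairs, which is the kind of compact input a classifier such as RF or ANN could consume.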

Prediction of Breast Cancer Using Machine Learning

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190617160834 ◽

2020 ◽

Vol 13 (5) ◽

pp. 901-908

Author(s):

Somil Jain ◽

Puneet Kumar

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Prediction Accuracy ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

Breast Cancer Dataset

Background: Breast cancer is one of the diseases that cause a number of deaths every year across the globe. Early detection and diagnosis of this type of disease is a challenging task, but it is needed to reduce the number of deaths. Nowadays, various machine learning and data mining techniques are used for medical diagnosis and have proven their mettle in predicting chronic diseases such as cancer, which can save the lives of patients suffering from them. The major concern of this study is to find the prediction accuracy of classification algorithms such as Support Vector Machine, J48, Naïve Bayes and Random Forest, and to suggest the best algorithm. Objective: The objective of this study is to assess the prediction accuracy of the classification algorithms in terms of efficiency and effectiveness. Methods: This paper provides a detailed analysis of classification algorithms such as Support Vector Machine, J48, Naïve Bayes and Random Forest in terms of their prediction accuracy, applying the 10-fold cross-validation technique to the Wisconsin Diagnostic Breast Cancer dataset using the WEKA open-source tool. Results: The results of this study show that Support Vector Machine achieved the highest prediction accuracy of 97.89% with a low error rate of 0.14%. Conclusion: This paper provides a clear view of the performance of the classification algorithms in terms of their predictive ability, which can help medical practitioners diagnose chronic diseases such as breast cancer effectively.
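
The 10-fold cross-validation procedure named in the Methods can be sketched as follows. This is a plain, unstratified split written in standard-library Python for illustration only; the study itself used WEKA, and the function name `k_fold_indices` and the fixed seed are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# The Wisconsin Diagnostic Breast Cancer dataset has 569 instances.
folds = list(k_fold_indices(569, k=10))
```

Every instance is used for testing exactly once, and the reported accuracy would be the average over the ten held-out folds.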

ANALYSIS OF THE INFLUENCE OF MACHINE LEARNING ALGORITHM PARAMETERS ON THE RESULTS OF TRAFFIC CLASSIFICATION IN REAL TIME

T-Comm ◽

10.36724/2072-8735-2021-15-9-24-35 ◽

2021 ◽

Vol 15 (9) ◽

pp. 24-35

Author(s):

Irina A. Krasnova ◽

Keyword(s):

Machine Learning ◽

Random Forest ◽

Real Time ◽

Experimental Studies ◽

Machine Learning Algorithms ◽

Classification Model ◽

Traffic Classification ◽

Data Set ◽

Minimum Number ◽

The Impact

The paper analyzes the impact of Machine Learning algorithm parameter settings on the results of real-time traffic classification. The Random Forest and XGBoost algorithms are considered. A brief description is given of how both algorithms work and of the methods for evaluating classification results. Experimental studies are conducted on a database obtained from a real network, separately for TCP and UDP flows. So that the results of the study can be used in real time, a special feature matrix is created based on the first 15 packets of the flow. The main configurable parameters of the Random Forest (RF) algorithm are the number of trees, the partition criterion used, the maximum number of features for constructing the partition function, the depth of the tree, and the minimum number of samples in the node and in the leaf. For XGBoost, the parameters considered are the number of trees, the depth of the tree, the minimum number of samples in the leaf, and the percentage of features and of samples needed to build the tree. Increasing the number of trees increases accuracy up to a certain value, but as shown in the article, it is important to make sure that the model is not overfitted. The remaining tree parameters are used to combat overfitting. In the data set under study, eliminating overfitting increased classification accuracy for individual applications by 11-12% for Random Forest and by 12-19% for XGBoost. The results show that setting the parameters is a very important step in building a traffic classification model, because it helps to combat overfitting and significantly increases the accuracy of the algorithm's predictions. In addition, it was shown that with properly configured parameters, XGBoost, which is not very popular in traffic classification work, becomes a competitive algorithm and shows better results than the widespread Random Forest.
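
The idea of building a feature row from only the first 15 packets of a flow, so that classification can start before the flow ends, can be sketched as below. The specific features (byte counts, packet sizes, inter-arrival gaps), the packet representation, and the function name `flow_features` are assumptions for illustration; the abstract does not list the features used in the paper.

```python
from statistics import fmean

def flow_features(packets, n_first=15):
    """Build a feature row from the first n_first packets of a flow.

    packets: non-empty list of (timestamp_seconds, size_bytes) tuples in arrival order.
    """
    head = packets[:n_first]
    sizes = [size for _, size in head]
    times = [ts for ts, _ in head]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "n_packets": len(head),
        "total_bytes": sum(sizes),
        "mean_size": fmean(sizes),
        "max_size": max(sizes),
        "mean_gap": fmean(gaps) if gaps else 0.0,
    }

# Hypothetical flow of 20 packets, 10 ms apart, alternating 100 and 1500 bytes
flow = [(0.01 * i, 100 if i % 2 == 0 else 1500) for i in range(20)]
row = flow_features(flow)
```

Packets after the fifteenth are ignored, so the row is available as soon as the fifteenth packet arrives.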

Applications of Machine Learning for the Classification of Porcine Reproductive and Respiratory Syndrome Virus Sublineages Using Amino Acid Scores of ORF5 Gene

Frontiers in Veterinary Science ◽

10.3389/fvets.2021.683134 ◽

2021 ◽

Vol 8 ◽

Author(s):

Jeonghoon Kim ◽

Kyuyoung Lee ◽

Ruwini Rupasinghe ◽

Shahbaz Rezaei ◽

Beatriz Martínez-López ◽

...

Keyword(s):

Machine Learning ◽

Phylogenetic Analysis ◽

Amino Acid ◽

Random Forest ◽

Real Time ◽

Classification Model ◽

Support Vector ◽

Operating Characteristics ◽

Orf5 Gene

Porcine reproductive and respiratory syndrome is an infectious disease of pigs caused by PRRS virus (PRRSV). A modified live-attenuated vaccine has been widely used to control the spread of PRRSV, and the classification of field strains is key to successful control and prevention. Restriction fragment length polymorphism targeting the open reading frame 5 (ORF5) gene is widely used to classify PRRSV strains but has shown unstable accuracy. Phylogenetic analysis is a powerful tool for PRRSV classification with consistent accuracy, but it demands large computational power as the number of sequences increases. Our study aimed to apply four machine learning (ML) algorithms (random forest, k-nearest neighbor, support vector machine and multilayer perceptron) to classify field PRRSV strains into four clades using amino acid scores based on the ORF5 gene sequence. Our study used amino acid sequences of the ORF5 gene in 1931 field PRRSV strains collected in the US from 2012 to 2020. Phylogenetic analysis was used to label field PRRSV strains as one of four clades: Lineage 5 or three clades in Lineage 1. We measured the accuracy and time consumption of classification using the four ML approaches for different sizes of gene sequences. We found that all four ML algorithms classify a large number of field strains in a very short time (<2.5 s) with very high accuracy (area under the receiver operating characteristic curve >0.99). Furthermore, the random forest approach detects a total of 4 key amino acid positions for the classification of field PRRSV strains into four clades. Our findings provide insight for developing a rapid and accurate classification model using genetic information, which also enables us to handle large genome datasets in real time or semi-real time for data-driven decision-making and more timely surveillance.

Transformer Oil Quality Assessment Using Random Forest with Feature Engineering

Energies ◽

10.3390/en14071809 ◽

2021 ◽

Vol 14 (7) ◽

pp. 1809

Author(s):

Mohammed El Amine Senoussaoui ◽

Mostefa Brahami ◽

Issouf Fofana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Oil Quality ◽

Principal Component ◽

Condition Assessment ◽

Classification Performance ◽

Transformer Oil ◽

Classification Model ◽

Insulation Degradation ◽

Transformer Oils

Machine learning is widely used as a panacea in many engineering applications, including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: the oils that can be maintained in service, the oils that should be reconditioned or filtered, the oils that should be reclaimed, and the oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved not only through the application of the proposed algorithm but also through the data preprocessing approach. Before being fed to the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were then transformed again by principal component analysis and passed through the CFsSubset filter once more. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the decrease in the number of the datasets required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.

Data-Driven Wildfire Risk Prediction in Northern California

Atmosphere ◽

10.3390/atmos12010109 ◽

2021 ◽

Vol 12 (1) ◽

pp. 109

Author(s):

Ashima Malik ◽

Megha Rajam Rao ◽

Nandini Puppala ◽

Prathusha Koouri ◽

Venkata Anil Kumar Thota ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Curves ◽

Data Driven ◽

Northern California ◽

Combined Model ◽

Wildfire Risk ◽

Study Results ◽

Forest Models ◽

Random Forest Models

Over the years, rampant wildfires have plagued the state of California, causing economic and environmental loss. In 2018, wildfires cost nearly 800 million dollars in economic loss and claimed more than 100 lives in California. Over 1.6 million acres of land burned, causing extensive environmental damage. Although researchers have recently introduced machine learning models and algorithms for predicting wildfire risk, these results focused on special perspectives and were restricted to a limited number of data parameters. In this paper, we propose two data-driven machine learning approaches based on random forest models to predict the wildfire risk in areas near Monticello and Winters, California. This study demonstrates how the models were developed and applied with comprehensive data parameters such as powerlines, terrain, and vegetation from different perspectives, which improved the spatial and temporal accuracy of wildfire risk prediction, including fire ignition. The combined model uses the spatial and temporal parameters as a single combined dataset to train and predict the fire risk, whereas the ensemble model was fed separate parameters that were later stacked to work as a single model. Our experiment shows that the combined model produced better results in terms of accuracy than the ensemble of random forest models on separate spatial data. The models were validated with Receiver Operating Characteristic (ROC) curves, learning curves, and evaluation metrics such as accuracy, confusion matrices, and classification reports. The study achieved cutting-edge accuracy of 92% in predicting wildfire risk, including ignition, by utilizing regional spatial and temporal data along with standard data parameters in Northern California.

Classification Model Simulator: A simulator for different Machine Learning Classification Algorithms

2021 2nd International Conference for Emerging Technology (INCET) ◽

10.1109/incet51464.2021.9456348 ◽

2021 ◽

Author(s):

Abhinandan Singla ◽

Unnati Chaturvedi ◽

Preet Kanwal

Keyword(s):

Machine Learning ◽

Classification Model ◽

Classification Algorithms ◽

Machine Learning Classification

False Positive RFID Detection Using Classification Models

Applied Sciences ◽

10.3390/app9061154 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1154 ◽

Cited By ~ 11

Author(s):

Ganjar Alfian ◽

Muhammad Syafrudin ◽

Bohan Yoon ◽

Jongtae Rhee

Keyword(s):

Machine Learning ◽

Supply Chain ◽

Real Time ◽

Outlier Detection ◽

Radio Frequency Identification ◽

False Positives ◽

Machine Learning Algorithms ◽

Classification Model ◽

Automated Identification ◽

Rfid Data

Radio frequency identification (RFID) is an automated identification technology that can be utilized to monitor product movements within a supply chain in real-time. However, one problem that occurs during RFID data capturing is false positives (i.e., tags that are accidentally detected by the reader but not of interest to the business process). This paper investigates using machine learning algorithms to filter false positives. Raw RFID data were collected based on various tagged product movements, and statistical features were extracted from the received signal strength derived from the raw RFID data. Abnormal RFID data or outliers may arise in real cases. Therefore, we utilized outlier detection models to remove outlier data. The experiment results showed that machine learning-based models successfully classified RFID readings with high accuracy, and integrating outlier detection with machine learning models improved classification accuracy. We demonstrated the proposed classification model could be applied to real-time monitoring, ensuring false positives were filtered and hence not stored in the database. The proposed model is expected to improve warehouse management systems by monitoring delivered products to other supply chain partners.
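
The two preprocessing ideas in this abstract, statistical features from received signal strength and outlier removal, can be sketched with the standard library. The abstract does not name the outlier detection models used, so Tukey's interquartile-range rule is swapped in here purely as a simple stand-in; the readings, the feature set, and the function names are likewise hypothetical.

```python
from statistics import fmean, pstdev, quantiles

def rssi_features(rssi):
    """Summarize one tag's received signal strength readings (dBm)."""
    return {"mean": fmean(rssi), "std": pstdev(rssi), "min": min(rssi), "max": max(rssi)}

def drop_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]

# Hypothetical readings for one tag; -95 is a stray reading
readings = [-60, -61, -59, -62, -60, -58, -61, -95]
clean = drop_outliers_iqr(readings)
features = rssi_features(clean)
```

The resulting feature dictionary is the kind of row a classifier could use to separate tags of interest from false positives.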

Evaluation of the COVID-19 Era by Using Machine Learning and Interpretation of Confidential Dataset

Electronics ◽

10.3390/electronics10232910 ◽

2021 ◽

Vol 10 (23) ◽

pp. 2910

Author(s):

Andreas Andreou ◽

Constandinos X. Mavromoustakis ◽

George Mastorakis ◽

Jordi Mongay Batalla ◽

Evangelos Pallis

Keyword(s):

Machine Learning ◽

Real Time ◽

Research Study ◽

Nonlinear Least Squares ◽

Data Driven ◽

Machine Learning Technique ◽

Marquardt Algorithm ◽

Learning Technique ◽

Iot Devices ◽

Algorithmic Techniques

Various research approaches to COVID-19 are currently being developed using machine learning (ML) techniques and edge computing, either for identifying virus molecules or for analysing the risk of the spread of COVID-19. These approaches draw on datasets that derive either from the WHO, through its website and research portals, or from data generated in real-time by the healthcare system. Data analysis, modelling and prediction are implemented through multiple algorithmic techniques. The inability of these techniques to generate accurate predictions motivates this research study, which builds on an existing machine learning technique and achieves valuable forecasts by modifying it. More specifically, this study modifies the Levenberg–Marquardt algorithm, which is commonly used to solve nonlinear least squares problems, supports the acquisition of data from IoT devices, and analyses these data via cloud computing to generate foresight about the progress of the outbreak in real-time environments. Hence, we enhance the optimization of the trend line that interprets these data. We introduce this framework in conjunction with a novel encryption process that we propose for the datasets and the implementation of mortality predictions.
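
For context, the standard (unmodified) Levenberg–Marquardt step for a nonlinear least squares problem is given below. The symbols are the usual textbook ones and are assumptions here, not notation from the paper: $f(x_i, \beta)$ is the model with parameters $\beta$, $J$ is the Jacobian of $f$ with respect to $\beta$, and $\lambda \ge 0$ is the damping parameter. The abstract does not describe the authors' modification, so it is not shown.

```latex
\min_{\beta} \; S(\beta) = \sum_{i=1}^{m} \bigl[ y_i - f(x_i, \beta) \bigr]^2,
\qquad
\bigl( J^{\top} J + \lambda I \bigr)\, \delta = J^{\top} \bigl[ y - f(\beta) \bigr],
\qquad
\beta \leftarrow \beta + \delta
```

As $\lambda \to 0$ the step approaches the Gauss–Newton step, while a large $\lambda$ gives a short step along the gradient-descent direction.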

Distinguishing Focal Cortical Dysplasia From Glioneuronal Tumors in Patients With Epilepsy by Machine Learning

Frontiers in Neurology ◽

10.3389/fneur.2020.548305 ◽

2020 ◽

Vol 11 ◽

Author(s):

Yi Guo ◽

Yushan Liu ◽

Wenjie Ming ◽

Zhongjin Wang ◽

Junming Zhu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Focal Cortical Dysplasia ◽

Cortical Dysplasia ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Seizure Onset ◽

Glioneuronal Tumors ◽

Patients With Epilepsy

Purpose: We aimed to build a supervised machine learning-based classifier to preoperatively distinguish focal cortical dysplasia (FCD) from glioneuronal tumors (GNTs) in patients with epilepsy. Methods: This retrospective study comprised 96 patients who underwent epilepsy surgery, with a final neuropathologic diagnosis of either FCD or GNTs. Seven classical machine learning algorithms (i.e., Random Forest, SVM, Decision Tree, Logistic Regression, XGBoost, LightGBM, and CatBoost) were trained on our dataset to obtain the classification model. Ten features [i.e., Gender, Past history, Age at seizure onset, Course of disease, Seizure type, Seizure frequency, Scalp EEG biomarkers, MRI features, Lesion location, Number of antiepileptic drugs (AEDs)] were analyzed in our study. Results: We enrolled 56 patients with FCD and 40 patients with GNTs, which included 29 with gangliogliomas (GGs) and 11 with dysembryoplastic neuroepithelial tumors (DNTs). Our study demonstrated that the Random Forest-based machine learning model offered the best predictive performance in distinguishing FCD from GNTs, with an F1-score of 0.9180 and an AUC value of 0.9340. Furthermore, the most discriminative factor between FCD and GNTs was the feature "age at seizure onset", with a Chi-square value of 1,213.0, suggesting that patients who had a younger age at seizure onset were more likely to be diagnosed with FCD. Conclusion: The Random Forest-based machine learning classifier can accurately differentiate FCD from GNTs in patients with epilepsy before surgery. This might lead to improved clinician confidence in appropriate surgical planning and treatment outcomes.
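
The F1-score reported in the Results is the harmonic mean of precision and recall. A minimal sketch of the calculation follows; the counts used are hypothetical and are not taken from the study.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not taken from the study
score = f1_score(tp=8, fp=2, fn=2)
```

With these counts, precision and recall are both 0.8, so the F1-score is also 0.8.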

A Data-Driven Approach using Machine Learning to Enable Real-Time Flight Path Planning

AIAA AVIATION 2020 FORUM ◽

10.2514/6.2020-2873 ◽

2020 ◽

Author(s):

Jung-Hyun Kim ◽

Simon I. Briceno ◽

Cedric Y. Justin ◽

Dimitri Mavris

Keyword(s):

Machine Learning ◽

Path Planning ◽

Real Time ◽

Flight Path ◽

Data Driven ◽

Data Driven Approach
