Attempt of Early Stuck Detection Using Unsupervised Deep Learning With Probability Mixture Model

2021
Author(s):
Tomoya Inoue
Yujin Nakagawa
Ryota Wada
Keisuke Miyoshi
Shungo Abe
...

Abstract The early detection of a stuck pipe during drilling operations is challenging and crucial. Some studies on stuck detection have adopted supervised machine learning approaches with ordinary support vector machines or neural networks trained on datasets labeled "stuck" and "normal". However, for early detection before sticking occurs, ordinary supervised machine learning raises several concerns, such as limited stuck data, the lack of an exact "stuck sign" before the event, and the various mechanisms involved in pipe sticking. This study acquires surface drilling data from various wells belonging to several agencies, examines the effectiveness of multiple learning models, and discusses the possibility of detecting pipe sticking early, before it occurs. Unsupervised machine learning trained only on data from normal activities is a promising approach for early stuck detection and is adopted in this study. In addition, because even normal activities involve various operations, we apply unsupervised learning with multiple learning models as a countermeasure to this concern.
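
As a rough illustration of the unsupervised direction described above, the sketch below fits a Gaussian mixture model only on normal-operation surface data and scores incoming samples by their negative log-likelihood. The channel names, component count, and alarm threshold are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: unsupervised early-stuck detection with a Gaussian mixture model
# fit only on "normal" drilling data. Features and thresholds are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def fit_normal_model(normal_surface_data: np.ndarray, n_components: int = 4):
    """Fit a scaler and a GMM on surface drilling channels recorded during normal operations."""
    scaler = StandardScaler().fit(normal_surface_data)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full", random_state=0)
    gmm.fit(scaler.transform(normal_surface_data))
    return scaler, gmm

def anomaly_scores(scaler, gmm, new_data: np.ndarray) -> np.ndarray:
    """Higher score = less likely under the 'normal' model = possible stuck precursor."""
    return -gmm.score_samples(scaler.transform(new_data))

# Usage (hypothetical arrays of e.g. hook load, torque, standpipe pressure, RPM):
# scaler, gmm = fit_normal_model(X_normal)
# scores = anomaly_scores(scaler, gmm, X_stream)
# alarms = scores > np.percentile(anomaly_scores(scaler, gmm, X_normal), 99.5)
```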

2021
Vol 297
pp. 01073
Author(s):
Sabyasachi Pramanik
K. Martin Sagayam
Om Prakash Jena

Cancer has been described as a diverse illness with several distinct subtypes that may occur simultaneously. As a result, early detection and prognosis of cancer types have become essential in cancer research, since they can help improve the clinical management of cancer patients. The importance of categorizing cancer patients into higher- or lower-risk groups has prompted numerous research teams from the bioscience and genomics fields to investigate the use of machine learning (ML) algorithms in cancer diagnosis and treatment. Accordingly, these methods have been used with the goal of modeling the development and treatment of malignant diseases in humans. Furthermore, the capacity of machine learning techniques to identify important characteristics in complex datasets demonstrates the significance of these technologies, which include Bayesian networks and artificial neural networks, along with a number of other approaches. Decision trees and support vector machines, which have already been used extensively in cancer research to create predictive models, also lead to accurate decision making. The application of machine learning techniques may undoubtedly enhance our knowledge of cancer development; nevertheless, a sufficient degree of validation is required before these approaches can be considered for daily clinical practice. This paper presents an overview of current machine learning approaches used to model cancer development. All of the supervised machine learning approaches described here, along with a variety of input features and data samples, are used to build the prediction models. In light of the increasing use of machine learning methods in biomedical research, we review the most recent papers that have applied these approaches to predict cancer risk or patient outcomes in order to better understand cancer.
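
As a minimal, hedged illustration of the kind of supervised models the review surveys (not taken from the review itself), the sketch below cross-validates an SVM and a decision tree on a public breast cancer dataset bundled with scikit-learn.

```python
# Illustrative only: two of the surveyed supervised models evaluated on a
# public cancer dataset with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=0)

for name, clf in [("SVM", svm_clf), ("Decision tree", tree_clf)]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```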


2019
Vol 11 (10)
pp. 1195
Author(s):
Minsang Kim
Myung-Sook Park
Jungho Im
Seonyoung Park
Myong-In Lee

This study compared detection skill for tropical cyclone (TC) formation using models based on three different machine learning (ML) algorithms, namely decision trees (DT), random forest (RF), and support vector machines (SVM), and a model based on Linear Discriminant Analysis (LDA). Eight predictors were derived from WindSat satellite measurements of ocean surface wind and precipitation over the western North Pacific for 2005–2009. All of the ML approaches performed better, with significantly higher hit rates of 94 to 96% compared with LDA (~77%), although the false alarm rates of the ML models were slightly higher (21–28%) than that of LDA (~13%). In addition, the ML models could detect TC formation as early as 26–30 h before the system was first diagnosed as a tropical depression in the JTWC best track, which was also 5 to 9 h earlier than LDA. The skill differences among the ML models were relatively small compared with the difference between the ML models and LDA. Large year-to-year variation in forecast lead time was common to all models because of the sampling limitations of the orbiting satellite. This study highlights that ML approaches provide improved skill for detecting TC formation compared with conventional linear approaches.
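
A minimal sketch of this kind of comparison follows, assuming eight generic predictors and a simple hit-rate / false-alarm evaluation; the synthetic data and the false-alarm definition used here are assumptions, not the study's exact protocol.

```python
# Sketch: comparing DT, RF, and SVM classifiers with LDA on hit rate and
# false alarm ratio. The 8 predictors are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def hit_and_false_alarm(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # hit rate = TP/(TP+FN); false alarm ratio = FP/(FP+TP) (one common definition)
    return tp / (tp + fn), fp / (fp + tp)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                      # 8 predictors (placeholder)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0.8).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    hr, far = hit_and_false_alarm(y_te, model.predict(X_te))
    print(f"{name}: hit rate {hr:.2f}, false alarm ratio {far:.2f}")
```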


2016
Vol 23 (2)
pp. 124
Author(s):
Douglas Detoni
Cristian Cechinel
Ricardo Araujo Matsumura
Daniela Francisco Brauner

Student dropout is one of the main problems faced by distance learning courses. One of the major challenges for researchers is to develop methods to predict student behavior so that teachers and tutors can identify at-risk students as early as possible and provide assistance before they drop out or fail their courses. Machine learning models have been used to predict or classify students in these settings. However, while these models have shown promising results in several settings, they usually attain them using attributes that are not immediately transferable to other courses or platforms. In this paper, we provide a methodology for classifying students using only the interaction counts of each student. We evaluate this methodology on a dataset from two majors offered on the Moodle platform. We run experiments consisting of training and evaluating three machine learning models (Support Vector Machines, Naive Bayes, and AdaBoost decision trees) under different scenarios. We provide evidence that patterns in interaction counts can provide useful information for classifying at-risk students. This classification allows the activities presented to at-risk students to be customized (automatically or through tutors) in an attempt to prevent them from dropping out.
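
A minimal sketch of this setup is given below, assuming synthetic per-student interaction counts and an illustrative at-risk label; the three classifiers match those named in the abstract, but the features and data are placeholders.

```python
# Sketch: classifying at-risk students from interaction counts with the three
# models mentioned above. Counts and labels are synthetic assumptions.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Hypothetical interaction counts per student (e.g. forum posts, page views, quiz attempts)
X = rng.poisson(lam=5, size=(500, 3))
y = (X.sum(axis=1) + rng.normal(scale=2, size=500) < 12).astype(int)  # 1 = at risk (illustrative)

models = {
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),  # decision stumps by default
}
for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {acc:.2f}")
```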


Author(s):
Yuta Maeda
Yoshiko Yamanaka
Takeo Ito
Shinichiro Horikawa

Summary We propose a new algorithm, focusing on spatial amplitude patterns, to automatically detect volcano seismic events in continuous waveforms. Candidate seismic events are first detected based on signal-to-noise ratios. The algorithm then uses supervised machine learning to classify the candidate events into true and false categories. The input learning data are the ratios of the number of time samples with amplitudes greater than the background noise level, computed at 1 s intervals at every station site (large amplitude ratios), and a manual classification table in which 'true' or 'false' flags are assigned to candidate events. A two-step approach is implemented in our procedure. First, using the large amplitude ratios at all stations, a neural network model representing a continuous spatial distribution of large-amplitude probabilities is estimated at 1 s intervals. Second, several features are extracted from these spatial distributions, and the relation between the features and the classification into true and false events is learned by a support vector machine. This two-step approach is essential for handling temporal loss of data and station installation, movement, or removal. We evaluated the algorithm using data from Mt. Ontake, Japan, during the first ten days of a dense observation trial in the summit region (November 1–10, 2017). The results showed a classification accuracy of more than 97 per cent.
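
A minimal sketch of the second step only (the SVM classification) follows, under the assumption that each candidate event has been reduced to a few features summarizing its spatial probability distribution; the feature names and synthetic labels are illustrative, not the authors' actual inputs.

```python
# Sketch: SVM mapping per-event features (summaries of the spatial distribution
# of large-amplitude probabilities) to a true/false label. Data are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_events = 300
# Hypothetical per-event features, e.g. peak probability, area above 0.5,
# centroid distance from the summit, duration of the high-probability patch.
X = rng.normal(size=(n_events, 4))
y = (X[:, 0] + 0.8 * X[:, 1] > 0.3).astype(int)   # 1 = true volcano-seismic event (illustrative)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```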


2020
Vol 11 (40)
pp. 8-23
Author(s):
Pius MARTHIN
Duygu İÇEN

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. With the wealth of information about users' satisfaction and experience with a particular drug, pharmaceutical companies use online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models that facilitate decision making in various fields. In this manuscript we applied the drug review dataset used by (Gräßer, Kallumadi, Malberg, & Zaunseder, 2018), freely available from the machine learning repository of the University of California Irvine (UCI), to identify the machine learning model that best predicts overall drug performance from users' reviews. Apart from several manipulations done to improve model accuracy, all procedures required for text analysis were followed, including text cleaning and the transformation of texts to a numeric format suitable for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and a word cloud to explore the most frequent terms. Due to scalability issues, we used only a sample of the dataset: 15,000 observations were randomly sampled from the 161,297 training reviews and 10,000 observations from the 53,766 testing reviews. Several machine learning models were trained using 10-fold cross-validation under stratified random sampling. The trained models include Classification and Regression Trees (CART), a C5.0 classification tree, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), support vector machines (SVM) with both radial and linear kernels, and a random forest classifier. Model selection was based on a comparison of accuracy and computational efficiency. The SVM with a linear kernel performed best, with an accuracy of 83%. Using only a small portion of the dataset, we attained reasonable accuracy by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
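
A minimal sketch of the best-performing pipeline described above (TF-IDF weighting, LSA via truncated SVD, and a linear-kernel SVM under stratified 10-fold cross-validation) is shown below; the loader `load_uci_drug_reviews_sample` is a hypothetical placeholder and the hyperparameters are assumptions.

```python
# Sketch: TF-IDF -> LSA -> linear SVM evaluated with stratified 10-fold CV.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate_tfidf_lsa_svm(reviews, labels):
    """Mean 10-fold stratified CV accuracy of a TF-IDF -> LSA -> linear SVM pipeline."""
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english", min_df=2),  # TF-IDF weighted term-document matrix
        TruncatedSVD(n_components=100, random_state=0),   # Latent Semantic Analysis
        LinearSVC(C=1.0),
    )
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(pipeline, reviews, labels, cv=cv, scoring="accuracy").mean()

# Usage (hypothetical): reviews, labels = load_uci_drug_reviews_sample()
# print(evaluate_tfidf_lsa_svm(reviews, labels))
```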


2021
Vol 11 (18)
pp. 8438
Author(s):
Muhammad Mujahid
Ernesto Lee
Furqan Rustam
Patrick Bernard Washington
Saleem Ullah
...

Amid the worldwide COVID-19 pandemic lockdowns, the closure of educational institutes led to an unprecedented rise in online learning. To limit the impact of COVID-19 and curb its spread, educational institutions closed their campuses immediately and moved academic activities to e-learning platforms. The effectiveness of e-learning is a critical concern for both students and parents, specifically in terms of its suitability for students and teachers and its technical feasibility in different social scenarios. Such concerns must be reviewed from several aspects before e-learning can be adopted at such a large scale. This study investigates the effectiveness of e-learning by analyzing people's sentiments about it. With the recent rise of social media as an important mode of communication, people's views can be found on platforms such as Twitter, Instagram, and Facebook. This study uses a Twitter dataset containing 17,155 tweets about e-learning. Machine learning and deep learning approaches have shown their suitability, capability, and potential for image processing, object detection, and natural language processing tasks, and text analysis is no exception. Machine learning approaches have been widely used both for annotation and for text and sentiment analysis. Keeping in view the adequacy and efficacy of machine learning models, this study adopts TextBlob, VADER (Valence Aware Dictionary for Sentiment Reasoning), and SentiWordNet to analyze the polarity and subjectivity scores of the tweets. Furthermore, bearing in mind that machine learning models can achieve high classification accuracy, various machine learning models are used for sentiment classification. Two feature extraction techniques, TF-IDF (Term Frequency-Inverse Document Frequency) and BoW (Bag of Words), are used to build and evaluate the models. All models are evaluated in terms of important performance metrics such as accuracy, precision, recall, and F1 score. The results reveal that the random forest and support vector machine classifiers achieve the highest accuracy of 0.95 when used with BoW features. Performance is compared for TextBlob, VADER, and SentiWordNet, as well as for the machine learning models and deep learning models such as CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), CNN-LSTM, and Bi-LSTM (Bidirectional LSTM). Additionally, topic modeling is performed to find the problems associated with e-learning; it indicates that uncertainty about campus opening dates, children's difficulties in grasping online education, and the lack of efficient networks for online education are the top three problems.
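
A minimal sketch of the labeling-plus-classification flow described above follows, using TextBlob polarity to assign sentiment labels and a random forest trained on bag-of-words features; the tweet loader and thresholds are hypothetical assumptions, and the study's other lexicons and models are omitted here.

```python
# Sketch: TextBlob polarity provides sentiment labels; a random forest is
# trained on BoW features. Tweets are placeholders, not the study's dataset.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from textblob import TextBlob

def label_by_polarity(texts):
    """Map TextBlob polarity to positive (1), neutral (0), negative (-1)."""
    labels = []
    for text in texts:
        p = TextBlob(text).sentiment.polarity
        labels.append(1 if p > 0 else (-1 if p < 0 else 0))
    return labels

def train_bow_random_forest(texts, labels):
    clf = make_pipeline(
        CountVectorizer(stop_words="english"),              # BoW features
        RandomForestClassifier(n_estimators=200, random_state=0),
    )
    return clf.fit(texts, labels)

# Usage (hypothetical):
# tweets = load_elearning_tweets()
# model = train_bow_random_forest(tweets, label_by_polarity(tweets))
```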


2017
Author(s):
Chin Lin
Chia-Jung Hsu
Yu-Sheng Lou
Shih-Jen Yeh
Chia-Cheng Lee
...

BACKGROUND Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN). OBJECTIVE Our objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes in discharge notes. METHODS We used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. We conducted the evaluation using 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017 at the Tri-Service General Hospital in Taipei, Taiwan. We used the receiver operating characteristic curve as an evaluation measure, and calculated the area under the curve (AUC) and F-measure as global measures of effectiveness. RESULTS In 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes. CONCLUSIONS Word embedding combined with a CNN showed outstanding performance compared with traditional methods, needing very little data preprocessing. This shows that future studies will not be limited by incomplete dictionaries. A large amount of unstructured information from free-text medical writing will be extracted by automated approaches in the future, and we believe that the health care field is about to enter the age of big data.
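
A minimal Keras sketch of a word-embedding-plus-CNN text classifier in the spirit of the proposed method is given below; the vocabulary size, sequence length, chapter count, and single-convolution architecture are assumptions, not the authors' network.

```python
# Sketch: embedding + 1D convolution + global max pooling for multi-label
# chapter-level code prediction. All sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # tokens kept from discharge notes (assumption)
SEQ_LEN = 500        # tokens per note after padding/truncation (assumption)
EMBED_DIM = 100      # embedding dimension (assumption)
N_CHAPTERS = 21      # number of chapter-level labels (assumption)

def build_cnn_classifier():
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),        # could be initialized from a pretrained embedding
        layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(N_CHAPTERS, activation="sigmoid"),  # multi-label: a note may map to several chapters
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# Usage (hypothetical): model = build_cnn_classifier(); model.fit(X_tokens, y_chapters, epochs=5)
```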


Author(s):
Aditi Vadhavkar
Pratiksha Thombare
Priyanka Bhalerao
Utkarsha Auti

Forecasting mechanisms such as machine learning (ML) models have been proving their significance in anticipating perioperative outcomes and supporting decisions on the future course of action. Many application domains have witnessed the use of ML models to identify and prioritize adverse factors for a threat. The spread of COVID-19 has proven to be a great threat to mankind and has been declared a worldwide pandemic. Many parts of the world have faced the enormous infectivity and contagiousness of this illness. To examine the threatening factors of COVID-19, we used four machine learning models: Linear Regression (LR), Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine (SVM), and Exponential Smoothing (ES). The results show that ES performs best among the four models employed in this study, followed by LR and LASSO, which perform well in forecasting newly confirmed cases, death rates, and recovery rates, whereas SVM performs poorly in all prediction scenarios given the available dataset.
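
A minimal sketch of the four models named above applied to a synthetic daily case series follows, with the day index as the only regressor; the data, hyperparameters, and single-feature setup are assumptions.

```python
# Sketch: LR, LASSO, SVM (SVR), and exponential smoothing fit to a synthetic
# cumulative-cases series and used to forecast the held-out tail.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.svm import SVR
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
days = np.arange(120).reshape(-1, 1)
cases = np.cumsum(rng.poisson(lam=50, size=120)).astype(float)   # synthetic cumulative cases

train_days, test_days = days[:100], days[100:]
train_cases = cases[:100]

forecasts = {
    "LR": LinearRegression().fit(train_days, train_cases).predict(test_days),
    "LASSO": Lasso(alpha=1.0).fit(train_days, train_cases).predict(test_days),
    "SVM": SVR(kernel="rbf", C=100.0).fit(train_days, train_cases).predict(test_days),
    "ES": ExponentialSmoothing(train_cases, trend="add").fit().forecast(len(test_days)),
}
for name, pred in forecasts.items():
    print(name, np.round(pred[:3], 1))   # first three forecasted values per model
```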


2021
Vol 10 (4)
Author(s):
Andrew Falcon
Tianshu Lyu

We perform a comparative analysis of machine learning models for time-series forecasting of the sign of next-day cryptocurrency returns. We begin by compiling a proprietary dataset that encompasses a wide array of potential cryptocurrency valuation factors (price trends, liquidity, volatility, network, production, and investor attention), and then identify and evaluate the most significant factors. We apply eight machine learning models to the dataset, using them as classifiers to predict the sign of next-day price returns for the three largest cryptocurrencies by market capitalization: bitcoin, ethereum, and ripple. We show that the most significant valuation factors for cryptocurrency returns are price trend variables, specifically seven- and thirty-day reversal. We conclude that support vector machines produce the most accurate classifications for all three cryptocurrencies. Additionally, we find that boosted models such as AdaBoost and XGBoost have the poorest classification accuracy. Finally, we construct a probability-based trading strategy that takes either a daily long or short position in one of the three examined cryptocurrencies. The strategy yields a Sharpe ratio of 2.8 and a cumulative log return of 3.72. On average, the strategy's log returns outperformed standalone investments in all three cryptocurrencies by a factor of 5.64, and its Sharpe ratio was more than three times higher.
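
A minimal sketch of this classification setup follows, assuming only two trend features (seven- and thirty-day returns) in place of the paper's full factor set; the helper names and price series are hypothetical.

```python
# Sketch: SVM classifier for the sign of next-day returns built from lagged
# trend features. Features and data are illustrative placeholders.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_features(prices: pd.Series) -> pd.DataFrame:
    feats = pd.DataFrame({
        "ret_7d": prices.pct_change(7),    # seven-day reversal proxy
        "ret_30d": prices.pct_change(30),  # thirty-day reversal proxy
    })
    feats["target"] = (prices.pct_change().shift(-1) > 0).astype(int)  # sign of next-day return
    return feats.dropna()

def fit_sign_classifier(prices: pd.Series):
    data = build_features(prices)
    X, y = data[["ret_7d", "ret_30d"]], data["target"]
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    return clf.fit(X, y)

# Usage (hypothetical): clf = fit_sign_classifier(btc_close); p_up = clf.predict_proba(X_latest)
```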


2021
Author(s):
Mohammed Alghazal
Dimitrios Krinis

Abstract The dielectric log is a specialized tool with proprietary procedures for predicting oil saturation independent of water salinity. Conventional resistivity logging is more routinely used but depends on water salinity and Archie's parameters, leading to high measurement uncertainty in mixed-salinity environments. This paper presents a novel machine learning approach for extending the coverage of dielectric-based oil saturation using features extracted from commonly available reservoir information, petrophysical properties, and conventional log data. More than 20 features were extracted from several sources. Based on sampling frequency, the extracted features are divided into well-based discrete features and petrophysical continuous features. Examples of well-based features include well location with respect to flank (east or west), fluid viscosities and densities, total dissolved solids from surface water, distance to the nearest water injector, and injection volume. Petrophysical features include height above free water level (HAFWL), porosity, modelled permeability, initial water saturation, resistivity-based saturation, rock type, and caliper. In addition, we engineered two new depth-related continuous features, which we call Height-Below-Crest (HBC) and Height-Above-Top-Injector-Zone (HATIZ). Initial data exploration was performed using a Pearson's correlation heat map. Fluid densities and viscosities show strong correlation (60-80%) with the engineered features (HBC and HATIZ), which helped capture the effect of viscous and gravity forces across the well's vertical depth. The heat map also shows weak correlation between the features and the target variable, the oil saturation from the dielectric log. The dataset, with 5000 samples, was randomly split into 80% training and 20% testing. A scaling technique robust to outliers was used to scale the features prior to modeling. The preliminary performance of various supervised machine learning models, including decision trees, ensemble methods, neural networks, and support vector machines, was benchmarked using K-fold cross-validation on the training data prior to testing. The ensemble methods, random forest and gradient boosting, produced the lowest mean absolute error and were therefore selected for further hyper-parameter tuning. An exhaustive grid search was performed on both models to find the best-fit parameters, achieving a correlation coefficient of 70% on the testing dataset. Feature analysis indicates that the engineered features, HBC and HATIZ, along with porosity, HAFWL, and resistivity-based saturation, are the most important features for predicting the oil saturation from the dielectric log. The dielectric log provides an edge over resistivity-based logging in mixed-salinity formations, but with more elaborate interpretation procedures. In this paper, we present a soft-computing and economical alternative that uses ensemble machine learning models to predict oil saturation from the dielectric log given features extracted from common reservoir information, petrophysical properties, and conventional log data.
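
A minimal sketch of the modeling workflow described above (robust scaling, K-fold benchmarking, and grid search over random forest and gradient boosting) is shown below; the feature matrix, parameter grids, and helper name are illustrative assumptions.

```python
# Sketch: robust scaling + K-fold grid search over two ensemble regressors,
# scored by mean absolute error, to predict dielectric-log oil saturation.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

def tune_ensembles(X_train, y_train):
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    candidates = {
        "random_forest": (RandomForestRegressor(random_state=0),
                          {"model__n_estimators": [200, 500], "model__max_depth": [None, 10]}),
        "gradient_boosting": (GradientBoostingRegressor(random_state=0),
                              {"model__n_estimators": [200, 500], "model__learning_rate": [0.05, 0.1]}),
    }
    best = {}
    for name, (estimator, grid) in candidates.items():
        pipe = Pipeline([("scale", RobustScaler()), ("model", estimator)])  # scaling robust to outliers
        search = GridSearchCV(pipe, grid, cv=cv, scoring="neg_mean_absolute_error")
        best[name] = search.fit(X_train, y_train)
    return best

# Usage (hypothetical): results = tune_ensembles(X_train, y_sat_dielectric)
```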

