Introducing a New Technical Indicator Based on Octav Onicescu Informational Energy and Compare It with Bollinger Bands for S&P 500 Movement Predictions

2019 ◽  
Author(s):  
Daia Alexandru

This research paper demonstrates the invention of the kinetic bands, based on Romanian mathematician and statistician Octav Onicescu's kinetic energy, also known as "informational energy", where we use historical data of foreign exchange currencies or indexes to predict the trend displayed by a stock or an index and whether it will go up or down in the future. Here, we explore the imperfections of the Bollinger Bands to determine a more sophisticated triplet of indicators that predict the future movement of prices in the stock market. Extreme gradient boosting modelling was conducted in Python using a historical data set from Kaggle spanning all 500 currently listed companies. A variable importance plot was produced. The results showed that the kinetic bands, derived from kinetic energy (KE), are very influential as features or technical indicators of stock market trends. Furthermore, experiments done through this invention provide tangible evidence of its empirical aspects. The machine learning code has a low chance of error if all the proper procedures and coding are in play. The experiment samples are attached to this study for future reference or scrutiny.
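Onicescu's informational energy for a discrete distribution is E = Σᵢ pᵢ², a concentration measure that can be computed on a rolling window of prices and turned into bands. A minimal Python sketch under stated assumptions: the abstract does not give the exact band construction, so the histogram-based estimate and the energy-scaled band width below are illustrative, not the paper's formula.

```python
import numpy as np

def informational_energy(window, bins=10):
    """Onicescu informational energy: sum of squared probabilities of the
    empirical distribution (estimated here with a histogram of the window)."""
    counts, _ = np.histogram(window, bins=bins)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))

def kinetic_bands(prices, window=20, k=2.0, bins=10):
    """Hypothetical 'kinetic bands': a moving average shifted up and down by a
    term scaled with the rolling informational energy. The scaling is an
    assumption for illustration; the paper's construction is not specified."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    mid = np.full(n, np.nan)
    upper = np.full(n, np.nan)
    lower = np.full(n, np.nan)
    for i in range(window, n + 1):
        w = prices[i - window:i]
        e = informational_energy(w, bins=bins)
        m, s = w.mean(), w.std()
        mid[i - 1] = m
        upper[i - 1] = m + k * s * e
        lower[i - 1] = m - k * s * e
    return mid, upper, lower
```

For a constant window all histogram mass falls in one bin, so the energy attains its maximum of 1; a uniformly spread window drives it toward 1/bins, so the band width tracks how concentrated recent prices are.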

2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
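The Rule as Feature idea can be illustrated in a few lines: the hand-crafted rule's output is appended to the feature matrix so the classifier can learn when to trust it. A toy scikit-learn sketch on synthetic data; the rule, features, and split below are invented for illustration and are not the paper's pathology data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the paper's setting: rows are report segments,
# and a hand-crafted rule emits a (possibly imperfect) label guess.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

def handcrafted_rule(x):
    # Imperfect rule: looks only at the first feature.
    return int(x[0] > 0)

rule_preds = np.array([handcrafted_rule(x) for x in X])

# Rule as Feature: the rule's output becomes one more input column,
# letting the classifier learn when to rely on it.
X_hybrid = np.column_stack([X, rule_preds])

clf_plain = LogisticRegression().fit(X[:300], y[:300])
clf_hybrid = LogisticRegression().fit(X_hybrid[:300], y[:300])

acc_plain = clf_plain.score(X[300:], y[300:])
acc_hybrid = clf_hybrid.score(X_hybrid[300:], y[300:])
```

The Classifier Confidence variant instead falls back to the rule whenever the classifier's predicted probability is close to 0.5; both reuse the rules rather than discarding them.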


2021 ◽  
Author(s):  
Sang Min Nam ◽  
Thomas A Peterson ◽  
Kyoung Yul Seo ◽  
Hyun Wook Han ◽  
Jee In Kang

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors.
Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking posed a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful for discovering depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.
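The sparse-model step this abstract describes (train a boosted model, rank variables by importance, refit on the top-ranked subset) can be sketched as follows, with scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and synthetic data in place of the survey.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the 120-variable survey data set.
X, y = make_classification(n_samples=600, n_features=30, n_informative=6,
                           random_state=0)
full = GradientBoostingClassifier(random_state=0).fit(X, y)

# Keep only the highest-importance features to obtain a sparse model,
# mirroring the paper's XGBoost feature-selection step.
k = 8
top = np.argsort(full.feature_importances_)[::-1][:k]
sparse = GradientBoostingClassifier(random_state=0).fit(X[:, top], y)
```

The retained columns (`top`) would then be the candidate factors passed on to the survey-weighted logistic regression and network analysis.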


2020 ◽  
Author(s):  
Ching-Chieh Huang ◽  
Jesyin Lai ◽  
Der-Yang Cho ◽  
Jiaxin Yu

Abstract Since the emergence of COVID-19, many hospitals have encountered challenges in performing efficient scheduling and good resource management to ensure the quality of healthcare provided to patients is not compromised. Operating room (OR) scheduling is one of the issues that has gained our attention because it is related to workflow efficiency and critical care of hospitals. Automatic scheduling and high predictive accuracy of surgical case duration have a critical role in improving OR utilization. To estimate surgical case duration, many hospitals rely on historic averages based on a specific surgeon or a specific procedure type obtained from electronic medical record (EMR) scheduling systems. However, the low predictive accuracy with EMR data leads to negative impacts on patients and hospitals, such as rescheduling of surgeries and cancellation. In this study, we aim to improve the prediction of surgical case duration with advanced machine learning (ML) algorithms. We obtained a large data set containing 170,748 surgical cases (from Jan 2017 to Dec 2019) from a hospital. The data covered a broad variety of details on patients, surgeries, specialties and surgical teams. In addition, a more recent data set with 8,672 cases (from Mar to Apr 2020) was available to be used for external evaluation. We computed historic averages from the EMR data for surgeon- or procedure-specific cases, and they were used as baseline models for comparison. Subsequently, we developed our models using linear regression, random forest and extreme gradient boosting (XGB) algorithms. All models were evaluated with R-square (R2), mean absolute error (MAE), and percentage overage (actual duration longer than prediction), underage (shorter than prediction) and within (within prediction). The XGB model was superior to the other models, achieving a higher R2 (85%) and percentage within (48%) as well as a lower MAE (30.2 min).
The total prediction errors computed for all models showed that the XGB model had the lowest inaccurate percentage (23.7%). Overall, this study applied ML techniques in the field of OR scheduling to reduce the medical and financial burden for healthcare management. The results revealed the importance of surgery and surgeon factors in surgical case duration prediction. This study also demonstrated the importance of performing an external evaluation to better validate the performance of ML models.
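The evaluation scheme named above (R2, MAE, and the overage/within/underage split) can be computed as below. The abstract does not state the tolerance that defines "within", so the ±10% band around the prediction is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def duration_metrics(actual, predicted, tol=0.1):
    """R2, MAE (minutes), and the overage/within/underage percentages used to
    compare the duration models. 'Within' means the actual duration falls
    inside +/- tol of the prediction; the 10% threshold is assumed."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    lo, hi = predicted * (1 - tol), predicted * (1 + tol)
    return {
        "r2": r2_score(actual, predicted),
        "mae_min": mean_absolute_error(actual, predicted),
        "overage": np.mean(actual > hi),    # surgery ran longer than predicted
        "underage": np.mean(actual < lo),   # surgery ran shorter than predicted
        "within": np.mean((actual >= lo) & (actual <= hi)),
    }
```

By construction the three percentages sum to 1, so a model is ranked by trading higher "within" against lower overage and underage.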


2021 ◽  
Author(s):  
Eric Sonny Mathew ◽  
Moussa Tembely ◽  
Waleed AlAmeri ◽  
Emad W. Al-Shalabi ◽  
Abdul Ravoof Shaik

Abstract A meticulous interpretation of steady-state or unsteady-state relative permeability (Kr) experimental data is required to determine a complete set of Kr curves. In this work, three different machine learning models were developed to assist in a faster estimation of these curves from steady-state drainage coreflooding experimental runs. The three models that were tested and compared were the extreme gradient boosting (XGB), deep neural network (DNN) and recurrent neural network (RNN) algorithms. Based on existing mathematical models, a leading-edge framework was developed in which a large database of Kr and Pc curves was generated. This database was used to perform thousands of coreflood simulation runs representing oil-water drainage steady-state experiments. The results obtained from these simulation runs, mainly pressure drop along with other conventional core analysis data, were utilized to estimate Kr curves based on Darcy's law. These analytically estimated Kr curves, along with the previously generated Pc curves, were fed as features into the machine learning model. The entire data set was split into 80% for training and 20% for testing. The k-fold cross-validation technique was applied to increase the model accuracy by splitting the 80% training portion into 10 folds. In this manner, for each of the 10 experiments, 9 folds were used for training and the remaining one was used for model validation. Once trained and validated, the model was subjected to blind testing on the remaining 20% of the data set. The machine learning model learned to capture fluid flow behavior inside the core from the training data set. The trained/tested model was thereby employed to estimate Kr curves based on available experimental results. The performance of the developed model was assessed using the coefficient of determination (R2) along with the loss calculated during training/validation of the model. 
The respective cross plots along with comparisons of ground-truth versus AI-predicted curves indicate that the model is capable of making accurate predictions, with error percentages between 0.2 and 0.6% on history matching experimental data for all three tested ML techniques (XGB, DNN, and RNN). This implies that the AI-based model exhibits better efficiency and reliability in determining Kr curves when compared to conventional methods. The results also include a comparison between classical machine learning approaches and shallow and deep neural networks in terms of accuracy in predicting the final Kr curves. The various models discussed in this research work currently focus on the prediction of Kr curves for drainage steady-state experiments; however, the work can be extended to capture the imbibition cycle as well.
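The 80/20 split with 10-fold cross-validation on the training portion can be sketched as follows, with scikit-learn's gradient boosting standing in for XGB and synthetic regression data in place of the simulated coreflood features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the (pressure drop, Pc curve) feature set.
X, y = make_regression(n_samples=500, n_features=8, n_informative=6,
                       noise=5.0, random_state=1)

# 80/20 split, then 10-fold cross-validation on the 80% training portion,
# mirroring the protocol described in the abstract.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

fold_scores = []
for tr_idx, va_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X_tr):
    model = GradientBoostingRegressor(random_state=1)
    model.fit(X_tr[tr_idx], y_tr[tr_idx])
    fold_scores.append(model.score(X_tr[va_idx], y_tr[va_idx]))  # per-fold R^2

final = GradientBoostingRegressor(random_state=1).fit(X_tr, y_tr)
blind_r2 = final.score(X_te, y_te)  # blind test on the held-out 20%
```

Each of the 10 experiments trains on 9 folds and validates on the tenth; only after this is the model scored once on the untouched 20%.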


2021 ◽  
Author(s):  
Ahmed Samir Rizk ◽  
Moussa Tembely ◽  
Waleed AlAmeri ◽  
Emad W. Al-Shalabi

Abstract Estimation of petrophysical properties is essential for accurate reservoir predictions. In recent years, extensive work has been dedicated to training different machine-learning (ML) models to predict petrophysical properties of digital rock using dry rock images along with data from single-phase direct simulations, such as the lattice Boltzmann method (LBM) and the finite volume method (FVM). The objective of this paper is to present a comprehensive literature review on petrophysical properties estimation from dry rock images using different ML workflows and direct simulation methods. The review provides a detailed comparison between different ML algorithms that have been used in the literature to estimate porosity, permeability, tortuosity, and effective diffusivity. In this paper, various ML workflows from the literature are screened and compared in terms of the training data set, the testing data set, the extracted features, the algorithms employed as well as their accuracy. A thorough description of the most commonly used algorithms is also provided to better understand the functionality of these algorithms to encode the relationship between the rock images and their respective petrophysical properties. The review of various ML workflows for estimating rock petrophysical properties from dry images shows that models trained using features extracted from the image (physics-informed models) outperformed models trained on the dry images directly. In addition, certain tree-based ML algorithms, such as random forest, gradient boosting, and extreme gradient boosting can produce accurate predictions that are comparable to deep learning algorithms such as deep neural networks (DNNs) and convolutional neural networks (CNNs). To the best of our knowledge, this is the first work dedicated to exploring and comparing different ML frameworks that have recently been used to accurately and efficiently estimate rock petrophysical properties from images. 
This work will enable other researchers to have a broad understanding of the topic and help in developing new ML workflows or further modifying existing ones in order to improve the characterization of rock properties. Also, this comparison represents a guide to understanding the performance and applicability of different ML algorithms. Moreover, the review helps researchers in this area to cope with digital innovations in porous media characterization in this fourth industrial age: oil and gas 4.0.


2021 ◽  
Vol 20 ◽  
pp. 153303382110163
Author(s):  
Melek Yakar ◽  
Durmus Etiz ◽  
Muzaffer Metintas ◽  
Guntulu Ak ◽  
Ozer Celik

Background: Radiation pneumonitis (RP) is a dose-limiting toxicity in lung cancer radiotherapy (RT). As risk factors in the development of RP, patient and tumor characteristics, dosimetric parameters, and treatment features are intertwined, and it is not always possible to associate RP with a single parameter. This study aimed to determine the algorithm that most accurately predicted RP development with machine learning. Methods: Of the 197 cases diagnosed with stage III lung cancer that underwent RT and chemotherapy between 2014 and 2020, 193 were evaluated. The CTCAE 5.0 grading system was used for the RP evaluation. The synthetic minority oversampling technique was used to create a balanced data set. Logistic regression, artificial neural networks, eXtreme Gradient Boosting (XGB), Support Vector Machines, Random Forest, Gaussian Naive Bayes and Light Gradient Boosting Machine (LGBM) algorithms were used. After the correlation analysis, a permutation-based method was utilized for variable selection. Results: RP was seen in 51 of the 193 cases. Parameters affecting RP were determined as total (t)V5, ipsilateral lung Dmax, contralateral lung Dmax, total lung Dmax, gross tumor volume, number of chemotherapy cycles before RT, tumor size, lymph node localization and asbestos exposure. LGBM was found to be the algorithm that best predicted RP at 85% accuracy (confidence interval: 0.73-0.96), 97% sensitivity, and 50% specificity. Conclusion: When the clinical and dosimetric parameters were evaluated together, the LGBM algorithm had the highest accuracy in predicting RP. However, in order to use this algorithm in clinical practice, it is necessary to increase data diversity and the number of patients by sharing data between centers.
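Two of the steps this abstract names, minority oversampling to balance the data set and permutation-based variable importance, can be sketched as follows. The oversampler below is a simplified SMOTE-style interpolation between random minority pairs (a library implementation such as imbalanced-learn's SMOTE, which interpolates toward k-nearest neighbours, would be used in practice), and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

def smote_like(X_min, n_new, rng):
    """Simplified SMOTE-style oversampling: new minority samples are
    interpolations between two randomly chosen minority samples."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    lam = rng.random((n_new, 1))
    return X_min[i] + lam * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
# Imbalanced synthetic stand-in for the RP / no-RP cohort (51 of 193 events).
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)
X_min = X[y == 1]
n_new = (y == 0).sum() - (y == 1).sum()
X_bal = np.vstack([X, smote_like(X_min, n_new, rng)])
y_bal = np.concatenate([y, np.ones(n_new, dtype=int)])

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
# Permutation-based variable importance, as in the paper's selection step.
imp = permutation_importance(clf, X_bal, y_bal, n_repeats=5, random_state=0)
```

Variables whose permutation drops the score most (`imp.importances_mean`) would be the candidates retained as RP predictors.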


2020 ◽  
Author(s):  
Ibrahim Karabayir ◽  
Suguna Pappu ◽  
Samuel Goldman ◽  
Oguz Akbilgic

Abstract Background: Parkinson’s Disease (PD) is a clinically diagnosed neurodegenerative disorder that affects both motor and non-motor neural circuits. Speech deterioration (hypokinetic dysarthria) is a common symptom, which often presents early in the disease course. Machine learning can help movement disorders specialists improve their diagnostic accuracy using non-invasive and inexpensive voice recordings. Method: We used the “Parkinson Dataset with Replicated Acoustic Features Data Set” from the UCI Machine Learning repository. The dataset included 45 features, comprising sex and 44 speech-test-based acoustic features, from 40 patients with Parkinson’s disease and 40 controls. We analyzed the data using various machine learning algorithms, including tree-based ensemble approaches such as random forest and extreme gradient boosting. We also implemented a variable importance analysis to identify important variables classifying patients with PD. Results: The cohort included a total of 80 subjects: 40 patients with PD (55% men) and 40 controls (67.5% men). PD patients showed at least two of the three symptoms: resting tremor, bradykinesia, or rigidity. All patients were over 50 years old, and the mean ages for PD subjects and controls were 69.6 (SD 7.8) and 66.4 (SD 8.4) years, respectively. Our final model provided an AUC of 0.940 with a 95% confidence interval of 0.935-0.945 in 4-fold cross-validation using only six acoustic features: Delta3 (Run 2), Delta0 (Run 3), MFCC4 (Run 2), Delta10 (Run 2/Run 3), MFCC10 (Run 2) and Jitter_Rap (Run 1/Run 2). Conclusions: Machine learning can accurately detect Parkinson’s disease using inexpensive and non-invasive voice recordings. Such technologies can be deployed into smartphones for screening of large patient populations for Parkinson’s disease.
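The 4-fold cross-validated AUC evaluation can be reproduced in outline with scikit-learn; the data below are a synthetic stand-in for the 44 acoustic features of the 80 subjects, and random forest is one of the tree-based ensembles the study mentions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 80 subjects, 44 acoustic features, binary PD label.
X, y = make_classification(n_samples=80, n_features=44, n_informative=6,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# 4-fold cross-validated AUC, as in the paper's evaluation.
aucs = cross_val_score(clf, X, y, cv=4, scoring="roc_auc")
mean_auc = aucs.mean()
```

The study's six-feature final model would simply restrict `X` to the selected columns before running the same cross-validation.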


10.2196/32771 ◽  
2021 ◽  
Vol 9 (10) ◽  
pp. e32771
Author(s):  
Seo Jeong Shin ◽  
Jungchan Park ◽  
Seung-Hwa Lee ◽  
Kwangmo Yang ◽  
Rae Woong Park

Background Myocardial injury after noncardiac surgery (MINS) is associated with increased postoperative mortality, but the relevant perioperative factors that contribute to the mortality of patients with MINS have not been fully evaluated. Objective To establish a comprehensive body of knowledge relating to patients with MINS, we searched for the best-performing predictive model based on machine learning algorithms. Methods Using clinical data from 7629 patients with MINS from the clinical data warehouse, we evaluated 8 machine learning algorithms for accuracy, precision, recall, F1 score, area under the receiver operating characteristic (AUROC) curve, and area under the precision-recall curve to investigate the best model for predicting mortality. Feature importance and Shapley Additive Explanations values were analyzed to explain the role of each clinical factor in patients with MINS. Results Extreme gradient boosting outperformed the other models. The model showed an AUROC of 0.923 (95% CI 0.916-0.930). The AUROC of the model did not decrease in the test data set (0.894, 95% CI 0.86-0.922; P=.06). Antiplatelet drug prescription, elevated C-reactive protein level, and beta-blocker prescription were associated with reduced 30-day mortality. Conclusions Predicting the mortality of patients with MINS was shown to be feasible using machine learning. By analyzing the impact of predictors, markers that should be cautiously monitored by clinicians may be identified.
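Confidence intervals like the reported AUROC of 0.923 (95% CI 0.916-0.930) are commonly obtained by bootstrapping the test set. The abstract does not state the interval method used, so the percentile bootstrap below is one plausible sketch, run on synthetic data with scikit-learn's gradient boosting standing in for XGBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
point = roc_auc_score(y_te, probs)

# Percentile bootstrap: resample the test set with replacement and
# take the 2.5th/97.5th percentiles of the resulting AUROC values.
rng = np.random.default_rng(0)
boot = []
for _ in range(200):
    idx = rng.integers(0, len(y_te), size=len(y_te))
    if len(np.unique(y_te[idx])) < 2:
        continue  # AUROC is undefined if only one class is resampled
    boot.append(roc_auc_score(y_te[idx], probs[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
```

SHAP values would then be layered on top of the fitted model to attribute each prediction to individual clinical factors.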


Materials ◽  
2021 ◽  
Vol 14 (24) ◽  
pp. 7669
Author(s):  
Sikandar Ali Khokhar ◽  
Touqeer Ahmed ◽  
Rao Arsalan Khushnood ◽  
Syed Muhammad Ali ◽  
Shahnawaz

Due to the exceptional qualities of fiber-reinforced concrete, its application is expanding day by day. However, its mix design is mainly based on extensive experimentation. This study aims to construct a machine learning model capable of predicting the fracture behavior of all conceivable fiber-reinforced concrete subclasses, especially strain-hardening engineered cementitious composites. This study evaluates 15 input parameters that include the ingredients of the mix design and the fiber properties. As a result, it predicts, for the first time, the post-peak fracture behavior of fiber-reinforced concrete matrices. Five machine learning models are developed, and their outputs are compared. These include artificial neural networks, the support vector machine, the classification and regression tree, Gaussian process regression, and the extreme gradient boosting tree. Due to the small size of the available dataset, this article employs a technique called the generative adversarial network (GAN) to build a virtual data set to augment the data and improve accuracy. The results indicate that the extreme gradient boosting tree model has the lowest error and is therefore the best performer in predicting fiber-reinforced concrete properties. This article is anticipated to provide a considerable improvement in the recipe design of effective fiber-reinforced concrete formulations.
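The five-model comparison can be outlined with scikit-learn analogues: an MLP for the artificial neural network, SVR, a decision tree for the classification and regression tree, Gaussian process regression, and gradient boosting standing in for the extreme gradient boosting tree. The data are synthetic stand-ins for the mix-design and fiber-property inputs, and the GAN augmentation step is omitted here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 15 mix-design/fiber-property inputs.
X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "ANN": MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
    "SVM": SVR(),
    "CART": DecisionTreeRegressor(random_state=0),
    "GPR": GaussianProcessRegressor(),
    "XGB-like": GradientBoostingRegressor(random_state=0),
}
# Fit each model and score held-out R^2, as in the paper's comparison.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

With a real dataset this small, the GAN-generated virtual samples would be appended to `X_tr`/`y_tr` before fitting, which is the augmentation step the article relies on.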



