Developing machine learning models to predict CO2 trapping performance in deep saline aquifers

Author(s):  
Hung Vo Thanh ◽  
Kang-Kun Lee

Abstract Deep saline formations are considered potential sites for geological carbon storage (GCS). To better understand the CO2 trapping mechanisms in saline aquifers, robust tools are needed to evaluate CO2 trapping efficiency. This paper introduces the application of Gaussian process regression (GPR), support vector machine (SVM), and random forest (RF) models to predict CO2 trapping efficiency in saline formations. First, uncertainty variables, including geologic parameters, petrophysical properties, and other physical characteristics, were used to create a training dataset. A total of 101 reservoir simulations were then performed, and the residual trapping, solubility trapping, and cumulative CO2 injection were collected. The results indicate that the three machine learning (ML) models, ranked from high to low performance as GPR, SVM, and RF, can be used to predict CO2 trapping efficiency in deep saline formations. The GPR model showed excellent predictive performance, with the highest correlation factor (R2 = 0.992) and the lowest root mean square error (RMSE = 0.00491). The accuracy and stability of the GPR model were verified on an actual reservoir in offshore Vietnam, where the predictions agreed well with the simulated field trapping indices. These findings indicate that GPR models can complement numerical simulation as a robust predictive tool for estimating CO2 trapping performance in the subsurface.
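A minimal sketch of the GPR surrogate workflow described above, using scikit-learn. The 101 reservoir-simulation samples are not public, so a synthetic smooth target stands in for trapping efficiency, and the four feature columns are illustrative assumptions, not the paper's actual inputs:

```python
# GPR surrogate sketch: fit on simulated samples, score with R2 and RMSE.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# 101 samples, 4 hypothetical uncertainty variables (e.g. porosity, permeability, ...)
X = rng.uniform(0, 1, size=(101, 4))
y = 0.3 * X[:, 0] + 0.2 * np.sin(3 * X[:, 1]) + 0.1 * X[:, 2] * X[:, 3]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
gpr = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                               alpha=1e-6, normalize_y=True)
gpr.fit(X_tr, y_tr)

y_hat = gpr.predict(X_te)
r2 = r2_score(y_te, y_hat)
rmse = mean_squared_error(y_te, y_hat) ** 0.5
print(f"R2={r2:.3f}  RMSE={rmse:.5f}")
```

With a smooth noiseless target like this, the RBF-kernel GPR interpolates almost exactly, which mirrors why GPR suits small simulation-derived datasets.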

2021 ◽  
Author(s):  
Hung Vo-Thanh ◽  
Kang-Kun Lee

Abstract Carbon dioxide (CO2) storage in saline formations has been identified as a practical approach to reducing CO2 levels in the atmosphere. Residual and solubility trapping of CO2 in deep saline aquifers are essential mechanisms for enhancing the security of CO2 storage. In this research, CO2 residual and solubility trapping in saline formations were predicted by adapting three machine learning models: Random Forest (RF), extreme gradient boosting (XGboost), and Support Vector Regression (SVR). A diverse field-scale simulation database of 1509 data samples, retrieved from reliable studies, was used to train and test the proposed models. Graphical and statistical indicators were used to evaluate and compare the predictive performance of the ML models. The results rank the proposed ML models from high to low as follows: XGboost > RF > SVR. Additionally, the performance analyses revealed that the XGboost model predicts CO2 trapping efficiency in saline formations more accurately than previous ML models, yielding a very low root mean square error (RMSE) and a high R2 for both residual and solubility trapping efficiency. Finally, the applicability domain of the XGboost model was validated, and only 24 suspected data points were identified in the entire databank.
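The XGboost > RF > SVR comparison above can be mimicked with scikit-learn alone, using GradientBoostingRegressor as a stand-in for the xgboost package (same boosted-tree idea, different implementation). The 1509-sample simulation database is not public, so a synthetic target is used; nothing here reproduces the paper's actual scores:

```python
# Compare three regressor families on a synthetic stand-in dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(1509, 5))
y = X[:, 0] * X[:, 1] + 0.5 * np.cos(2 * X[:, 2]) + 0.1 * rng.normal(size=1509)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
scores = {}
for name, model in [("GBR", GradientBoostingRegressor(random_state=1)),
                    ("RF", RandomForestRegressor(random_state=1)),
                    ("SVR", SVR(C=10.0))]:
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

# Rank models by test-set R2, highest first
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```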


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Toktam Khatibi ◽  
Elham Hanifi ◽  
Mohammad Mehdi Sepehri ◽  
Leila Allahqoli

Abstract Background Stillbirth is defined by the WHO as fetal loss in pregnancy beyond 28 weeks. In this study, a machine-learning-based method is proposed to discriminate stillbirth from livebirth, to distinguish stillbirth before delivery from stillbirth during delivery, and to rank the features. Method A two-step stacked ensemble (SE) classifier is proposed: the first step classifies instances into stillbirth and livebirth, and the second classifies stillbirth before delivery versus stillbirth during labor. The proposed SE has two consecutive layers containing the same classifiers. The base classifiers in each layer are decision tree, gradient boosting, logistic regression, random forest, and support vector machines, trained independently and aggregated by vote boosting. Moreover, a new feature-ranking method based on mean decrease in accuracy, Gini index, and model coefficients is proposed to identify high-ranked features. Results The IMAN registry dataset is used, covering all births at or beyond the 28th gestational week from 2016/04/01 to 2017/01/01: 1,415,623 livebirth and 5502 stillbirth cases. A combination of maternal demographic features, clinical history, fetal properties, delivery descriptors, environmental features, healthcare-service-provider descriptors, and socio-demographic features is considered. The experimental results show that the proposed SE outperforms the compared classifiers, with an average accuracy of 90%, sensitivity of 91%, and specificity of 88%. The discrimination of the proposed SE is assessed, with an average AUC (±95% CI) of 90.51% (±1.08) on the training dataset used for model development and 90% (±1.12) on the test dataset used for external validation. The proposed SE is calibrated using an isotonic nonparametric calibration method with a score of 0.07. The process is repeated 10,000 times, with the AUC of SE classifiers trained on different random training datasets serving as the null distribution. The obtained p-value of 0.0126 shows the significance of the proposed SE. Conclusions Gestational age and fetal height are the two most important features for discriminating livebirth from stillbirth. Moreover, hospital, province, main delivery cause, perinatal abnormality, number of miscarriages, and maternal age are the most important features for classifying stillbirth before versus during delivery.
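The ensemble layer described above can be sketched with scikit-learn. This is a simplification: the paper's vote-boosting aggregation and two-step cascade are reduced here to a single soft-voting layer of the five named base classifiers, and the IMAN registry data is replaced by a synthetic imbalanced dataset:

```python
# Soft-voting ensemble of the five base classifiers on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# ~10% minority class, loosely mirroring the stillbirth/livebirth imbalance
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           class_sep=1.5, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

ensemble = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=5)),
                ("gb", GradientBoostingClassifier()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier()),
                ("svm", SVC(probability=True))],  # probabilities needed for soft voting
    voting="soft")
ensemble.fit(X_tr, y_tr)

bal = balanced_accuracy_score(y_te, ensemble.predict(X_te))
print("balanced accuracy:", round(bal, 3))
```

Balanced accuracy is reported because plain accuracy is misleading at a 250:1 class ratio like the one in the study.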


2020 ◽  
Author(s):  
Wanjun Zhao ◽  
Yong Zhang ◽  
Xinming Li ◽  
Yonghong Mao ◽  
Changwei Wu ◽  
...  

Abstract Background By extracting spectral features from urinary proteomics with an advanced mass spectrometer and machine learning algorithms, more accurate disease classification can be achieved. We attempted to establish a novel diagnostic model for kidney diseases by combining machine learning using an extreme gradient boosting (XGBoost) algorithm with complete mass spectrum information from urinary proteomics. Methods We enrolled 134 patients (with IgA nephropathy, membranous nephropathy, or diabetic kidney disease) and 68 healthy participants as controls, and applied a total of 610,102 mass spectra from their urinary proteomics, produced using high-resolution mass spectrometry, for training and validation of the diagnostic model. The mass spectrum data were divided into a training dataset (80%) and a validation dataset (20%). The training dataset was used to build diagnostic models with XGBoost, random forest (RF), support vector machine (SVM), and artificial neural networks (ANNs). Diagnostic accuracy was evaluated using a confusion matrix, and receiver operating characteristic (ROC), Lorenz, and gain curves were constructed to evaluate the models. Results Compared with RF, SVM, and ANNs, the modified XGBoost model, called the Kidney Disease Classifier (KDClassifier), showed the best performance. The accuracy of the diagnostic XGBoost model was 96.03% (CI = 95.17%-96.77%; Kappa = 0.943; McNemar's test, P = 0.00027), and its area under the curve was 0.952 (CI = 0.9307-0.9733). The Kolmogorov-Smirnov (KS) value of the Lorenz curve was 0.8514, and the Lorenz and gain curves showed the strong robustness of the developed model. Conclusions This study presents the first XGBoost diagnostic model, the KDClassifier, combining complete mass spectrum information from urinary proteomics to distinguish different kidney diseases. The KDClassifier achieves high accuracy and robustness, providing a potential tool for the classification of all types of kidney diseases.
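The KS value quoted above measures the maximum gap between the cumulative score distributions of the two classes, the same quantity read off a Lorenz curve. A minimal NumPy version on illustrative synthetic classifier scores (not the paper's data):

```python
# KS statistic: maximum separation between cumulative positive and
# negative score distributions of a binary classifier.
import numpy as np

def ks_statistic(scores, labels):
    """Max |TPR - FPR| over all score thresholds."""
    order = np.argsort(scores)[::-1]          # sort by descending score
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()    # cumulative positives captured
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    return float(np.max(np.abs(tpr - fpr)))

rng = np.random.default_rng(3)
pos = rng.normal(0.8, 0.1, 500)   # well-separated synthetic scores
neg = rng.normal(0.3, 0.1, 500)
all_scores = np.concatenate([pos, neg])
all_labels = np.concatenate([np.ones(500), np.zeros(500)])

ks = ks_statistic(all_scores, all_labels)
print("KS =", round(ks, 3))
```

A KS near 1 means the two score distributions barely overlap; the paper's 0.8514 indicates strong class separation.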


2020 ◽  
Vol 7 (7) ◽  
pp. 2103
Author(s):  
Yoshihisa Matsunaga ◽  
Ryoichi Nakamura

Background: Abdominal cavity irrigation is a less invasive alternative to gas insufflation in laparoscopic surgery. Minimally invasive surgery improves patients' quality of life but demands higher skills from surgeons. This study therefore aimed to reduce that burden by assisting and automating the hemostatic procedure, a highly frequent task, taking advantage of the clarity of endoscopic images in liquid and the ability to continuously observe bleeding points. We aimed to construct a method for detecting organs, bleeding sites, and hemostasis regions. Methods: We developed a method for real-time detection based on machine learning using laparoscopic videos. The training dataset was prepared from three experiments in pigs. A linear support vector machine was applied using new color-feature descriptors. Classifier accuracy was verified with five-fold cross-validation, and classification processing time was measured to verify real-time capability. Furthermore, we visualized the time-series class changes of the surgical field during the hemostatic procedure. Results: The accuracy of our classifier was 98.3%, and the processing time was short enough for real-time operation. Furthermore, the completion of the hemostatic procedure could be indicated quantitatively from changes in the bleeding region due to ablation and in the hemostasis regions due to tissue coagulation. Conclusions: Classification of organs, bleeding sites, and hemostasis regions was useful for assisting and automating the hemostatic procedure in liquid. Our method can be adapted to further hemostatic procedures.


2021 ◽  
Author(s):  
Myeong Gyu Kim ◽  
Jae Hyun Kim ◽  
Kyungim Kim

BACKGROUND Garlic-related misinformation is prevalent whenever a virus outbreak occurs. With the outbreak of coronavirus disease 2019 (COVID-19), garlic-related misinformation is again spreading through social media sites, including Twitter. Machine-learning-based approaches can be used to detect misinformation among vast numbers of tweets. OBJECTIVE This study aimed to develop machine learning algorithms for detecting misinformation about garlic and COVID-19 on Twitter. METHODS This study used 5,929 original tweets mentioning garlic and COVID-19. Tweets were manually labeled as misinformation, accurate information, or other. We tested the following algorithms: k-nearest neighbors; random forest; support vector machine (SVM) with linear, radial, and polynomial kernels; and neural network. Features for machine learning included user-based features (verified account, user type, number of followers, and follower rate) and text-based features (uniform resource locator, negation, sentiment score, Latent Dirichlet Allocation topic probability, number of retweets, and number of favorites). The model with the highest accuracy on the training dataset (70% of the overall dataset) was then evaluated on a test dataset (the remaining 30%). Predictive performance was measured using overall accuracy, sensitivity, specificity, and balanced accuracy. RESULTS The SVM with polynomial kernel showed the highest accuracy, 0.670. For misinformation, the model also showed a balanced accuracy of 0.757, sensitivity of 0.819, and specificity of 0.696. Important features for the misinformation and accurate-information classes included topic 4 (common myths), topic 13 (garlic-specific myths), number of followers, topic 11 (misinformation on social media), and follower rate. Topic 3 (cooking recipes) was the most important feature for the other class. CONCLUSIONS Our SVM model showed good performance in detecting misinformation. The results of our study will help detect misinformation related to garlic and COVID-19, and the approach could also be applied to preventing misinformation about dietary supplements in future outbreaks of diseases other than COVID-19.


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Patricio Wolff ◽  
Manuel Graña ◽  
Sebastián A. Ríos ◽  
Maria Begoña Yarza

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials. An all-cause admissions dataset was collected over six consecutive years at a pediatric hospital in Santiago, Chile. The variables collected are those used to determine the administrative cost of the child's treatment. Methods. Retrospective prediction of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model-building approaches after data curation and preprocessing for correction of class imbalance. We computed repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and training-set size. Results. The increase in recall due to SMOTE class-imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65), whereas the shallow multilayer perceptron has the best PPV and F-score (5.6 and 10.2, respectively). NB and support vector machines (SVM) give comparable results when ranked by AUC, PPV, and F-score across all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false-positive ratio. There is no detectable effect of the number of RCV folds on the predictive performance of the algorithms. Conclusions. We recommend Naive Bayes with a Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training-dataset sizes. The results show that the approach could be applied to detect preventable readmissions.
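SMOTE, the class-imbalance correction used above, synthesizes new minority samples by interpolating between a minority point and one of its minority-class nearest neighbours. A minimal NumPy sketch (the production implementation lives in the imbalanced-learn package; this toy version only illustrates the interpolation idea):

```python
# Minimal SMOTE: interpolate between minority samples and their neighbours.
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))
        # distances from sample j to every minority sample
        d = np.linalg.norm(X_min - X_min[j], axis=1)
        nn = np.argsort(d)[1:k + 1]            # k nearest neighbours, skip self
        target = X_min[rng.choice(nn)]
        # random point on the segment between the sample and its neighbour
        out[i] = X_min[j] + rng.random() * (target - X_min[j])
    return out

rng = np.random.default_rng(4)
minority = rng.normal(0, 1, size=(20, 3))      # 20 minority samples, 3 features
synthetic = smote(minority, n_new=40, rng=rng)
print(synthetic.shape)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority data already occupies.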


Author(s):  
Sunday Olakunle Idowu ◽  
Amos Akintayo Fatokun

Oxidative stress induced by excessive levels of reactive oxygen species (ROS) underlies several diseases. Therapeutic strategies to combat oxidative damage are, therefore, a subject of intense scientific investigation to prevent and treat such diseases, with the use of phytochemical antioxidants, especially polyphenols, being a major part. Polyphenols, however, exhibit structural diversity that determines different mechanisms of antioxidant action, such as hydrogen atom transfer (HAT) and single-electron transfer (SET). They also suffer from inadequate in vivo bioavailability, with their antioxidant bioactivity governed by permeability, gut-wall and first-pass metabolism, and HAT-based ROS trapping. Unfortunately, no current antioxidant assay captures these multiple dimensions to be sufficiently “biorelevant,” because the assays tend to be unidimensional, whereas biorelevance requires integration of several inputs. Finding a method to reliably evaluate the antioxidant capacity of these phytochemicals, therefore, remains an unmet need. To address this deficiency, we propose using artificial intelligence (AI)-based machine learning (ML) to relate a polyphenol’s antioxidant action as the output variable to molecular descriptors (factors governing in vivo antioxidant activity) as input variables, in the context of a biomarker selectively produced by lipid peroxidation (a consequence of oxidative stress), for example F2-isoprostanes. Support vector machines, artificial neural networks, and Bayesian probabilistic learning are some key algorithms that could be deployed. Such a model will represent a robust predictive tool in assessing biorelevant antioxidant capacity of polyphenols, and thus facilitate the identification or design of antioxidant molecules. The approach will also help to fulfill the principles of the 3Rs (replacement, reduction, and refinement) in using animals in biomedical research.


2020 ◽  
Vol 11 (40) ◽  
pp. 8-23
Author(s):  
Pius MARTHIN ◽  
Duygu İÇEN

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. With the wealth of information about users' satisfaction and experiences with a particular drug, pharmaceutical companies make use of online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models that facilitate decision making in various fields. In this manuscript we applied the drug-review dataset used by Gräßer, Kallumadi, Malberg, and Zaunseder (2018), freely available from the University of California Irvine (UCI) machine learning repository, to identify the machine learning model that best predicts overall drug performance from users' reviews. Apart from several manipulations to improve model accuracy, all procedures required for text analysis were followed, including text cleaning and transformation of texts to numeric format for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews, and customers' reviews were summarized and visualized using a bar plot and a word cloud to explore the most frequent terms. Due to scalability issues, we used only a sample of the dataset: 15,000 observations randomly sampled from the 161,297-observation training dataset and 10,000 randomly sampled from the 53,766-observation testing dataset. Several machine learning models were trained using 10-fold cross-validation under stratified random sampling. The trained models include Classification and Regression Trees (CART), a C5.0 classification tree, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), support vector machines (SVM) with both radial and linear kernels, and a random forest classifier. Model selection was based on a comparison of accuracy and computational efficiency. The SVM with linear kernel was clearly best, with an accuracy of 83%. Using only a small portion of the dataset, we attained reasonable accuracy by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
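The TF-IDF + LSA feature pipeline described above can be sketched with scikit-learn, where LSA is a truncated SVD of the TF-IDF matrix. The toy corpus below is illustrative, not the UCI drug-review text:

```python
# TF-IDF weighting followed by LSA (truncated SVD) for dense text features.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this drug worked great no side effects",
        "terrible side effects stopped taking it",
        "worked great highly recommend this drug",
        "no effect at all waste of money"]

tfidf = TfidfVectorizer().fit_transform(docs)       # sparse term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=5)  # LSA = low-rank SVD of TF-IDF
X = lsa.fit_transform(tfidf)
print(X.shape)   # one dense 2-dimensional feature vector per document
```

The resulting dense low-rank matrix is what a downstream SVM would train on; reducing dimensionality this way is what made the sampled-dataset approach tractable.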


2020 ◽  
Author(s):  
Akshay Kumar ◽  
Farhan Mohammad Khan ◽  
Rajiv Gupta ◽  
Harish Puppala

Abstract The outbreak of COVID-19 was first identified in China, later spread to various parts of the globe, and was declared a pandemic by the World Health Organization (WHO). The transmissible person-to-person pneumonia caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has sparked a global alarm. Thermal screening, quarantining, and later lockdowns were methods employed by various nations to contain the spread of the virus. Although exercising such containment plans helps mitigate the effect of COVID-19, projecting the rise in cases and preparing to face the crisis would help minimize it. This study therefore uses machine learning tools to forecast the possible rise in the number of cases from data on daily new cases. To capture the uncertainty, three different techniques are used to project the data and capture the possible deviation: (i) a decision tree algorithm, (ii) a support vector machine algorithm, and (iii) Gaussian process regression. Based on these projections, new cases, recovered cases, deceased cases, medical facilities, population density, number of tests conducted, and service facilities are used to define a criticality index (CI). The CI is used to classify all districts of the country as high-, moderate-, or low-risk regions. An online dashboard is created that updates the data daily for the next four weeks. The prospective suggestions of this study would aid in planning lockdown or other containment strategies for any country, which can use other parameters to define its CI.


2021 ◽  
Vol 15 (58) ◽  
pp. 308-318
Author(s):  
Tran-Hieu Nguyen ◽  
Anh-Tuan Vu

In this paper, a machine learning-based framework is developed to quickly evaluate the structural safety of trusses. Three numerical examples of a 10-bar truss, a 25-bar truss, and a 47-bar truss are used to illustrate the proposed framework. Firstly, several truss cases with different cross-sectional areas are generated using the Latin Hypercube Sampling method. Stresses in truss members and displacements of nodes are determined through finite element analyses, and the obtained values are compared with design constraints. According to this constraint verification, the safety state is labeled safe or unsafe. The members' sectional areas and the safety state are stored as the inputs and outputs of the training dataset, respectively. Three popular machine learning classifiers, Support Vector Machine, Deep Neural Network, and Adaptive Boosting, are used to evaluate the safety of the structures. The comparison is conducted on two metrics: accuracy and the area under the ROC curve. For the first two examples, all three classifiers achieve more than 90% accuracy. For the 47-bar truss, the accuracies of the Support Vector Machine and Deep Neural Network models fall below 70%, but the Adaptive Boosting model retains a high accuracy of approximately 98%. In terms of the area under the ROC curve, the comparative results are similar. Overall, the Adaptive Boosting model outperforms the remaining models. In addition, an investigation is carried out to show the influence of its parameters on the performance of the Adaptive Boosting model.
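Latin Hypercube Sampling, used above to generate the truss training cases, divides each variable's range into n equal strata and draws exactly one sample per stratum, then shuffles strata independently across dimensions. A minimal NumPy version (the 100 x 10 shape below is illustrative, not the paper's actual case counts):

```python
# Latin Hypercube Sampling over the unit hypercube [0, 1)^d.
import numpy as np

def latin_hypercube(n, d, rng=None):
    """n samples in d dimensions; each dimension's n strata hit exactly once."""
    if rng is None:
        rng = np.random.default_rng(0)
    # one uniform point inside each of the n strata, per dimension
    samples = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    for j in range(d):                 # decorrelate dimensions
        rng.shuffle(samples[:, j])     # shuffle strata order in place
    return samples

areas = latin_hypercube(100, 10)       # e.g. 100 truss cases x 10 bar areas
print(areas.shape)
```

In practice each column would then be rescaled to the allowable cross-sectional-area range before running the finite element analyses.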

