Cubical homology-based Image Classification - A Comparative Study

2021
Author(s): Seungho Choe

Persistent homology is a powerful tool in topological data analysis (TDA) used to efficiently compute, study, and encode multi-scale topological features, and it is increasingly used in digital image classification. The topological features represent the number of connected components, cycles, and voids that describe the shape of the data. Persistent homology extracts the birth and death of these topological features through a filtration process. The lifespans of these features can be represented using persistence diagrams (topological signatures). Cubical homology is a more efficient method for extracting topological features from a 2D image: it computes the homology from a collection of cubes, which fits the grid structure of digital images. In this research, we propose a cubical homology-based algorithm for extracting topological features from 2D images to generate their topological signatures. Additionally, we propose a score that measures the significance of each sub-simplex in terms of persistence. The gray-level co-occurrence matrix (GLCM) and contrast-limited adaptive histogram equalization (CLAHE) are used as supplementary feature-extraction methods. Machine learning techniques are then employed to classify images using the topological signatures. Across the eight tested algorithms and six published image datasets with varying pixel sizes, classes, and distributions, our experiments demonstrate that cubical homology-based machine learning with a deep residual network (ResNet 1D) and the Light Gradient Boosting Machine (lightGBM) shows promise with the extracted topological features.
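To make the extraction step concrete: below is a minimal sketch of computing a topological signature from a 2D image via a cubical complex, assuming the GUDHI Python library. The image array is a random placeholder, not one of the paper's datasets.

```python
import numpy as np
import gudhi

# Placeholder grayscale image; pixel intensities define the filtration values
image = np.random.rand(28, 28)

# Build a cubical complex whose top-dimensional cells are the pixels,
# matching the grid structure of a digital image
cubical = gudhi.CubicalComplex(top_dimensional_cells=image)

# Persistent homology: each entry is (dimension, (birth, death))
diagram = cubical.persistence()

# Birth/death pairs for connected components (H0) and cycles (H1),
# which together form the image's topological signature
h0 = cubical.persistence_intervals_in_dimension(0)
h1 = cubical.persistence_intervals_in_dimension(1)
```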

Author(s): Kyle Haas

Abstract Machine learning techniques are powerful predictive tools that continue to gain prominence with increasingly available computational power. Engineering design professionals often view machine learning tools through a skeptical lens due to their perceived detachment from the underlying physics. Machine learning tools, such as artificial neural networks and regression models, are fueled by training data obtained either analytically or through physical collection; the use of such surrogate models introduces an additional source of uncertainty. Sources of uncertainty associated with machine learning models can originate from the collected data or from the training process itself. Validation and verification methods are especially important for machine learning applications due to their perceived disconnect from the underlying physics, their sensitivity to data-accumulation uncertainties, and the potential for under- or over-training the model itself. Despite these potential pitfalls, sufficient testing of machine learning models against segregated testing data and the use of regularization tools to diagnose overfitting are not always employed by industry practitioners. This paper illustrates the use of topological data analysis (TDA), specifically persistent homology, a subset of algebraic topology, as an alternative means of generalizing a predictive manifold produced by a machine learning model. Persistent homology is used to seek out and identify the most meaningful and connected components within the data that form the predicted manifold, with less connected components treated as noise and disregarded. The uncertainties associated with overfitting can thereby be limited. The proposed method is demonstrated through its application to a simple single-degree-of-freedom structural system to show its effectiveness in generalizing the resulting manifold and limiting the associated uncertainty.
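The core idea, discarding short-lived topological features as noise, can be sketched as a persistence-lifespan threshold. The snippet below assumes the GUDHI library, with random points standing in for samples of the predicted manifold and an illustrative threshold value.

```python
import numpy as np
import gudhi

# Synthetic points standing in for samples of the predicted response manifold
points = np.random.rand(200, 2)

# Vietoris-Rips persistence over the sampled manifold
rips = gudhi.RipsComplex(points=points, max_edge_length=0.5)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
diagram = simplex_tree.persistence()

# Treat short-lived features as noise: keep only those whose
# lifespan (death - birth) exceeds an assumed noise threshold
threshold = 0.05
significant = [(dim, (birth, death)) for dim, (birth, death) in diagram
               if death - birth > threshold]
```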


2021
Vol 11 (1)
Author(s): Scott Broderick, Ruhil Dongol, Tianmu Zhang, Krishna Rajan

Abstract This paper introduces the use of topological data analysis (TDA) as an unsupervised machine learning tool to uncover classification criteria in complex inorganic crystal chemistries. Using apatite chemistry as a template, we use persistent homology to track how the topological connectivity of input crystal chemistry descriptors defines similarity between different stoichiometries of apatites. It is shown that TDA automatically identifies a hierarchical classification scheme within apatites based on the commonality of the number of discrete coordination polyhedra that constitute the structural building units common among the compounds. This information is presented as a visualization scheme, a barcode of homology classifications in which the persistence of similarity between compounds is tracked. Unlike traditional structure maps, this new "Materials Barcode" schema serves as an automated exploratory machine learning tool that can uncover structural associations from crystal chemistry databases and achieve a more nuanced insight into what defines similarity among homologous compounds.
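Producing a barcode itself is routine once compound descriptors are assembled. A minimal sketch assuming GUDHI and matplotlib, with a random descriptor matrix in place of the actual crystal chemistry inputs; the filtration here is a plain Vietoris-Rips over descriptor space, one simple way to track when compounds merge as the similarity scale grows:

```python
import numpy as np
import gudhi
import matplotlib.pyplot as plt

# Placeholder descriptor vectors, one row per compound (not real apatite data)
descriptors = np.random.rand(30, 5)

# Persistence over the descriptor space tracks how compounds merge
# into clusters as the similarity scale grows
rips = gudhi.RipsComplex(points=descriptors, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=1)
diagram = st.persistence()

# Each bar is a homology class; long bars mark persistent similarity
gudhi.plot_persistence_barcode(diagram)
plt.show()
```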


2021
Vol 21 (1)
Author(s): Jong Ho Kim, Haewon Kim, Ji Su Jang, Sung Mi Hwang, So Young Lim, et al.

Abstract Background Predicting a difficult airway is challenging in patients with limited airway evaluation. The aim of this study is to develop and validate a model that predicts difficult laryngoscopy by machine learning, using neck circumference and thyromental height as predictors that can be used even for patients with limited airway evaluation. Methods Variables for predicting difficult laryngoscopy included age, sex, height, weight, body mass index, neck circumference, and thyromental distance. Difficult laryngoscopy was defined as Grade 3 or 4 in the Cormack-Lehane classification. The preanesthesia and anesthesia data of 1677 patients who had undergone general anesthesia at a single center were collected. The data set was randomly stratified into a training set (80%) and a test set (20%), with an equal distribution of difficult laryngoscopy. The training data were used to train five algorithms (logistic regression, multilayer perceptron, random forest, extreme gradient boosting, and light gradient boosting machine), and the prediction models were validated on the test set. Results The model using random forest performed best (area under the receiver operating characteristic curve = 0.79 [95% confidence interval: 0.72–0.86]; area under the precision-recall curve = 0.32 [95% confidence interval: 0.27–0.37]). Conclusions Machine learning can predict difficult laryngoscopy from a combination of several predictors, including neck circumference and thyromental height. The performance of the model may be improved with more data, new variables, and combinations of models.
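A minimal sketch of the split-train-evaluate pipeline described above, using scikit-learn; the patient data are synthetic placeholders and the random forest uses default hyperparameters rather than the study's settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Placeholder data: rows are patients, columns are the seven predictors
X = np.random.rand(1677, 7)
y = np.random.binomial(1, 0.1, size=1677)  # 1 = difficult laryngoscopy

# 80/20 split, stratified so both sets keep the same outcome ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]

print("AUROC:", roc_auc_score(y_test, prob))
print("AUPRC:", average_precision_score(y_test, prob))
```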


2021
Author(s): Abdul Muqtadir Khan

Abstract With the advancement of machine learning (ML) applications, recent research has been conducted to optimize fracturing treatments. A variety of models are available, using various objective functions and different mathematical techniques for optimization. There is a need to extend ML techniques to optimize the choice of algorithm itself; for fracturing treatment design, the literature on comparative algorithm performance is sparse. The literature predominantly shows that, compared to the most commonly used regressors and classifiers, some form of boosting technique consistently outperforms in model testing and prediction accuracy. A database was constructed for a heterogeneous reservoir, and four widely used boosting algorithms were applied to it to predict the design solely from the output of a short injection/falloff test. Feature importance analysis was performed on eight output parameters from the falloff analysis, and six were selected for model construction. The outputs selected for prediction were fracturing fluid efficiency, proppant mass, maximum proppant concentration, and injection rate. Extreme gradient boosting (XGBoost), categorical boosting (CatBoost), adaptive boosting (AdaBoost), and the light gradient boosting machine (LGBM) were the algorithms chosen for the comparative study. A sensitivity analysis was performed for different numbers of classes (four, five, and six) to establish a balance between accuracy and prediction granularity. The results showed that the best algorithm choice lay between XGBoost and CatBoost for the predicted parameters under certain model construction conditions. The accuracy for all outputs on the holdout sets varied between 80 and 92%, showing robust significance for wider utilization of these models. Data science has contributed to various oil and gas industry domains and has tremendous applications in the stimulation domain. The research and review conducted in this paper add a valuable resource for users to build digital databases and choose the appropriate algorithm without much trial and error. Implementing this model reduced the complexity of the proppant fracturing treatment redesign process, enhanced operational efficiency, and reduced fracture damage by eliminating minifrac steps with crosslinked gel.
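The comparative setup reduces to a loop over four boosting classifiers. A sketch assuming the xgboost, catboost, lightgbm, and scikit-learn packages, with synthetic stand-ins for the six falloff-test features and the binned design classes:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Synthetic stand-ins: six falloff-test features, five design classes
X = np.random.rand(1000, 6)
y = np.random.randint(0, 5, size=1000)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "XGBoost": XGBClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "AdaBoost": AdaBoostClassifier(),
    "LGBM": LGBMClassifier(),
}

# Holdout accuracy for each algorithm, mirroring the comparative study
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_hold, model.predict(X_hold))
    print(f"{name}: {acc:.2f}")
```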


2021
Author(s): Nicolai Ree, Andreas H. Göller, Jan H. Jensen

We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with the ensemble decision tree variant light gradient boosting machine (LightGBM). The model is trained and tested on 21,201 bromination reactions with 101K reaction centers, split into training, test, and out-of-sample datasets with 58K, 15K, and 27K reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. (React. Chem. Eng. 2020, 5, 896), though the comparison is complicated by the possibility that some of the test and out-of-sample molecules were used to train WLN. RegioML outperforms our physics-based RegioSQM20 method (J. Cheminform. 2021, 13:10), whose precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions with little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by an ML model trained on RegioSQM20-generated data.
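At its core the approach is binary classification of reaction centers from charge-derived descriptors. A minimal sketch assuming lightgbm and scikit-learn, with random values in place of the CM5/GFN1-xTB features:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Random stand-ins for CM5-charge-derived descriptors of reaction centers;
# label 1 marks a center where bromination is observed
X = np.random.rand(5000, 12)
y = np.random.binomial(1, 0.3, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LGBMClassifier().fit(X_train, y_train)
pred = model.predict(X_test)

# Accuracy and precision, the two metrics reported above
print("accuracy:", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
```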


2021
Author(s): Seong Hwan Kim, Eun-Tae Jeon, Sungwook Yu, Kyungmi O, Chi Kyung Kim, et al.

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726–0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among the ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features' effects on the predictive power of the model.
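A sketch of the interpretation step, assuming the shap package alongside lightgbm; the clinical features and outcome labels here are fabricated for illustration:

```python
import numpy as np
import shap
from lightgbm import LGBMClassifier

# Fabricated data: rows are patients, columns are clinical predictors
X = np.random.rand(500, 8)
y = np.random.binomial(1, 0.14, size=500)  # 1 = early neurological deterioration

model = LGBMClassifier().fit(X, y)

# TreeExplainer computes SHAP values for tree ensembles such as LightGBM
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global feature importance plus per-patient attributions
shap.summary_plot(shap_values, X)
```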


2021
Vol 3 (1)
Author(s): B. A. Omodunbi

Diabetes mellitus is a health disorder that occurs when the blood sugar level becomes extremely high due to the body's resistance in producing the required amount of insulin. The ailment is among the major causes of death in Nigeria and the world at large. This study was carried out to detect diabetes mellitus by developing a hybrid model that comprises two machine learning models, namely the Light Gradient Boosting Machine (LGBM) and K-Nearest Neighbor (KNN). This research is aimed at developing a machine learning model for detecting the occurrence of diabetes in patients. The performance metrics employed in evaluating the findings of this study are the Receiver Operating Characteristic (ROC) curve, five-fold cross-validation, precision, and accuracy score. The proposed system had an accuracy of 91%, and the area under the ROC curve was 93%. The experimental results show that the prediction accuracy of the hybrid model is better than that of traditional machine learning models.
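The abstract does not spell out how the two models are combined, so the snippet below shows one plausible reading: a soft-voting ensemble of LGBM and KNN evaluated with five-fold cross-validation, using scikit-learn and lightgbm on placeholder data.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: rows are patients, columns are clinical features
X = np.random.rand(768, 8)
y = np.random.binomial(1, 0.35, size=768)  # 1 = diabetic

# Soft-voting hybrid of LGBM and KNN (an assumed combination scheme)
hybrid = VotingClassifier(
    estimators=[("lgbm", LGBMClassifier()), ("knn", KNeighborsClassifier())],
    voting="soft")

# Five-fold cross-validated accuracy, as in the evaluation described above
scores = cross_val_score(hybrid, X, y, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean())
```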


2020
Vol 25 (3), pp. 58
Author(s): Minh Nguyen, Mehmet Aktas, Esra Akbas

The growth of social media in recent years has contributed to an ever-increasing network of user data in every aspect of life. This volume of generated data is becoming a vital asset for companies and organizations as a powerful tool to gain insights and make crucial decisions. However, the data is not always reliable, primarily because it can be manipulated and disseminated from unreliable sources. In the field of social network analysis, this problem can be tackled by implementing machine learning models that learn to distinguish between humans and bots, which are mostly harmful computer programs exploited to shape public opinion and circulate false information on social media. In this paper, we propose a novel topological feature extraction method for bot detection on social networks. We first create a weighted ego network for each user. We then encode the higher-order topological features of the ego networks using persistent homology. Finally, we use these extracted features to train a machine learning model that classifies users as bots or humans. Our experimental results suggest that using the higher-order topological features derived from persistent homology is promising for bot detection and more effective than using classical graph-theoretic structural features.
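The feature-extraction pipeline can be sketched with networkx and GUDHI: build a user's weighted ego network, turn tie strengths into distances, and compute persistence. The graph, weights, and distance convention below are illustrative assumptions, not the paper's exact construction.

```python
import networkx as nx
import numpy as np
import gudhi

# Illustrative social graph with random interaction weights
G = nx.erdos_renyi_graph(50, 0.1, seed=0)
for u, v in G.edges:
    G[u][v]["weight"] = np.random.rand()

# Weighted ego network of one user (radius-1 neighborhood)
ego = nx.ego_graph(G, 0, radius=1)

# Distance matrix: stronger ties = closer nodes (assumed convention);
# non-edges are pushed beyond max_edge_length so they never appear
nodes = list(ego.nodes)
idx = {v: i for i, v in enumerate(nodes)}
n = len(nodes)
dist = np.full((n, n), 2.0)
np.fill_diagonal(dist, 0.0)
for u, v, data in ego.edges(data=True):
    i, j = idx[u], idx[v]
    dist[i, j] = dist[j, i] = 1.0 - data["weight"]

# Persistent homology of the ego network as a topological feature vector
rips = gudhi.RipsComplex(distance_matrix=dist, max_edge_length=1.0)
st = rips.create_simplex_tree(max_dimension=2)
diagram = st.persistence()
lifespans = [death - birth for _, (birth, death) in diagram
             if death != float("inf")]
```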


2021
Vol 11 (1)
Author(s): Seyed Ali Madani, Mohammad-Reza Mohammadi, Saeid Atashrouz, Ali Abedi, Abdolhossein Hemmati-Sarapardeh, et al.

Abstract Accurate prediction of the solubility of gases in hydrocarbons is a crucial factor in designing enhanced oil recovery (EOR) operations by gas injection, as well as separation and chemical reaction processes in a petroleum refinery. In this work, nitrogen (N2) solubility in normal alkanes, the major constituents of crude oil, was modeled using five representative machine learning (ML) models, namely gradient boosting with categorical features support (CatBoost), random forest, light gradient boosting machine (LightGBM), k-nearest neighbors (k-NN), and extreme gradient boosting (XGBoost). A large solubility databank containing 1982 data points was utilized to establish the models for predicting N2 solubility in normal alkanes as a function of pressure, temperature, and molecular weight of the normal alkanes over broad ranges of operating pressure (0.0212–69.12 MPa) and temperature (91–703 K). The molecular weight of the normal alkanes ranged from 16 to 507 g/mol. Also, five equations of state (EOSs), including Redlich–Kwong (RK), Soave–Redlich–Kwong (SRK), Zudkevitch–Joffe (ZJ), Peng–Robinson (PR), and perturbed-chain statistical associating fluid theory (PC-SAFT), were used alongside the ML models to estimate N2 solubility in normal alkanes. Results revealed that the CatBoost model is the most precise model in this work, with a root mean square error of 0.0147 and a coefficient of determination of 0.9943. The ZJ EOS provided the best estimates of N2 solubility in normal alkanes among the EOSs. Lastly, the results of a relevancy factor analysis indicated that pressure has the greatest influence on N2 solubility in normal alkanes and that N2 solubility increases with increasing molecular weight of the normal alkanes.
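A minimal regression sketch in the spirit of the best-performing model, assuming the catboost package; the three inputs mimic the predictors (pressure, temperature, molecular weight) with synthetic values rather than the actual solubility databank:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for pressure [MPa], temperature [K], and
# molecular weight [g/mol], drawn over the ranges quoted above
X = np.column_stack([
    np.random.uniform(0.02, 69.1, 1982),
    np.random.uniform(91, 703, 1982),
    np.random.uniform(16, 507, 1982),
])
y = np.random.rand(1982)  # stand-in for N2 solubility

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = CatBoostRegressor(verbose=0).fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE and R^2, the two metrics reported for the CatBoost model above
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2:", r2_score(y_test, pred))
```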


2021
Author(s): Yuki Kataoka, Yuya Kimura, Tatsuyoshi Ikenoue, Yoshinori Matsuoka, Junji Kumasawa, et al.

Abstract Background We developed and validated a machine learning diagnostic model for novel coronavirus disease (COVID-19), integrating artificial-intelligence-based computed tomography (CT) imaging and clinical features. Methods We conducted a retrospective cohort study in 11 Japanese tertiary care facilities that treated COVID-19 patients. Participants were tested using both real-time reverse transcription polymerase chain reaction (RT-PCR) and chest CT between January 1 and May 30, 2020. We chronologically split the dataset in each hospital into training and test sets containing patients in a 7:3 ratio. A Light Gradient Boosting Machine model was used for the analysis. Results A total of 703 patients were included, and two models, the full model and the A-blood model, were developed for their diagnosis. The A-blood model included eight variables (the Ali-M3 confidence, along with seven clinical features from blood counts and biochemistry markers). The areas under the receiver operating characteristic curve of both models (0.91; 95% confidence interval (CI), 0.86 to 0.95 for the full model and 0.90; 95% CI, 0.86 to 0.94 for the A-blood model) were better than that of the Ali-M3 confidence alone (0.78; 95% CI, 0.71 to 0.83) in the test set. Conclusions The A-blood model, a COVID-19 diagnostic model developed in this study, combines machine learning and CT evaluation with blood test data and performs better than the existing Ali-M3 framework. This could significantly aid physicians in making a quicker diagnosis of COVID-19.
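The chronological 7:3 split is the detail most easily gotten wrong, since it must respect admission order rather than shuffle patients. A sketch with pandas and lightgbm, using fabricated columns in place of the study's variables:

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

# Fabricated patient records with an admission date (not the study's data)
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=703, freq="3h"),
    "ali_m3_confidence": np.random.rand(703),
    "wbc": np.random.rand(703),
    "crp": np.random.rand(703),
    "covid_pcr_positive": np.random.binomial(1, 0.3, 703),
})

# Chronological 7:3 split: earlier patients train, later patients test
df = df.sort_values("date")
cut = int(len(df) * 0.7)
features = ["ali_m3_confidence", "wbc", "crp"]
train, test = df.iloc[:cut], df.iloc[cut:]

model = LGBMClassifier().fit(train[features], train["covid_pcr_positive"])
prob = model.predict_proba(test[features])[:, 1]
print("AUROC:", roc_auc_score(test["covid_pcr_positive"], prob))
```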

