A Bibliometric Analysis and Benchmark of Machine Learning and AutoML in Crash Severity Prediction: The Case Study of Three Colombian Cities

Traffic accidents are of worldwide concern, as they are one of the leading causes of death globally. One policy designed to cope with them is the design and deployment of road safety systems. These aim to predict crashes based on historical records, provided by new Internet of Things (IoT) technologies, to enhance traffic flow management and promote safer roads. Increasing data availability has helped machine learning (ML) to address the prediction of crashes and their severity. The literature reports numerous contributions regarding survey papers, experimental comparisons of various techniques, and the design of new methods at the point where crash severity prediction (CSP) and ML converge. Despite such progress, and as far as we know, there are no comprehensive research articles that theoretically and practically approach the model selection problem (MSP) in CSP. Thus, this paper introduces a bibliometric analysis and experimental benchmark of ML and automated machine learning (AutoML) as a suitable approach to automatically address the MSP in CSP. Firstly, 2318 bibliographic references were consulted to identify relevant authors, trending topics, keywords evolution, and the most common ML methods used in related-case studies, which revealed an opportunity for the use AutoML in the transportation field. Then, we compared AutoML (AutoGluon, Auto-sklearn, TPOT) and ML (CatBoost, Decision Tree, Extra Trees, Gradient Boosting, Gaussian Naive Bayes, Light Gradient Boosting Machine, Random Forest) methods in three case studies using open data portals belonging to the cities of Medellín, Bogotá, and Bucaramanga in Colombia. Our experimentation reveals that AutoGluon and CatBoost are competitive and robust ML approaches to deal with various CSP problems. In addition, we concluded that general-purpose AutoML effectively supports the MSP in CSP without developing domain-focused AutoML methods for this supervised learning problem. Finally, based on the results obtained, we introduce challenges and research opportunities that the community should explore to enhance the contributions that ML and AutoML can bring to CSP and other transportation areas.

Download Full-text

Development and validation of a difficult laryngoscopy prediction model using machine learning of neck circumference and thyromental height

BMC Anesthesiology ◽

10.1186/s12871-021-01343-4 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jong Ho Kim ◽

Haewon Kim ◽

Ji Su Jang ◽

Sung Mi Hwang ◽

So Young Lim ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Confidence Interval ◽

Neck Circumference ◽

Difficult Laryngoscopy ◽

Gradient Boosting ◽

Test Set ◽

Equal Distribution ◽

Light Gradient ◽

Extreme Gradient Boosting

Abstract Background Predicting difficult airway is challengeable in patients with limited airway evaluation. The aim of this study is to develop and validate a model that predicts difficult laryngoscopy by machine learning of neck circumference and thyromental height as predictors that can be used even for patients with limited airway evaluation. Methods Variables for prediction of difficulty laryngoscopy included age, sex, height, weight, body mass index, neck circumference, and thyromental distance. Difficult laryngoscopy was defined as Grade 3 and 4 by the Cormack-Lehane classification. The preanesthesia and anesthesia data of 1677 patients who had undergone general anesthesia at a single center were collected. The data set was randomly stratified into a training set (80%) and a test set (20%), with equal distribution of difficulty laryngoscopy. The training data sets were trained with five algorithms (logistic regression, multilayer perceptron, random forest, extreme gradient boosting, and light gradient boosting machine). The prediction models were validated through a test set. Results The model’s performance using random forest was best (area under receiver operating characteristic curve = 0.79 [95% confidence interval: 0.72–0.86], area under precision-recall curve = 0.32 [95% confidence interval: 0.27–0.37]). Conclusions Machine learning can predict difficult laryngoscopy through a combination of several predictors including neck circumference and thyromental height. The performance of the model can be improved with more data, a new variable and combination of models.

Download Full-text

Airplane Crash Severity Prediction Using Machine Learning

10.1109/icccnt51525.2021.9579711 ◽

2021 ◽

Author(s):

Jay Mehta ◽

Vaidehi Vatsaraj ◽

Jinal Shah ◽

Anand Godbole

Keyword(s):

Machine Learning ◽

Crash Severity ◽

Airplane Crash ◽

Severity Prediction

Download Full-text

Boosting Algorithm Choice in Predictive Machine Learning Models for Fracturing Applications

10.2118/205642-ms ◽

2021 ◽

Author(s):

Abdul Muqtadir Khan

Keyword(s):

Machine Learning ◽

Data Science ◽

Oil And Gas ◽

Oil And Gas Industry ◽

Injection Rate ◽

Model Construction ◽

Gradient Boosting ◽

Light Gradient ◽

Fracture Damage ◽

Boosting Technique

Abstract With the advancement in machine learning (ML) applications, some recent research has been conducted to optimize fracturing treatments. There are a variety of models available using various objective functions for optimization and different mathematical techniques. There is a need to extend the ML techniques to optimize the choice of algorithm. For fracturing treatment design, the literature for comparative algorithm performance is sparse. The research predominantly shows that compared to the most commonly used regressors and classifiers, some sort of boosting technique consistently outperforms on model testing and prediction accuracy. A database was constructed for a heterogeneous reservoir. Four widely used boosting algorithms were used on the database to predict the design only from the output of a short injection/falloff test. Feature importance analysis was done on eight output parameters from the falloff analysis, and six were finalized for the model construction. The outputs selected for prediction were fracturing fluid efficiency, proppant mass, maximum proppant concentration, and injection rate. Extreme gradient boost (XGBoost), categorical boost (CatBoost), adaptive boost (AdaBoost), and light gradient boosting machine (LGBM) were the algorithms finalized for the comparative study. The sensitivity was done for a different number of classes (four, five, and six) to establish a balance between accuracy and prediction granularity. The results showed that the best algorithm choice was between XGBoost and CatBoost for the predicted parameters under certain model construction conditions. The accuracy for all outputs for the holdout sets varied between 80 and 92%, showing robust significance for a wider utilization of these models. Data science has contributed to various oil and gas industry domains and has tremendous applications in the stimulation domain. The research and review conducted in this paper add a valuable resource for the user to build digital databases and use the appropriate algorithm without much trial and error. Implementing this model reduced the complexity of the proppant fracturing treatment redesign process, enhanced operational efficiency, and reduced fracture damage by eliminating minifrac steps with crosslinked gel.

Download Full-text

RegioML: Predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning

10.33774/chemrxiv-2021-l2fvl ◽

2021 ◽

Author(s):

Nicolai Ree ◽

Andreas H. Göller ◽

Jan H. Jensen

Keyword(s):

Machine Learning ◽

Tight Binding ◽

Reaction Centers ◽

Gradient Boosting ◽

Electrophilic Aromatic Substitution ◽

Aromatic Substitution ◽

Substitution Reactions ◽

Test Set ◽

Light Gradient ◽

Out Of Sample

We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with the ensemble decision tree variant light gradient boosting machine (LightGBM). The model is trained and tested on 21,201 bromination reactions with 101K reaction centers, which is split into a training, test, and out-of-sample datasets with 58K, 15K, and 27K reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to the graph-based WLN method developed by Struble et al. (React. Chem. Eng. 2020, 5, 896) though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML out-performs our physics-based RegioSQM20 method (J. Cheminform. 2021, 13:10) where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions where there is little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by a ML-model trained on RegioSQM20-generated data.

Download Full-text

Exploring the Mechanism of Crashes with Autonomous Vehicles Using Machine Learning

Mathematical Problems in Engineering ◽

10.1155/2021/5524356 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Hengrui Chen ◽

Hong Chen ◽

Ruiyu Zhou ◽

Zhizhen Liu ◽

Xiaoke Sun

Keyword(s):

Machine Learning ◽

Autonomous Vehicles ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Crash Severity ◽

Apriori Algorithm ◽

Driving Mode ◽

Extreme Gradient Boosting ◽

The Impact

The safety issue has become a critical obstacle that cannot be ignored in the marketization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the causal relationship between multiple factors to explore the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze the crash severity. Besides, we apply the Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and Apriori algorithm effectively provided meaningful insights about AV-involved crash characteristics and their relationship. Among all these features, vehicle damage, weather conditions, accident location, and driving mode are the most critical features. We found that most rear-end crashes are conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Besides, drivers should be careful when driving near intersections, especially in the autonomous driving mode.

Download Full-text

Interpretable Machine Learning for Early Neurological Deterioration Prediction in Atrial Fibrillation-Related Stroke

10.21203/rs.3.rs-446890/v1 ◽

2021 ◽

Author(s):

Seong Hwan Kim ◽

Eun-Tae Jeon ◽

Sungwook Yu ◽

Kyungmi O ◽

Chi Kyung Kim ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Neurological Deterioration ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting ◽

Early Neurological Deterioration ◽

Feature Importance

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778, 95% CI, 0.726 - 0.830). The feature importance analysis revealed that fasting glucose level and the National Institute of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.

Download Full-text

Development of a Diabetes Melitus Detection and Prediction Model Using Light Gradient Boosting Machine and K-Nearest Neighbour

10.36108/ujees/1202.30.0160 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

B. A Omodunbi

Keyword(s):

Diabetes Mellitus ◽

Machine Learning ◽

Hybrid Model ◽

Learning Model ◽

Experimental Result ◽

Gradient Boosting ◽

Light Gradient ◽

Machine Learning Model ◽

Gradient Boosting Machine ◽

Receiver Operating

Diabetes mellitus is a health disorder that occurs when the blood sugar level becomes extremely high due to body resistance in producing the required amount of insulin. The aliment happens to be among the major causes of death in Nigeria and the world at large. This study was carried out to detect diabetes mellitus by developing a hybrid model that comprises of two machine learning model namely Light Gradient Boosting Machine (LGBM) and K-Nearest Neighbor (KNN). This research is aimed at developing a machine learning model for detecting the occurrence of diabetes in patients. The performance metrics employed in evaluating the finding for this study are Receiver Operating Characteristics (ROC) Curve, Five-fold Cross-validation, precision, and accuracy score. The proposed system had an accuracy of 91% and the area under the Receiver Operating Characteristic Curve was 93%. The experimental result shows that the prediction accuracy of the hybrid model is better than traditional machine learning

Download Full-text

Modeling of nitrogen solubility in normal alkanes using machine learning methods compared with cubic and PC-SAFT equations of state

Scientific Reports ◽

10.1038/s41598-021-03643-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Seyed Ali Madani ◽

Mohammad-Reza Mohammadi ◽

Saeid Atashrouz ◽

Ali Abedi ◽

Abdolhossein Hemmati-Sarapardeh ◽

...

Keyword(s):

Machine Learning ◽

Molecular Weight ◽

Oil Recovery ◽

Equations Of State ◽

Coefficient Of Determination ◽

Gradient Boosting ◽

Operating Pressure ◽

Normal Alkanes ◽

Light Gradient ◽

Extreme Gradient Boosting

AbstractAccurate prediction of the solubility of gases in hydrocarbons is a crucial factor in designing enhanced oil recovery (EOR) operations by gas injection as well as separation, and chemical reaction processes in a petroleum refinery. In this work, nitrogen (N2) solubility in normal alkanes as the major constituents of crude oil was modeled using five representative machine learning (ML) models namely gradient boosting with categorical features support (CatBoost), random forest, light gradient boosting machine (LightGBM), k-nearest neighbors (k-NN), and extreme gradient boosting (XGBoost). A large solubility databank containing 1982 data points was utilized to establish the models for predicting N2 solubility in normal alkanes as a function of pressure, temperature, and molecular weight of normal alkanes over broad ranges of operating pressure (0.0212–69.12 MPa) and temperature (91–703 K). The molecular weight range of normal alkanes was from 16 to 507 g/mol. Also, five equations of state (EOSs) including Redlich–Kwong (RK), Soave–Redlich–Kwong (SRK), Zudkevitch–Joffe (ZJ), Peng–Robinson (PR), and perturbed-chain statistical associating fluid theory (PC-SAFT) were used comparatively with the ML models to estimate N2 solubility in normal alkanes. Results revealed that the CatBoost model is the most precise model in this work with a root mean square error of 0.0147 and coefficient of determination of 0.9943. ZJ EOS also provided the best estimates for the N2 solubility in normal alkanes among the EOSs. Lastly, the results of relevancy factor analysis indicated that pressure has the greatest influence on N2 solubility in normal alkanes and the N2 solubility increases with increasing the molecular weight of normal alkanes.

Download Full-text

Integrated Model for COVID-19 Diagnosis Based on Computed Tomography AI, and Clinical Features: A Multicenter Cohort Study

10.21203/rs.3.rs-979599/v1 ◽

2021 ◽

Author(s):

Yuki Kataoka ◽

Yuya Kimura ◽

Tatsuyoshi Ikenoue ◽

Yoshinori Matsuoka ◽

Junji Kumasawa ◽

...

Keyword(s):

Machine Learning ◽

Computed Tomography ◽

Cohort Study ◽

Clinical Features ◽

Tertiary Care ◽

Gradient Boosting ◽

Diagnostic Model ◽

Full Model ◽

Light Gradient ◽

Better Than

Abstract Background We developed and validated a machine learning diagnostic model for novel coronavirus (COVID-19) disease, integrating artificial-intelligence-based computed tomography (CT) imaging and clinical features. Methods We conducted a retrospective cohort study in 11 Japanese tertiary care facilities that treated COVID-19 patients. Participants were tested using both real-time reverse transcription polymerase chain reaction (RT-PCR) and chest CT between January 1 and May 30, 2020. We chronologically split the dataset in each hospital into training and test sets, containing patients in a 7:3 ratio. Light Gradient Boosting Machine model was used for analysis. Results A total of 703 patients were included with two models — the full model and the A-blood model — developed for their diagnosis. The A-blood model included eight variables (the Ali-M3 confidence, along with seven clinical features of blood counts and biochemistry markers). The areas under the receiver-operator curve of both models (0.91, 95% confidence interval (CI), 0.86 to 0.95 for the full model and 0.90, 95% CI, 0.86 to 0.94 for the A-blood model) were better than that of the Ali-M3 confidence (0.78, 95% CI, 0.71 to 0.83) in the test set. Conclusions The A-blood model, a COVID-19 diagnostic model developed in this study, combines machine-learning and CT evaluation with blood test data and is better than the Ali-M3 framework existing for this purpose. This would significantly aid physicians in making a quicker diagnosis of COVID-19.

Download Full-text

Cubical homology-based Image Classification - A Comparative Study

10.36939/ir.202112231202 ◽

2021 ◽

Author(s):

◽

Seungho Choe

Keyword(s):

Machine Learning ◽

Image Classification ◽

Digital Image ◽

Persistent Homology ◽

Topological Data Analysis ◽

Connected Components ◽

Gradient Boosting ◽

Topological Features ◽

Light Gradient ◽

Cubical Homology

Persistent homology is a powerful tool in topological data analysis (TDA) to compute, study and encode efficiently multi-scale topological features and is being increasingly used in digital image classification. The topological features represent number of connected components, cycles, and voids that describe the shape of data. Persistent homology extracts the birth and death of these topological features through a filtration process. The lifespan of these features can represented using persistent diagrams (topological signatures). Cubical homology is a more efficient method for extracting topological features from a 2D image and uses a collection of cubes to compute the homology, which fits the digital image structure of grids. In this research, we propose a cubical homology-based algorithm for extracting topological features from 2D images to generate their topological signatures. Additionally, we propose a score, which measures the significance of each of the sub-simplices in terms of persistence. Also, gray level co-occurrence matrix (GLCM) and contrast limited adapting histogram equalization (CLAHE) are used as a supplementary method for extracting features. Machine learning techniques are then employed to classify images using the topological signatures. Among the eight tested algorithms with six published image datasets with varying pixel sizes, classes, and distributions, our experiments demonstrate that cubical homology-based machine learning with deep residual network (ResNet 1D) and Light Gradient Boosting Machine (lightGBM) shows promise with the extracted topological features.

Download Full-text