Gully Erosion Susceptibility Mapping in Highly Complex Terrain Using Machine Learning Models

Gully erosion is the most severe type of water erosion and is a major land degradation process. The efficiency and interpretability of gully erosion susceptibility mapping (GESM) remain a challenge, especially in areas of complex terrain. In this study, a WoE-MLC model, which combines machine learning classification algorithms with the statistical weight of evidence (WoE) model, was applied to this problem in the Loess Plateau. The three machine learning (ML) algorithms used were random forest (RF), gradient boosted decision trees (GBDT), and extreme gradient boosting (XGBoost). The results showed that: (1) both the machine learning regression models and the WoE-MLC models predicted GESM well, with area under the curve (AUC) values greater than 0.92 in both cases, and the latter were more computationally efficient and interpretable; (2) the XGBoost algorithm was more efficient in GESM than the other two algorithms, with the strongest generalization ability and the best performance in avoiding overfitting (averaged AUC = 0.947), followed by the RF algorithm (averaged AUC = 0.944) and the GBDT algorithm (averaged AUC = 0.938); and (3) slope gradient, land use, and altitude were the main factors for GESM. This study may provide a possible method for gully erosion susceptibility mapping at large scale.
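The abstract does not give the WoE formula. As a rough illustration only, the sketch below assumes the commonly used positive weight, the log ratio of the share of gully cells falling in a factor class to the share of non-gully cells in that class. The function name and all counts are hypothetical and are not taken from the study.

```python
import math

def weight_of_evidence(gully_in_class, gully_total, nongully_in_class, nongully_total):
    """Positive weight W+ = ln(P(class | gully) / P(class | non-gully))."""
    p_class_given_gully = gully_in_class / gully_total
    p_class_given_nongully = nongully_in_class / nongully_total
    return math.log(p_class_given_gully / p_class_given_nongully)

# Hypothetical counts for one slope-gradient class (assumed, not from the study).
w = weight_of_evidence(gully_in_class=60, gully_total=200,
                       nongully_in_class=300, nongully_total=4000)
print(round(w, 3))  # ln(0.30 / 0.075) = ln(4)
```

A positive weight means the class is over-represented among gully cells; weights of this kind could then serve as numeric inputs to a classifier, which is presumably where the interpretability of a combined model comes from.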


Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework

Journal of Diabetes Research ◽

10.1155/2020/6873891 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Mingyue Xue ◽

Yinxia Su ◽

Chen Li ◽

Shuxia Wang ◽

Hua Yao

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Decision Tree ◽

Type II Diabetes ◽

Large Scale ◽

Systolic Pressure ◽

Gradient Boosting ◽

Significant Feature ◽

Type II ◽

Extreme Gradient Boosting

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world’s health expenditures, and the number continues to grow, placing a huge burden on healthcare systems, especially in remote, underserved areas. Methods. A total of 584,168 adult subjects who had participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios, using logistic regression (LR) based on variables from physical measurements and a questionnaire. Using the risk factors selected by LR, we applied a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output variable importance scores for T2DM. Results. XGBoost had the best performance (accuracy = 0.906, precision = 0.910, recall = 0.902, F1 = 0.906, and AUC = 0.968). The variable importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed an LR-XGBoost classifier that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential cases of T2DM. The classifier can accurately screen for diabetes risk in the early phase, and the variable importance scores give clues for preventing the occurrence of diabetes.
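The F1 value reported for XGBoost can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does only that arithmetic; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for XGBoost in the abstract.
reported_precision, reported_recall = 0.910, 0.902
f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # 0.906
```

The result agrees with the reported F1 of 0.906 to three decimal places.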


Comparison of Machine-Learning Algorithms for Near-Surface Air-Temperature Estimation from FY-4A AGRI Data

Advances in Meteorology ◽

10.1155/2020/8887364 ◽

2020 ◽

Vol 2020 ◽

pp. 1-14

Author(s):

Ke Zhou ◽

Hailei Liu ◽

Xiaobo Deng ◽

Hao Wang ◽

Shenglan Zhang

Keyword(s):

Machine Learning ◽

Air Temperature ◽

Large Scale ◽

Weather Prediction ◽

Surface Air Temperature ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Temporal And Spatial Distribution ◽

Near Surface ◽

Extreme Gradient Boosting

Six machine-learning approaches, including multivariate linear regression (MLR), gradient boosting decision tree, k-nearest neighbors, random forest, extreme gradient boosting (XGB), and deep neural network (DNN), were compared for near-surface air-temperature (Tair) estimation from the new generation of Chinese geostationary meteorological satellite Fengyun-4A (FY-4A) observations. The brightness temperatures in split-window channels from the Advanced Geostationary Radiation Imager (AGRI) of FY-4A and numerical weather prediction data from the global forecast system were used as the predictor variables for Tair estimation. The performance of each model and the temporal and spatial distribution of the estimated Tair errors were analyzed. The results showed that the XGB model had better overall performance, with R² of 0.902, bias of −0.087°C, and root-mean-square error of 1.946°C. The spatial variation characteristics of the Tair error of the XGB method were less obvious than those of the other methods. The XGB model can provide more stable and high-precision Tair for large-scale Tair estimation over China and can serve as a reference for Tair estimation based on machine-learning models.
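The three scores quoted for the XGB model can be computed from paired observed and estimated temperatures. The sketch below is one plausible reading, not the paper's code: it assumes bias is the mean of estimated minus observed and that R² is the squared Pearson correlation, neither of which the abstract states. The sample values are made up.

```python
import math

def regression_scores(observed, estimated):
    """Return (r_squared, bias, rmse) for paired observations and estimates."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]  # assumed sign convention
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n
    covariance = sum((o - mean_obs) * (e - mean_est) for o, e in zip(observed, estimated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_est = sum((e - mean_est) ** 2 for e in estimated)
    r_squared = covariance ** 2 / (var_obs * var_est)  # squared Pearson correlation
    return r_squared, bias, rmse

# Made-up temperatures in degrees Celsius, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
estimated = [11.0, 11.0, 15.0, 15.0]
r_squared, bias, rmse = regression_scores(observed, estimated)
print(r_squared, bias, rmse)
```

Here the errors alternate in sign, so the bias is zero even though the RMSE is 1°C, which is why the two are usually reported together.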


Prediction of the Disappearance of Companies From the Market in Bogotá, Colombia Using Machine Learning

Advances in Logistics, Operations, and Management Science - Handbook of Research on Management Techniques and Sustainability Strategies for Handling Disruptive Situations in Corporate Settings ◽

10.4018/978-1-7998-8185-8.ch011 ◽

2021 ◽

pp. 227-246

Author(s):

William Stive Fajardo-Moreno ◽

Rubén Dario Acosta Velásquez ◽

Ivan Dario Castaño Pérez ◽

Leonardo Espinosa-Leal

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Area Under The Curve ◽

Local Economy ◽

Gradient Boosting ◽

Grid Search ◽

Extreme Gradient Boosting ◽

Learning Machine ◽

Available Information ◽

Fold Cross Validation

In this chapter, the results of modeling companies' disappearance from Bogotá's market using machine learning methods are presented. The authors use the available information from Bogotá's Chamber of Commerce, where companies are registered yearly. The dataset covers the years 2017 to 2020 and contains almost 3 million records. In this work, a deep analysis of the different features of the data is presented and explained. Next, four state-of-the-art machine learning models are trained for comparison: logistic regression (LR), extreme learning machine (ELM), random forest (RF), and extreme gradient boosting (XGBoost), all with five-fold cross-validation and 50 steps in the randomized grid search. All methods showed excellent performance, with an average area under the curve (AUC) of 0.895, and the last algorithm, XGBoost, was the best overall (0.97). These results agree with the state-of-the-art values in the field and will be of paramount importance in assessing companies' stability for Bogotá's local economy.
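A randomized search with 50 steps can be pictured as drawing 50 hyperparameter combinations at random from a grid of candidate values, then scoring each with cross-validation. The sketch below shows only the sampling step. The grid, parameter names, and function name are hypothetical, since the chapter's search space is not given in the abstract.

```python
import random

def sample_configurations(grid, n_steps, seed=0):
    """Draw n_steps random hyperparameter combinations from a grid of candidate values."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_steps)]

# Hypothetical search space (assumed for illustration, not from the chapter).
grid = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}
configs = sample_configurations(grid, n_steps=50)
print(len(configs), configs[0])
```

Sampling here is with replacement, so the same combination may be drawn more than once. Each sampled configuration would then be evaluated on the five folds and the best mean score kept.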


MRI Radiomic Features to Predict IDH1 Mutation Status in Gliomas: A Machine Learning Approach using Gradient Tree Boosting

International Journal of Molecular Sciences ◽

10.3390/ijms21218004 ◽

2020 ◽

Vol 21 (21) ◽

pp. 8004

Author(s):

Yu Sakai ◽

Chen Yang ◽

Shingo Kihira ◽

Nadejda Tsankova ◽

Fahad Khan ◽

...

Keyword(s):

Machine Learning ◽

Characteristic Curve ◽

Area Under The Curve ◽

Prognostic Indicator ◽

IDH1 Mutation ◽

Gradient Boosting ◽

Isocitrate Dehydrogenase 1 ◽

Test Set ◽

Mutation Status ◽

Extreme Gradient Boosting

In patients with gliomas, isocitrate dehydrogenase 1 (IDH1) mutation status has been studied as a prognostic indicator. Recent advances in machine learning (ML) have demonstrated promise in utilizing radiomic features to study disease processes in the brain. We investigate whether ML analysis of multiparametric radiomic features from preoperative Magnetic Resonance Imaging (MRI) can predict IDH1 mutation status in patients with glioma. This retrospective study included patients with glioma with known IDH1 status and preoperative MRI. Radiomic features were extracted from Fluid-Attenuated Inversion Recovery (FLAIR) and Diffusion-Weighted Imaging (DWI). The dataset was split into training, validation, and testing sets by stratified sampling. Synthetic Minority Oversampling Technique (SMOTE) was applied to the training sets. eXtreme Gradient Boosting (XGBoost) classifiers were trained, and the hyperparameters were tuned. Receiver operating characteristic (ROC) curves, accuracy, and F1-scores were collected. A total of 100 patients (age: 55 ± 15, M/F 60/40) with IDH1 mutant (n = 22) and IDH1 wildtype (n = 78) tumors were included. The best performance was seen with a DWI-trained XGBoost model, which achieved an ROC Area Under the Curve (AUC) of 0.97, accuracy of 0.90, and F1-score of 0.75 on the test set. The FLAIR-trained XGBoost model achieved an ROC AUC of 0.95, accuracy of 0.90, and F1-score of 0.75 on the test set. A model trained on combined FLAIR-DWI radiomic features did not provide incremental accuracy. The results show that an XGBoost classifier using multiparametric radiomic features derived from preoperative MRI can predict IDH1 mutation status with > 90% accuracy.
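With 22 mutant and 78 wildtype cases, a stratified split keeps roughly the same class ratio in each subset. The sketch below illustrates the idea with the class counts from the abstract. The 60/20/20 fractions, the seed, and the function name are assumptions, as the abstract does not give the split sizes.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train_idx.extend(indices[:n_train])
        val_idx.extend(indices[n_train:n_train + n_val])
        test_idx.extend(indices[n_train + n_val:])  # remainder goes to the test set
    return train_idx, val_idx, test_idx

# Class counts from the abstract; split fractions are assumed.
labels = ["mutant"] * 22 + ["wildtype"] * 78
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Oversampling such as SMOTE would then be applied to the training indices only, as the abstract describes, so that synthetic samples never leak into validation or test data.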


A Study of Machine-Learning Classifiers for Hypertension Based on Radial Pulse Wave

BioMed Research International ◽

10.1155/2018/2964816 ◽

2018 ◽

Vol 2018 ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Zhi-yu Luo ◽

Ji Cui ◽

Xiao-juan Hu ◽

Li-ping Tu ◽

Hai-dan Liu ◽

...

Keyword(s):

Machine Learning ◽

Chinese Medicine ◽

Pulse Wave ◽

Dynamic Change ◽

Area Under The Curve ◽

Research Direction ◽

Disease Diagnosis ◽

Gradient Boosting ◽

Machine Learning Classification ◽

Digital Pulse

Objective. In this study, machine learning was used to classify and predict the pulse waves of a hypertensive group and a healthy group, to assess the risk of hypertension by observing the dynamic change of the pulse wave, and to provide an objective reference for the clinical application of pulse diagnosis in traditional Chinese medicine (TCM). Method. Basic information from 450 hypertensive cases and 479 healthy cases was collected with self-developed H20 questionnaires, and pulse wave information was acquired with a self-developed pulse diagnostic instrument (PDA-1). The H20 questionnaires and pulse wave information were used as input variables to obtain different machine learning classification models of hypertension. The method was aimed at analyzing the influence of the pulse wave on the accuracy and stability of the machine learning models, as well as the feature contributions of the hypertension model after removing noise by K-means. Result. Compared with the classification results before removing noise, the accuracy and the area under the curve (AUC) improved. The accuracy rates of AdaBoost, Gradient Boosting, and Random Forest (RF) were 86.41%, 86.41%, and 85.33%, respectively, and the AUCs were 0.86, 0.86, and 0.85, respectively. The maximum accuracy of SVM increased from 79.57% to 83.15%, and the AUC stability increased from 0.79 to 0.83. In addition, the important features identified by traditional statistics and by machine learning were consistent. After removing noise, the features with large changes were h1/t1, w1/t, t, w2, h2, t1, and t5 in AdaBoost and Gradient Boosting (top 10). The common variables for machine learning and traditional statistics were h1/t1, h5, t, Ad, BMI, and t2. Conclusion. The pulse wave-based diagnostic method for hypertension has significant reference value. Given the feasibility of digital pulse wave diagnosis and of dynamically evaluating hypertension, it provides a research direction and foundation for Chinese medicine in the dynamic evaluation of modern disease diagnosis and curative effect.


Machine Learning Models for COVID-19 Detection in Brazil Based on Symptoms (Preprint)

10.2196/preprints.27293 ◽

2021 ◽

Author(s):

Íris Viana dos Santos Santana ◽

Andressa C. M. da Silveira ◽

Álvaro Sobrinho ◽

Lenardo Chaves e Silva ◽

Leandro Dias da Silva ◽

...

Keyword(s):

Machine Learning ◽

Early Stage ◽

Area Under The Curve ◽

Supervised Machine Learning ◽

Gradient Boosting ◽

Support Vector ◽

Accuracy Score ◽

K Nearest Neighbors ◽

Runny Nose ◽

Extreme Gradient Boosting

Background. Controlling the COVID-19 outbreak in Brazil is considered a challenge of continental proportions due to the high population and urban density, weak implementation and maintenance of social distancing strategies, and limited testing capabilities. Objective. To contribute to addressing this challenge, we present the implementation and evaluation of supervised Machine Learning (ML) models to assist COVID-19 detection in Brazil based on early-stage symptoms. Methods. First, we conducted data preprocessing and applied the Chi-squared test to a Brazilian dataset, mainly composed of early-stage symptoms, to perform statistical analyses. Afterward, we implemented ML models using the Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), Decision Tree (DT), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost) algorithms. We evaluated the ML models using precision, accuracy score, recall, the area under the curve, and the Friedman and Nemenyi tests. Based on the comparison, we grouped the top five ML models and measured feature importance. Results. The MLP model presented the highest mean accuracy score, with more than 97.85%, when compared to GBM (> 97.39%), RF (> 97.36%), DT (> 97.07%), XGBoost (> 97.06%), KNN (> 95.14%), and SVM (> 94.27%). Based on the statistical comparison, we grouped MLP, GBM, DT, RF, and XGBoost as the top five ML models because their evaluation results are statistically indistinguishable. The importance of the features used by the ML models during prediction varies across gender, profession, fever, sore throat, dyspnea, olfactory disorder, cough, runny nose, taste disorder, and headache. Conclusions. Supervised ML models effectively assist decision making in medical diagnosis and public administration (e.g., testing strategies), based on early-stage symptoms that do not require advanced and expensive exams.


Intercomparing the robustness of machine learning models in simulation and forecasting of streamflow

Journal of Water and Climate Change ◽

10.2166/wcc.2020.365 ◽

2020 ◽

Author(s):

Parthiban Loganathan ◽

Amit Baburao Mahindrakar

Keyword(s):

Machine Learning ◽

Large Scale ◽

Resource Planning ◽

Machine Learning Techniques ◽

Low Flow ◽

Gradient Boosting ◽

Daily Streamflow ◽

Learning Techniques ◽

Extreme Gradient Boosting ◽

Hydrological Indices

Streamflow simulation and discharge prediction were intercompared using various well-known machine learning techniques. A daily streamflow discharge model was developed for 35 observation stations located in the large-scale Cauvery river basin. Various hydrological indices were calculated for observed and predicted discharges to compare and evaluate how well local hydrological conditions were replicated. The model variance and bias observed for the proposed extreme gradient boosting decision tree model were less than 15%, and they were compared with those of the other machine learning techniques considered in this study. The model's Nash–Sutcliffe efficiency and coefficient of determination values are above 0.7 for both the training and testing phases, which demonstrates the effectiveness of the model. The comparison of monthly observed and model-predicted discharges during the validation period illustrates the model's ability to represent the peaks and falls in high-, medium-, and low-flow zones. The assessment and comparison of hydrological indices between observed and predicted discharges illustrate the model's ability to represent the baseflow, high-spell, and low-spell statistics. Simulating streamflow and predicting discharge are essential for water resource planning and management, especially in large-scale river basins. The proposed machine learning technique demonstrates significant improvement in model efficiency by reducing variance and bias, which, in turn, improves the replicability of local-scale hydrology.
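The Nash–Sutcliffe efficiency (NSE) cited above compares the model's squared errors with the variance of the observations: 1 is a perfect fit, and 0 means the model does no better than predicting the observed mean. A minimal sketch of the standard formula follows; the discharge values are made up for illustration.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - residual / spread

# Made-up daily discharges, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
simulated = [3.0, 4.0, 5.0, 8.0]
nse = nash_sutcliffe_efficiency(observed, simulated)
print(nse)
```

By this measure, the threshold of 0.7 reported in the abstract means the model's squared error is under 30% of the variance of the observed discharge.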


Development of Machine Learning Strategy for Predicting the Risk Range of Ship’s Berthing Velocity

Journal of Marine Science and Engineering ◽

10.3390/jmse8050376 ◽

2020 ◽

Vol 8 (5) ◽

pp. 376

Author(s):

Hyeong-Tak Lee ◽

Jeong-Seok Lee ◽

Woo-Ju Son ◽

Ik-Soon Cho

Keyword(s):

Machine Learning ◽

Learning Strategy ◽

Characteristic Curve ◽

Confusion Matrix ◽

Area Under The Curve ◽

Gradient Boosting ◽

Classification Algorithms ◽

Factors Affecting ◽

Machine Learning Classification ◽

The Republic

Ships are prone to accidents when approaching at a berthing velocity greater than that allowed for the risk range determined for a port. Therefore, this study develops a machine learning strategy to predict the risk range of an unsafe berthing velocity when a ship approaches a port. For the analysis, the input parameters were based on the factors affecting the berthing velocity, and the output parameter, i.e., the berthing velocity, was measured at a tanker terminal in the Republic of Korea. Nine machine learning classification algorithms were used to build the models, and the top four models were selected through evaluation methods based on the confusion matrix. The analysis identified extra trees, random forest, bagging, and gradient boosting classifiers as good models. Testing with the receiver operating characteristic curve confirmed that the area under the curve was highest for the most dangerous range of berthing velocity; thus, the risk range was appropriately classified. As such, the derived models can classify and predict the risk range of unsafe berthing velocity before a ship approaches a port; therefore, it is possible to berth a ship safely.
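The area under the ROC curve for a single risk class can be read as the probability that a randomly chosen member of that class receives a higher score than a randomly chosen non-member. The sketch below computes it that way for a one-vs-rest setup. The one-vs-rest framing, the function name, and the sample scores are assumptions for illustration, not details from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# 1 = berthing falls in the risk range of interest, 0 = any other range (made-up data).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

This pairwise form is quadratic in the number of samples, which is fine for a small illustration; library implementations typically sort the scores instead.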


An Autoencoder and Machine Learning Model to Predict Suicidal Ideation with Brain Structural Imaging

Journal of Clinical Medicine ◽

10.3390/jcm9030658 ◽

2020 ◽

Vol 9 (3) ◽

pp. 658 ◽

Cited By ~ 1

Author(s):

Jun-Cheng Weng ◽

Tung-Yeh Lin ◽

Yuan-Hsiung Tsai ◽

Man Teng Cheok ◽

Yi-Peng Eve Chang ◽

...

Keyword(s):

Machine Learning ◽

Suicidal Ideation ◽

Learning Algorithm ◽

Area Under The Curve ◽

Learning Model ◽

Supervised Machine Learning ◽

Gradient Boosting ◽

Machine Learning Model ◽

Extreme Gradient Boosting ◽

Depressive Patients

It is estimated that at least one million people die by suicide every year, showing the importance of suicide prevention and detection. In this study, an autoencoder and machine learning model was employed to predict people with suicidal ideation based on their structural brain imaging. The subjects in our generalized q-sampling imaging (GQI) dataset consisted of three groups: 41 depressive patients with suicidal ideation (SI), 54 depressive patients without suicidal thoughts (NS), and 58 healthy controls (HC). In the GQI dataset, indices of generalized fractional anisotropy (GFA), isotropic values of the orientation distribution function (ISO), and normalized quantitative anisotropy (NQA) were separately trained in different machine learning models. A convolutional neural network (CNN)-based autoencoder model, the supervised machine learning algorithm extreme gradient boosting (XGB), and logistic regression (LR) were used to discriminate SI subjects from NS and HC subjects. After five-fold cross validation, separate data were tested to obtain the accuracy, sensitivity, specificity, and area under the curve of each result. Our results showed that the best pattern of structure across multiple brain locations can classify suicidal ideates from NS and HC with a prediction accuracy of 85%, a specificity of 100% and a sensitivity of 75%. The algorithms developed here might provide an objective tool to help identify suicidal ideation risk among depressed patients alongside clinical assessment.
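Sensitivity, specificity, and accuracy all follow from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts chosen only so that they reproduce the reported rates (75% sensitivity, 100% specificity, 85% accuracy). The abstract does not give the test-set counts, so these are not the study's actual numbers.

```python
def confusion_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true SI subjects that were detected
    specificity = tn / (tn + fp)   # share of non-SI subjects correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts consistent with the reported rates (assumed, not from the paper).
sensitivity, specificity, accuracy = confusion_rates(tp=9, fn=3, tn=8, fp=0)
print(sensitivity, specificity, accuracy)
```

A specificity of 100% with a sensitivity of 75% describes a classifier that raises no false alarms in the test data but misses a quarter of the true cases.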


Importance of GWAS risk loci and clinical data in predicting asthma using machine-learning approaches

10.21203/rs.3.rs-21271/v1 ◽

2020 ◽

Author(s):

Si-Qiao Liang ◽

Jian-Xiong Long ◽

Jingmin Deng ◽

Xuan Wei ◽

Mei-Ling Yang ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Clinical Data ◽

Genome Wide Association Study ◽

Prediction Models ◽

Area Under The Curve ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Extreme Gradient Boosting

Asthma is a serious immune-mediated respiratory airway disease. Its pathological processes involve genetics and the environment, but they remain unclear. To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data to predict asthma using machine-learning approaches. A case–control study with 123 asthma patients and 100 healthy controls was conducted in the Zhuang population in Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches (e.g., extreme gradient boosting [XGBoost], decision tree, support vector machine, and random forest algorithms) were used to identify the major factors that contributed to asthma. A total of 14 GWAS risk loci with clinical data were analyzed on the basis of 10 repetitions of 10-fold cross-validation for all machine-learning models. Using GWAS risk loci alone or clinical data alone, the best performances were area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data can improve performance. We then ranked the features by importance and found that the top six risk factors for predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma and thus provide insights into the pathogenesis of asthma. Further research is required to evaluate more genetic markers and clinical data and to predict asthma risk.
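Repeating 10-fold cross-validation 10 times yields 100 train/test partitions, which steadies the performance estimate on a sample as small as 223 subjects. The sketch below generates such partitions by index. The shuffling scheme, seed, and function name are assumptions, and the study may also have stratified by case/control status.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, held_out_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random ordering for each repeat
        for fold in range(n_folds):
            held_out = indices[fold::n_folds]  # every n_folds-th shuffled index
            held_out_set = set(held_out)
            train = [i for i in indices if i not in held_out_set]
            yield train, held_out

# 123 asthma patients + 100 healthy controls, as reported in the abstract.
splits = list(repeated_kfold(n_samples=223))
print(len(splits))  # 100
```

A model would be fitted on each `train` list and scored on the matching `held_out` list, with the 100 scores then averaged.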
