stochastic gradient boosting
Recently Published Documents


TOTAL DOCUMENTS

103
(FIVE YEARS 60)

H-INDEX

16
(FIVE YEARS 5)

2022 ◽  
Vol 4 ◽  
Author(s):  
Matthew D. Stocker ◽  
Yakov A. Pachepsky ◽  
Robert L. Hill

The microbial quality of irrigation water is an important issue as the use of contaminated waters has been linked to several foodborne outbreaks. To expedite microbial water quality determinations, many researchers estimate concentrations of the microbial contamination indicator Escherichia coli (E. coli) from the concentrations of physiochemical water quality parameters. However, these relationships are often non-linear and exhibit changes above or below certain threshold values. Machine learning (ML) algorithms have been shown to make accurate predictions in datasets with complex relationships. The purpose of this work was to evaluate several ML models for the prediction of E. coli in agricultural pond waters. Two ponds in Maryland were monitored from 2016 to 2018 during the irrigation season. E. coli concentrations along with 12 other water quality parameters were measured in water samples. The resulting datasets were used to predict E. coli using stochastic gradient boosting (SGB) machines, random forest (RF), support vector machines (SVM), and k-nearest neighbor (kNN) algorithms. The RF model provided the lowest RMSE value for predicted E. coli concentrations in both ponds in individual years and over consecutive years in almost all cases. For individual years, the RMSE of the predicted E. coli concentrations (log10 CFU 100 ml−1) ranged from 0.244 to 0.346 and 0.304 to 0.418 for Pond 1 and 2, respectively. For the 3-year datasets, these values were 0.334 and 0.381 for Pond 1 and 2, respectively. In most cases there was no significant difference (P > 0.05) between the RMSE of RF and other ML models when these RMSE were treated as statistics derived from 10-fold cross-validation performed with five repeats. Important E. coli predictors were turbidity, dissolved organic matter content, specific conductance, chlorophyll concentration, and temperature. Model predictive performance did not significantly differ when 5 predictors were used vs. 8 or 12, indicating that more tedious and costly measurements provide no substantial improvement in the predictive accuracy of the evaluated algorithms.
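
The evaluation protocol named in this abstract, 10-fold cross-validation with five repeats scored by RMSE, can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' code: the helper names (`repeated_kfold`, `rmse`, `cross_val_rmse`) and the `fit`/`predict` callables are hypothetical, and any model could be plugged in.

```python
import math
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation.

    Each repeat reshuffles the sample indices and cuts them into n_splits
    disjoint folds; every fold serves once as the held-out test set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_val_rmse(fit, predict, X, y, n_splits=10, n_repeats=5, seed=0):
    """Return one held-out RMSE per fold (n_splits * n_repeats values).

    `fit(X_train, y_train)` returns a model and `predict(model, X_test)`
    returns a list of predictions; both are supplied by the caller.
    """
    scores = []
    for train_idx, test_idx in repeated_kfold(len(y), n_splits, n_repeats, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        scores.append(rmse([y[i] for i in test_idx], preds))
    return scores
```

With the defaults this yields 50 RMSE values per model, which is the kind of sample on which two models' errors can then be compared statistically.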


2022 ◽  
Vol 12 ◽  
Author(s):  
Yu-Chi Lee ◽  
Jacob J. Christensen ◽  
Laurence D. Parnell ◽  
Caren E. Smith ◽  
Jonathan Shao ◽  
...  

Obesity is associated with many chronic diseases that impair healthy aging and is governed by genetic, epigenetic, and environmental factors and their complex interactions. This study aimed to develop a model that predicts an individual’s risk of obesity by better characterizing these complex relations and interactions focusing on dietary factors. For this purpose, we conducted a combined genome-wide and epigenome-wide scan for body mass index (BMI) and up to three-way interactions among 402,793 single nucleotide polymorphisms (SNPs), 415,202 DNA methylation sites (DMSs), and 397 dietary and lifestyle factors using the generalized multifactor dimensionality reduction (GMDR) method. The training set consisted of 1,573 participants in exam 8 of the Framingham Offspring Study (FOS) cohort. After identifying genetic, epigenetic, and dietary factors that passed statistical significance, we applied machine learning (ML) algorithms to predict participants’ obesity status in the test set, taken as a subset of independent samples (n = 394) from the same cohort. The quality and accuracy of prediction models were evaluated using the area under the receiver operating characteristic curve (ROC-AUC). GMDR identified 213 SNPs, 530 DMSs, and 49 dietary and lifestyle factors as significant predictors of obesity. Comparing several ML algorithms, we found that the stochastic gradient boosting model provided the best prediction accuracy for obesity with an overall accuracy of 70%, with ROC-AUC of 0.72 in test set samples. Top predictors of the best-fit model were 21 SNPs, 230 DMSs in genes such as CPT1A, ABCG1, SLC7A11, RNF145, and SREBF1, and 26 dietary factors, including processed meat, diet soda, French fries, high-fat dairy, artificial sweeteners, alcohol intake, and specific nutrients and food components, such as calcium and flavonols. In conclusion, we developed an integrated approach with ML to predict obesity using omics and dietary data. 
This extends our knowledge of the drivers of obesity, which can inform precision nutrition strategies for the prevention and treatment of obesity. Clinical Trial Registration: [www.ClinicalTrials.gov], the Framingham Heart Study (FHS), [NCT00005121].
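
The two metrics reported in this abstract, overall accuracy and ROC-AUC, can be computed from predicted scores as in the sketch below. It uses the standard rank interpretation of ROC-AUC (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half). The function names and the 0.5 decision threshold are assumptions made for illustration; the abstract does not state how the authors computed the metrics.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels; scores: predicted scores,
    where a higher score means "more likely positive".
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at the given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == label) for p, label in zip(predicted, labels)) / len(labels)
```

The pairwise loop is quadratic in the number of samples, which is fine for a test set of a few hundred participants; a sort-based version would be preferable for much larger sets.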


Geosciences ◽  
2021 ◽  
Vol 12 (1) ◽  
pp. 15
Author(s):  
Florian Uhl ◽  
Trine Græsdal Rasmussen ◽  
Natascha Oppelt

Along the Baltic coastline of Germany, drifting vegetation and beach cast create overlays at the otherwise sandy or stony beaches. These overlays influence the morphodynamics and structures of the beaches. To better understand the influence of these patchy habitats on coastal environments, regular monitoring is necessary. Most studies, however, have been conducted on spatially larger and temporally more stable occurrences of aquatic vegetation such as floating fields of Sargassum. Nevertheless, drifting vegetation and beach cast pose a particular challenge, as they exhibit high temporal dynamics and sometimes small spatial extent. Regular surveys and mappings are the traditional methods to record their habitats, but they are time-consuming and cost-intensive. Spaceborne remote sensing can provide frequent recordings of the coastal zone at lower cost. Our study therefore aims at the monitoring of drifting vegetation and beach cast on spatial scales between 3 and 10 m. We developed an automated coastline masking algorithm and tested six supervised classification methods and various classification ensembles for their suitability to detect small-scale assemblages of drifting vegetation and beach cast in a study area at the coastline of the Western Baltic Sea using multispectral data of the sensors Sentinel-2 MSI and PlanetScope. The shoreline masking algorithm shows high accuracies in masking the land area while preserving the sand-covered shoreline. We achieved the best classification results using PlanetScope data with an ensemble of a random forest classifier, CART classifier, support vector machine classifier, naïve Bayes classifier and stochastic gradient boosting classifier. This ensemble accomplished a combined F1-score of 0.95. The accuracy of the Sentinel-2 classifications was lower but still achieved a combined F1-score of 0.86 for the same ensemble. 
The results of this study can be considered as a starting point for the development of time series analysis of the vegetation dynamics along Baltic beaches.
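
As a rough illustration of a classification ensemble scored by F1, the sketch below combines per-model label predictions by majority vote and computes the F1-score for one class. Majority voting is an assumption here: the abstract does not say how the five classifiers were combined or how the "combined" F1-score was aggregated. The function names and class labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    With an odd number of models and two classes there are no ties;
    otherwise a tie goes to the label encountered first among the votes.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def f1_score(y_true, y_pred, positive):
    """F1 = 2*TP / (2*TP + FP + FN) for the class named `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)
```

In this toy setting an ensemble can score higher than any single member when the members err on different samples, which is the usual motivation for combining heterogeneous classifiers.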


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Ch. Anwar ul Hassan ◽  
Jawaid Iqbal ◽  
Saddam Hussain ◽  
Hussain AlSalman ◽  
Mogeeb A. A. Mosleh ◽  
...  

In the domains of computational and applied mathematics, soft computing, fuzzy logic, and machine learning (ML) are well-known research areas. ML is one of the computational intelligence aspects that may address diverse difficulties in a wide range of applications and systems when it comes to exploitation of historical data. Predicting medical insurance costs using ML approaches is still a problem in the healthcare industry that requires investigation and improvement. Using a series of machine learning algorithms, this study provides a computational intelligence approach for predicting healthcare insurance costs. The proposed research approach uses Linear Regression, Support Vector Regression, Ridge Regressor, Stochastic Gradient Boosting, XGBoost, Decision Tree, Random Forest Regressor, Multiple Linear Regression, and k-Nearest Neighbors. A medical insurance cost dataset is acquired from the Kaggle repository for this purpose, and machine learning methods are used to show how different regression models can forecast insurance costs and to compare the models’ accuracy. The results show that the Stochastic Gradient Boosting (SGB) model outperforms the others, with a cross-validation value of 0.858 and an RMSE value of 0.340, and gives 86% accuracy.
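
For readers unfamiliar with the method these abstracts share, the following is a minimal sketch of stochastic gradient boosting for regression with squared-error loss: each round fits a small tree to the current residuals using only a random subsample drawn without replacement, then adds a shrunken copy of that tree to the model. It is not the implementation used in any of the listed studies. Depth-one trees (stumps) are used only to keep the example short, and all names and default parameter values are assumptions.

```python
import random


def fit_stump(X, residuals):
    """Least-squares regression stump: returns (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no feature varies in this subsample: predict a constant
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_sgb(X, y, n_rounds=100, learning_rate=0.1, subsample=0.5, seed=0):
    """Fit stumps to residuals on random subsamples; returns (f0, rate, stumps)."""
    rng = random.Random(seed)
    f0 = sum(y) / len(y)                      # initial constant model
    pred = [f0] * len(y)
    stumps = []
    m = min(len(y), max(2, int(subsample * len(y))))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(y)), m)    # the "stochastic" step
        stump = fit_stump([X[i] for i in idx], [y[i] - pred[i] for i in idx])
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, row)
                for p, row in zip(pred, X)]
    return f0, learning_rate, stumps


def predict_sgb(model, row):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, row) for s in stumps)
```

Setting `subsample=1.0` recovers ordinary (deterministic) gradient boosting; values below 1 add randomness that often reduces overfitting and also cuts the cost of each round.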


Author(s):  
Donald Douglas Atsa'am ◽  
Ruth Wario

The coronavirus disease-2019 (COVID-19) pandemic is an ongoing concern that requires research in all disciplines to tame its spread. Nine classification algorithms were evaluated to determine which is most appropriate for predicting the prevalent COVID-19 transmission mode in a geographic area. These include multinomial logistic regression, k-nearest neighbour, support vector machines, linear discriminant analysis, naïve Bayes, C5.0, bagged classification and regression trees, random forest, and stochastic gradient boosting. Five COVID-19 datasets were employed for classification. Predictive accuracy was determined using 10-fold cross validation with three repeats. Friedman’s test was conducted and the outcome showed that the performance of the algorithms differed significantly. Stochastic gradient boosting yielded the highest predictive accuracy, 81%. This finding should be valuable to health informaticians, health analysts and others regarding which machine learning tool to adopt in the efforts to detect the dominant transmission mode of the virus within localities.
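
The Friedman test mentioned above compares several algorithms across several datasets by ranking the algorithms within each dataset. The sketch below computes the classical test statistic, Q = 12 / (n k (k + 1)) · Σ R_j² − 3 n (k + 1), where n is the number of datasets, k the number of algorithms, and R_j the rank sum of algorithm j. It is an illustration only: the function names are hypothetical, the correction for ties is omitted, and no p-value is computed (under the null hypothesis Q is approximately chi-square distributed with k − 1 degrees of freedom).

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # mean of ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square statistic without tie correction.

    scores[b][a] is the score (e.g. accuracy) of algorithm a on dataset b.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for a, rank in enumerate(average_ranks(row)):
            rank_sums[a] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

The statistic is 0 when every algorithm has the same rank sum and reaches its maximum, n (k − 1), when the algorithms are ranked identically on every dataset.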




2021 ◽  
Vol 12 (7) ◽  
pp. 358-372
Author(s):  
E. V. Orlova ◽  

The article considers the problem of reducing banks’ credit risks associated with the insolvency of individual borrowers, using financial and socio-economic factors and additional data about borrowers’ digital footprints. A critical analysis of existing approaches, methods and models in this area has been carried out, and a number of significant shortcomings that limit their application have been identified. There is no comprehensive approach to identifying a borrower’s creditworthiness based on information that includes data from social networks and search engines. A new methodological approach is proposed for assessing a borrower’s risk profile, based on the phased processing of quantitative and qualitative data and on modeling with methods of statistical analysis and machine learning. Machine learning methods are used to solve clustering and classification problems. They make it possible to determine the data structure automatically and to make decisions through flexible and local training on the data. Hierarchical clustering and the k-means method are used to identify similar social, anthropometric and financial indicators, as well as indicators characterizing the digital footprint of borrowers, and to determine the borrowers’ risk profile for each group. The resulting homogeneous groups of borrowers, each with a unique risk profile, are further used for detailed data analysis in the predictive classification model. The classification model is based on the stochastic gradient boosting method to predict the risk profile of a potential borrower. The suggested approach for assessing individuals’ creditworthiness will reduce banks’ credit risks and increase their stability and profitability. The implementation results are of practical importance. 
Comparative analysis of the effectiveness of the existing and the proposed methodology for assessing credit risk showed that the new methodology provides predictive analytics of heterogeneous information about a potential borrower and that the accuracy of the analytics is higher. The proposed techniques are the core of a decision support system for justifying individuals’ credit conditions and minimizing the aggregate credit risks.


2021 ◽  
pp. 289-301
Author(s):  
B. Martín ◽  
J. González–Arias ◽  
J. A. Vicente–Vírseda

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap aggregation) and also applied extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R²). The correlation between observed and predicted abundance with this model was also high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.


2021 ◽  
Vol 26 (8) ◽  
pp. 4505
Author(s):  
B. I. Geltser ◽  
V. Yu. Rublev ◽  
M. M. Tsivanyuk ◽  
K. I. Shakhgeldyan

Machine learning (ML) is among the main tools of artificial intelligence and is increasingly used in population and clinical cardiology to stratify cardiovascular risk. This systematic review analyzes the literature on using various ML methods (artificial neural networks, random forest, stochastic gradient boosting, support vector machines, etc.) to develop predictive models determining the immediate and long-term risk of adverse events after coronary artery bypass grafting and percutaneous coronary intervention. Most of the research on this issue is focused on the creation of novel forecast models with a higher predictive value. It is emphasized that improving modeling technologies and developing clinical decision support systems are among the most promising areas of healthcare digitalization and are in demand in everyday professional practice.

