Vorhersage der Fließgewässertemperaturen in österreichischen Einzugsgebieten mittels Machine Learning-Verfahren

ZusammenfassungDie Fließgewässertemperatur ist ein essenzieller Umweltfaktor, der das Potenzial hat, sowohl ökologische als auch sozio-ökonomische Rahmenbedingungen im Umfeld eines Gewässers zu verändern. Um Fließgewässertemperaturen als Grundlage für effektive Anpassungsstrategien für zukünftige Veränderungen (z. B. durch den Klimawandel) berechnen zu können, sind adäquate Modellierungskonzepte notwendig. Die vorliegende Studie untersucht hierfür 6 Machine Learning-Modelle: Schrittweise Lineare Regression, Random Forest, eXtreme Gradient Boosting, Feedforward Neural Networks und zwei Arten von Recurrent Neural Networks. Die Modelle wurden an 10 österreichischen Einzugsgebieten mit unterschiedlichen physiographischen Eigenschaften und Eingangsdatenkombinationen getestet. Die Hyperparameter der angewandten Modelle wurden mittels Bayes’scher Hyperparameteroptimierung optimiert. Um die Ergebnisse mit anderen Studien vergleichbar zu machen, wurden die Vorhersagen der 6 Machine Learning-Modelle den Ergebnissen der linearen Regression und dem häufig verwendeten und bekannten Wassertemperaturmodell air2stream gegenübergestellt.Von den 6 getesteten Modellen zeigten die Feedforward Neural Networks und das eXtreme Gradient Boosting die besten Vorhersagen in jeweils 4 von 10 Einzugsgebieten. Mit einem durchschnittlichen RMSE (Wurzel der mittleren Fehlerquadratsumme; root mean squared error) von 0,55 °C konnten die getesteten Modelle die Fließgewässertemperaturen deutlich besser prognostizieren als die lineare Regression (1,55 °C) und air2stream (0,98 °C). Generell zeigten die Ergebnisse der 6 Modelle eine sehr vergleichbare Leistung mit lediglich einer mittleren Abweichung um den Medianwert von 0,08 °C zwischen den einzelnen Modellen. Im größten untersuchten Einzugsgebiet – Donau bei Kienstock – wiesen Recurrent Neural Networks die höchste Modellgüte auf, was darauf hinweist, dass sie sich am besten eignen, wenn im Einzugsgebiet Prozesse mit langfristigen Abhängigkeiten ausschlaggebend sind. Die Wahl der Hyperparameter beeinflusste die Vorhersagefähigkeit der Modelle stark, was die Bedeutung der Hyperparameteroptimierung besonders hervorhebt.Die Ergebnisse dieser Studie fassen die Bedeutung unterschiedlicher Eingangsdaten, Modelle und Trainingscharakteristiken für die Modellierung von mittleren täglichen Fließgewässertemperaturen zusammen. Gleichzeitig dient diese Studie als Basis für die Entwicklung zukünftiger Modelle für eine regionale Fließgewässertemperaturvorhersage. Die getesteten Modelle stehen im open source R‑Paket wateRtemp allen AnwenderInnen der Forschungsgemeinschaft und der Praxis zur Verfügung.

Download Full-text

Evaluation of Three Different Machine Learning Methods for Object-Based Artificial Terrace Mapping—A Case Study of the Loess Plateau, China

Remote Sensing ◽

10.3390/rs13051021 ◽

2021 ◽

Vol 13 (5) ◽

pp. 1021

Author(s):

Hu Ding ◽

Jiaming Na ◽

Shangjing Jiang ◽

Jie Zhu ◽

Kai Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Loess Plateau ◽

Water Conservation ◽

Nearest Neighbor ◽

Gradient Boosting ◽

K Nearest Neighbor ◽

The Loess Plateau ◽

Object Based ◽

Extreme Gradient Boosting

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. As a result of the importance of the contextual information for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies are widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three different commonly-used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.

Download Full-text

Development and validation of a difficult laryngoscopy prediction model using machine learning of neck circumference and thyromental height

BMC Anesthesiology ◽

10.1186/s12871-021-01343-4 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jong Ho Kim ◽

Haewon Kim ◽

Ji Su Jang ◽

Sung Mi Hwang ◽

So Young Lim ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Confidence Interval ◽

Neck Circumference ◽

Difficult Laryngoscopy ◽

Gradient Boosting ◽

Test Set ◽

Equal Distribution ◽

Light Gradient ◽

Extreme Gradient Boosting

Abstract Background Predicting difficult airway is challengeable in patients with limited airway evaluation. The aim of this study is to develop and validate a model that predicts difficult laryngoscopy by machine learning of neck circumference and thyromental height as predictors that can be used even for patients with limited airway evaluation. Methods Variables for prediction of difficulty laryngoscopy included age, sex, height, weight, body mass index, neck circumference, and thyromental distance. Difficult laryngoscopy was defined as Grade 3 and 4 by the Cormack-Lehane classification. The preanesthesia and anesthesia data of 1677 patients who had undergone general anesthesia at a single center were collected. The data set was randomly stratified into a training set (80%) and a test set (20%), with equal distribution of difficulty laryngoscopy. The training data sets were trained with five algorithms (logistic regression, multilayer perceptron, random forest, extreme gradient boosting, and light gradient boosting machine). The prediction models were validated through a test set. Results The model’s performance using random forest was best (area under receiver operating characteristic curve = 0.79 [95% confidence interval: 0.72–0.86], area under precision-recall curve = 0.32 [95% confidence interval: 0.27–0.37]). Conclusions Machine learning can predict difficult laryngoscopy through a combination of several predictors including neck circumference and thyromental height. The performance of the model can be improved with more data, a new variable and combination of models.

Download Full-text

Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest –enhanced form of bootstrap. We also performed extreme gradient boosting –an enhanced form of radiant boosting– to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.

Download Full-text

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detect malware like packet content analysis are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers in order to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyper parameters of the algorithms, in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.

Download Full-text

Combined Multilateration with Machine Learning for Enhanced Aircraft Localization

Proceedings ◽

10.3390/proceedings2020059002 ◽

2020 ◽

Vol 59 (1) ◽

pp. 2

Author(s):

Benoit Figuet ◽

Raphael Monstein ◽

Michael Felux

Keyword(s):

Machine Learning ◽

Sensitivity Analysis ◽

Real World ◽

Mean Squared Error ◽

Accurate Estimate ◽

Regression Technique ◽

Gradient Boosting ◽

Root Mean Squared Error ◽

Squared Error ◽

Using Data

In this paper, we present an aircraft localization solution developed in the context of the Aircraft Localization Competition and applied to the OpenSky Network real-world ADS-B data. The developed solution is based on a combination of machine learning and multilateration using data provided by time synchronized ground receivers. A gradient boosting regression technique is used to obtain an estimate of the geometric altitude of the aircraft, as well as a first guess of the 2D aircraft position. Then, a triplet-wise and an all-in-view multilateration technique are implemented to obtain an accurate estimate of the aircraft latitude and longitude. A sensitivity analysis of the accuracy as a function of the number of receivers is conducted and used to optimize the proposed solution. The obtained predictions have an accuracy below 25 m for the 2D root mean squared error and below 35 m for the geometric altitude.

Download Full-text

Random forest and extreme gradient boosting algorithms for streamflow modeling using vessel features, and tree-rings

10.21203/rs.3.rs-303081/v1 ◽

2021 ◽

Author(s):

Hossein Sahour ◽

Vahid Gholami ◽

Javad Torkman ◽

Mehdi Vazifedan ◽

Sirwe Saeedi

Keyword(s):

Machine Learning ◽

Random Forest ◽

Tree Rings ◽

Test Site ◽

Machine Learning Algorithms ◽

Coefficient Of Determination ◽

Gradient Boosting ◽

Growing Seasons ◽

Extreme Gradient Boosting ◽

Streamflow Modeling

Abstract Monitoring temporal variation of streamflow is necessary for many water resources management plans, yet, such practices are constrained by the absence or paucity of data in many rivers around the world. Using a permanent river in the north of Iran as a test site, a machine learning framework was proposed to model the streamflow data in the three periods of growing seasons based on tree-rings and vessel features of the Zelkova carpinifolia species. First, full-disc samples were taken from 30 trees near the river, and the samples went through preprocessing, cross-dating, standardization, and time series analysis. Two machine learning algorithms, namely random forest (RF) and extreme gradient boosting (XGB), were used to model the relationships between dendrochronology variables (tree-rings and vessel features in the three periods of growing seasons) and the corresponding streamflow rates. The performance of each model was evaluated using statistical coefficients (coefficient of determination (R-squared), Nash-Sutcliffe efficiency (NSE), and root-mean-square error (NRMSE)). Findings demonstrate that consideration should be given to the XGB model in streamflow modeling given its apparent enhanced performance (R-squared: 0.87; NSE: 0.81; and NRMSE: 0.43) over the RF model (R-squared: 0.82; NSE: 0.71; and NRMSE: 0.52). Further, the results showed that the models perform better in modeling the normal and low flows compared to extremely high flows. Finally, the tested models were used to reconstruct the temporal streamflow during the past decades (1970–1981).

Download Full-text

MRI-Based Machine Learning in Differentiation Between Benign and Malignant Breast Lesions

Frontiers in Oncology ◽

10.3389/fonc.2021.552634 ◽

2021 ◽

Vol 11 ◽

Author(s):

Yanjie Zhao ◽

Rong Chen ◽

Ting Zhang ◽

Chaoyue Chen ◽

Muhetaer Muhelisa ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Decision Tree ◽

Texture Analysis ◽

Training Group ◽

Gradient Boosting ◽

Breast Lesions ◽

Extreme Gradient Boosting ◽

Malignant Breast ◽

Benign Breast Lesions

BackgroundDifferential diagnosis between benign and malignant breast lesions is of crucial importance relating to follow-up treatment. Recent development in texture analysis and machine learning may lead to a new solution to this problem.MethodThis current study enrolled a total number of 265 patients (benign breast lesions:malignant breast lesions = 71:194) diagnosed in our hospital and received magnetic resonance imaging between January 2014 and August 2017. Patients were randomly divided into the training group and validation group (4:1), and two radiologists extracted their texture features from the contrast-enhanced T1-weighted images. We performed five different feature selection methods including Distance correlation, Gradient Boosting Decision Tree (GBDT), least absolute shrinkage and selection operator (LASSO), random forest (RF), eXtreme gradient boosting (Xgboost) and five independent classification models were built based on Linear discriminant analysis (LDA) algorithm.ResultsAll five models showed promising results to discriminate malignant breast lesions from benign breast lesions, and the areas under the curve (AUCs) of receiver operating characteristic (ROC) were all above 0.830 in both training and validation groups. The model with a better discriminating ability was the combination of LDA + gradient boosting decision tree (GBDT). The sensitivity, specificity, AUC, and accuracy in the training group were 0.814, 0.883, 0.922, and 0.868, respectively; LDA + random forest (RF) also suggests promising results with the AUC of 0.906 in the training group.ConclusionThe evidence of this study, while preliminary, suggested that a combination of MRI texture analysis and LDA algorithm could discriminate benign breast lesions from malignant breast lesions. Further multicenter researches in this field would be of great help in the validation of the result.

Download Full-text

Estimating Time of Driver Arrival with Gradient Boosting Algorithms and Deep Neural Networks

Mathematical Problems of Computer Science ◽

10.51408/1963-0050 ◽

2020 ◽

pp. 29-38

Author(s):

Henrik Sergoyan

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Mean Squared Error ◽

Service Providers ◽

Gradient Boosting ◽

Time Of Arrival ◽

Squared Error ◽

Transportation Service ◽

Boosting Algorithms ◽

Made In

Customer experience and resource management determine the degree to which transportation service providers can compete in today’s heavily saturated markets. The paper investigates and suggests a new methodology to optimize calculations for Estimated Time of Arrival (from now on ETA, meaning the time it will take for the driver to reach the designated location) based on the data provided by GG collected from rides made in 2018. GG is a transportation service providing company, and it currently uses The Open Source Routing Machine (OSRM) which exhibits significant errors in the prediction phase. This paper shows that implementing algorithms such as XGBoost, CatBoost, and Neural Networks for the said task will improve the accuracy of estimation. Paper discusses the benefits and drawbacks of each model and then considers the performance of the stacking algorithm that combines several models into one. Thus, using those techniques, final results showed that Mean Squared Error (MSE) was decreased by 54% compared to the current GG model.

Download Full-text

Modelos de machine learning para predição do sucesso de startups

Revista de Gestão e Projetos ◽

10.5585/gep.v12i2.18942 ◽

2021 ◽

Vol 12 (2) ◽

pp. 28-55

Author(s):

Fabiano Rodrigues ◽

Francisco Aparecido Rodrigues ◽

Thelma Valéria Rocha Rodrigues

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Initial Public Offering ◽

Gradient Boosting ◽

Support Vector ◽

Trade Offs ◽

Extreme Gradient Boosting ◽

Public Offering

Este estudo analisa resultados obtidos com modelos de machine learning para predição do sucesso de startups. Como proxy de sucesso considera-se a perspectiva do investidor, na qual a aquisição da startup ou realização de IPO (Initial Public Offering) são formas de recuperação do investimento. A revisão da literatura aborda startups e veículos de financiamento, estudos anteriores sobre predição do sucesso de startups via modelos de machine learning, e trade-offs entre técnicas de machine learning. Na parte empírica, foi realizada uma pesquisa quantitativa baseada em dados secundários oriundos da plataforma americana Crunchbase, com startups de 171 países. O design de pesquisa estabeleceu como filtro startups fundadas entre junho/2010 e junho/2015, e uma janela de predição entre junho/2015 e junho/2020 para prever o sucesso das startups. A amostra utilizada, após etapa de pré-processamento dos dados, foi de 18.571 startups. Foram utilizados seis modelos de classificação binária para a predição: Regressão Logística, Decision Tree, Random Forest, Extreme Gradiente Boosting, Support Vector Machine e Rede Neural. Ao final, os modelos Random Forest e Extreme Gradient Boosting apresentaram os melhores desempenhos na tarefa de classificação. Este artigo, envolvendo machine learning e startups, contribui para áreas de pesquisa híbridas ao mesclar os campos da Administração e Ciência de Dados. Além disso, contribui para investidores com uma ferramenta de mapeamento inicial de startups na busca de targets com maior probabilidade de sucesso.

Download Full-text

Comparison of the Validity and Generalizability of Machine Learning Algorithms for the Prediction of Energy Expenditure: Validation Study

JMIR mhealth and uhealth ◽

10.2196/23938 ◽

2021 ◽

Vol 9 (8) ◽

pp. e23938

Author(s):

Ruairi O'Driscoll ◽

Jake Turicchi ◽

Mark Hopkins ◽

Cristiana Duarte ◽

Graham W Horgan ◽

...

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Neural Networks ◽

Random Forest ◽

Energy Expenditure ◽

Superior Performance ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Out Of Sample

Background Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise in research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age=31.9 years, weight=70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure using indirect calorimetry. Three regression algorithms were used to predict metabolic equivalents (METs; ie, random forest, gradient boosting, and neural networks), and five classification algorithms (ie, k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used for physical activity intensity classification as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boost applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification. Errors tended to increase in out-of-sample validations with the SenseWear neural network achieving RMSE values of 1.22 METs in the regression tasks and the SenseWear gradient boost and random forest achieving an accuracy of 80% in classification tasks. Conclusions Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.

Download Full-text