Testing different machine learning techniques for runoff routing in a highly glacierized Djankuat river basin (the North Caucasus, Russia).

Gradient Boosting ◽

Water Runoff ◽

North Caucasus ◽

Time Step ◽

The North ◽

Learning Techniques ◽

Runoff Routing

Distributed, physically based modelling of runoff routing in highly glacierized river basins is an extremely complicated task, because glacier drainage systems function in a very sophisticated way: they resemble karst river systems but also develop dynamically within very short time periods. Accordingly, runoff routing of glacier meltwater is most often based on the concept of linear storage. The number of reservoirs generally varies from 1 to 3. For example, the GERM model uses one ‘fast’ reservoir for melt and rain in glacierized grid cells, while the GSM-SOCONT model uses three different parallel linear reservoirs representing snow, firn and ice. Here we test the applicability of different machine learning techniques (gradient boosting, random forest, LSTM) for runoff routing in a highly glacierized river basin. We use data from the Djankuat alpine research catchment in the North Caucasus (Russia) for the period 2007-2019. The dataset contains different parameters measured with an hourly or sub-daily time step: water runoff, conductivity, turbidity, temperature, and 18O and D content at the main gauging station, as well as measurements of precipitation amount, standard meteorological parameters and radiation fluxes. Results of snow and ice melt modelling in the Djankuat river basin on a regular grid with an hourly time step, obtained with the energy-balance distributed A-Melt model, are also used as input data. Total runoff from the Djankuat river basin (1) and meltwater runoff according to isotopic hydrograph separation (2) were chosen as target functions. Different sets of features to predict the target functions were generated from the original time series using different combinations of the input parameters as well as variable lag times.
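The linear-storage concept and the lag-based feature generation described above can be sketched as follows. This is a minimal illustration, not the GERM, GSM-SOCONT or A-Melt implementation: the function names, the storage constants and the melt values are all hypothetical, and the reservoir update assumes inflow is constant within each time step.

```python
import math


def route_linear_reservoir(inflow, k_hours, dt_hours=1.0, q0=0.0):
    """Route an inflow series through one linear reservoir (outflow Q = S / k).

    Uses the exact solution of dS/dt = I - S/k for inflow held constant
    within each time step.
    """
    decay = math.exp(-dt_hours / k_hours)
    q = q0
    outflow = []
    for i in inflow:
        q = q * decay + i * (1.0 - decay)
        outflow.append(q)
    return outflow


def route_parallel(inflow_by_source, k_by_source, dt_hours=1.0):
    """Sum the outflows of parallel reservoirs, one per meltwater source."""
    routed = [
        route_linear_reservoir(series, k_by_source[name], dt_hours)
        for name, series in inflow_by_source.items()
    ]
    return [sum(step) for step in zip(*routed)]


def make_lagged_features(series_by_name, lags):
    """Build one feature row per time step from lagged copies of each series."""
    start = max(lags)  # first time step for which every lag is available
    n = min(len(s) for s in series_by_name.values())
    rows = []
    for t in range(start, n):
        rows.append({
            f"{name}_lag{lag}": series[t - lag]
            for name, series in series_by_name.items()
            for lag in lags
        })
    return rows, start


# Hypothetical hourly melt inputs (mm/h) and storage constants (hours).
melt = {
    "snow": [0, 2, 4, 2, 0, 0],
    "firn": [0, 1, 2, 1, 0, 0],
    "ice": [0, 3, 6, 3, 0, 0],
}
k = {"snow": 30.0, "firn": 100.0, "ice": 5.0}

baseline = route_parallel(melt, k)
features, first_t = make_lagged_features({"melt_ice": melt["ice"]}, lags=[1, 2])
```

In a setup like this, the routed series would serve as a conceptual baseline, while the lagged rows would be the inputs handed to a gradient boosting or random forest regressor.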
To score the different machine learning techniques and feature sets for predicting the target functions, we use the correlation coefficient, the Nash-Sutcliffe efficiency index (NSE), and the root mean square error (RMSE). The study was supported by the Russian Foundation for Basic Research, grant No. 20-35-70024.
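The three scores named above have standard definitions, sketched here in plain Python. The observed and simulated values are hypothetical and chosen so that the simulation has a constant bias; they are not data from the study.

```python
import math


def rmse(obs, sim):
    """Root mean square error between observed and simulated values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical runoff values: the simulation is shifted by a constant 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.5, 2.5, 3.5, 4.5]

scores = {
    "r": pearson_r(observed, simulated),   # 1.0: correlation ignores the bias
    "nse": nse(observed, simulated),       # 0.8: NSE penalizes it
    "rmse": rmse(observed, simulated),     # 0.5
}
```

The example shows why the scores are reported together: a constant bias leaves the correlation coefficient at 1 while lowering NSE and raising RMSE.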

Feasibility of Machine Learning Algorithms for Predicting the Deformation of Anodic Titanium Films by Modulating Anodization Processes

Materials ◽

10.3390/ma14051089 ◽

2021 ◽

Vol 14 (5) ◽

pp. 1089

Author(s):

Sung-Hee Kim ◽

Chanyoung Jeong

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Multiclass Classification ◽

Machine Learning Algorithms ◽

Smart Manufacturing ◽

Gradient Boosting ◽

Experimental Conditions ◽

Learning Techniques ◽

TiO2 Nanostructures

This study aims to demonstrate the feasibility of applying eight machine learning algorithms to predict the classification of the surface characteristics of titanium oxide (TiO2) nanostructures with different anodization processes. We produced a total of 100 samples, and we assessed changes in TiO2 nanostructures’ thicknesses by performing anodization. We successfully grew TiO2 films with different thicknesses by one-step anodization in ethylene glycol containing NH4F and H2O at applied voltage differences ranging from 10 V to 100 V at various anodization durations. We found that the thicknesses of TiO2 nanostructures are dependent on anodization voltages under time differences. Therefore, we tested the feasibility of applying machine learning algorithms to predict the deformation of TiO2. As the characteristics of TiO2 changed based on the different experimental conditions, we classified its surface pore structure into two categories and four groups. For the classification based on granularity, we assessed layer creation, roughness, pore creation, and pore height. We applied eight machine learning techniques to predict classification for binary and multiclass classification. For binary classification, the random forest and gradient boosting algorithms had relatively high performance. However, all eight algorithms had scores higher than 0.93, which signifies high predictive performance in estimating the presence of pores. In contrast, decision tree and three ensemble methods had a relatively higher performance for multiclass classification, with an accuracy rate greater than 0.79. The weakest algorithm used was k-nearest neighbors for both binary and multiclass classifications. We believe that these results show that we can apply machine learning techniques to predict surface quality improvement, leading to smart manufacturing technology to better control color appearance, super-hydrophobicity, super-hydrophilicity or battery efficiency.

Predictors of outpatients’ no-show: big data analytics using apache spark

Journal Of Big Data ◽

10.1186/s40537-020-00384-9 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Tahani Daghistani ◽

Huda AlGhamdi ◽

Riyad Alshammari ◽

Raed H. AlHazme

Keyword(s):

Machine Learning ◽

Big Data ◽

Negative Impact ◽

Big Data Analytics ◽

Quality Of Healthcare ◽

Gradient Boosting ◽

Healthcare Organizations ◽

Data Framework ◽

Abstract. Outpatients who fail to attend their appointments have a negative impact on healthcare outcomes. Healthcare organizations thus face new opportunities, one of which is to improve the quality of healthcare. The main challenge is predictive analysis using techniques capable of handling the huge volume of data generated. We propose a big data framework for identifying outpatients’ no-shows via feature engineering and machine learning (MLlib) on the Spark platform. This study evaluates the performance of five machine learning techniques using data on 2,011,813 outpatient visits. Across several experiments and different validation methods, Gradient Boosting (GB) performed best, resulting in an increase of accuracy and ROC to 79% and 81%, respectively. In addition, we show that exploring and evaluating the performance of machine learning models using various evaluation methods is critical, as the accuracy of prediction can differ significantly. The aim of this paper is to explore factors that affect the no-show rate and can be used to formulate predictions using big data machine learning techniques.
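To make the two reported scores concrete, the sketch below computes accuracy at a fixed threshold and the area under the ROC curve from predicted no-show probabilities. It is a plain-Python illustration on four hypothetical visits, not the paper's Spark MLlib pipeline, and the pairwise AUC loop is only practical for small samples.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative) pairs
    in which the positive case receives the higher score; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


# Hypothetical data: 1 = no-show, 0 = attended; scores are model probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
acc = accuracy(labels, scores)
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason the two can disagree between validation setups.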

An integration of geospatial and machine learning techniques for mapping groundwater potential: a case study of the Shipra river basin, India

Arabian Journal of Geosciences ◽

10.1007/s12517-021-07871-0 ◽

2021 ◽

Vol 14 (16) ◽

Author(s):

Ruchir Patidar ◽

Santosh Murlidhar Pingale ◽

Deepak Khare

Keyword(s):

Machine Learning ◽

River Basin ◽

Groundwater Potential ◽

Djankuat glacier station in the North Caucasus, Russia: a database of glaciological, hydrological, and meteorological observations and stable isotope sampling results during 2007–2017

Earth System Science Data ◽

10.5194/essd-11-1463-2019 ◽

2019 ◽

Vol 11 (3) ◽

pp. 1463-1481 ◽

Cited By ~ 4

Author(s):

Ekaterina P. Rets ◽

Viktor V. Popovnin ◽

Pavel A. Toropov ◽

Andrew M. Smirnov ◽

Igor V. Tokarev ◽

...

Keyword(s):

Wind Speed ◽

Precipitation Amount ◽

Turbulent Fluxes ◽

Water Runoff ◽

North Caucasus ◽

Time Step ◽

Glacier Surface ◽

The North ◽

Meteorological Observations ◽

The Impact

Abstract. This study presents a dataset on long-term multidisciplinary glaciological, hydrological, and meteorological observations and isotope sampling in a sparsely monitored alpine zone of the North Caucasus in the Djankuat research basin. The Djankuat glacier, which is the largest in the basin, was chosen as representative of the central North Caucasus during the International Hydrological Decade and is one of 30 “reference” glaciers in the world that have annual mass balance series longer than 50 years (Zemp et al., 2009). The dataset features a comprehensive set of observations from 2007 to 2017 and contains yearly measurements of snow depth and density; measurements of dynamics of snow and ice melting; measurements of water runoff, conductivity, turbidity, temperature, δ18O, δD at the main gauging station (844 samples in total) with an hourly or sub-daily time step depending on the parameter; data on δ18O and δ2H sampling of liquid precipitation, snow, ice, firn, and groundwater in different parts of the watershed taken regularly during melting season (485 samples in total); measurements of precipitation amount, air temperature, relative humidity, shortwave incoming and reflected radiation, longwave downward and upward radiation, atmospheric pressure, and wind speed and direction – measured at several automatic weather stations within the basin with 15 min to 1 h time steps; gradient meteorological measurements to estimate turbulent fluxes of heat and moisture, measuring three components of wind speed at a frequency of 10 Hz to estimate the impulse of turbulent fluxes of sensible and latent heat over the glacier surface by the eddy covariance method. Data were collected during the ablation period (June–September). The observations were halted in winter. The dataset is available from PANGAEA (https://doi.org/10.1594/PANGAEA.894807, Rets et al., 2018a) and will be further updated. 
The dataset can be useful for developing and verifying hydrological, glaciological, and meteorological models for alpine areas, and for studying the impact of climate change on the hydrology of mountain regions using isotopic and hydrochemical approaches. As the dataset includes measurements of hydrometeorological and glaciological variables during the catastrophic proglacial lake outburst in the neighboring Bashkara valley in September 2017, it is a valuable contribution to the study of lake outbursts.

Learning from Imbalanced Educational Data Using Ensemble Machine Learning Algorithms

Webology ◽

10.14704/web/v18si01/web18053 ◽

2021 ◽

Vol 18 (Special Issue 01) ◽

pp. 183-195

Author(s):

Thingbaijam Lenin ◽

N. Chandrasekaran

Keyword(s):

Machine Learning ◽

Random Forest ◽

Missing Values ◽

Gradient Boosting ◽

Adaptive Boosting ◽

Stochastic Gradient Boosting ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Student’s Performance

Student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of the course. Machine learning techniques may be used to develop a model for predicting student’s performance as early as the time of admission. The task, however, is challenging, as the educational data required for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms such as adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting student’s performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are found to be highly imbalanced and also contain missing values. We employ the k-nearest neighbor (knn) data imputation technique to handle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the model and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best result is provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
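The gap between accuracy and the other two scores can be illustrated with a short calculation. The confusion-matrix counts below are hypothetical: the abstract does not report them, and they are chosen only because they happen to reproduce the three reported percentages.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, balanced accuracy and F-score from binary confusion counts.

    No guards against empty classes: assumes tp + fn, tn + fp and tp + fp
    are all non-zero.
    """
    recall = tp / (tp + fn)          # share of at-risk students found
    specificity = tn / (tn + fp)     # share of other students correctly cleared
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "f_score": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 6 at-risk students among 98, half of them detected.
m = classification_metrics(tp=3, fn=3, fp=0, tn=92)
```

With these counts, accuracy is about 96.94% even though half of the minority class is missed, while balanced accuracy (75%) and F-score (about 66.67%) expose the shortfall.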

Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting Machine learning techniques

Journal Of Big Data ◽

10.1186/s40537-020-00345-2 ◽

2020 ◽

Vol 7 (1) ◽

Cited By ~ 1

Author(s):

Samiul Islam ◽

Saman Hassanzadeh Amin

Keyword(s):

Machine Learning ◽

Supply Chain ◽

Random Forest ◽

Gradient Boosting ◽

Learning Techniques ◽

Gradient Boosting Machine

Estimating Warehouse Rental Price using Machine Learning Techniques

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2018.2.3034 ◽

2018 ◽

Vol 13 (2) ◽

pp. 235-250 ◽

Cited By ~ 3

Author(s):

Yixuan Ma ◽

Zhenji Zhang ◽

Alexander Ihler ◽

Baoxiang Pan

Keyword(s):

Machine Learning ◽

Random Forest ◽

Real Estate ◽

Rapid Development ◽

Supply And Demand ◽

Gradient Boosting ◽

Logistics Industry ◽

Real Estate Price ◽

Boosted by the growing logistics industry and digital transformation, the shared warehouse market is undergoing rapid development. Both the supply and demand sides of the warehouse rental business face market perturbations brought about by unprecedented peer competition and information transparency. A key question faced by the participants is how to price warehouses in the open market. To understand the pricing mechanism, we built a real-world warehouse dataset using data collected from classified advertisement websites. Based on the dataset, we applied machine learning techniques to relate warehouse price to relevant features such as warehouse size, location and nearby real estate price. Four candidate models are used here: Linear Regression, Regression Tree, Random Forest Regression and Gradient Boosting Regression Trees. The case study in the Beijing area shows that warehouse rent is closely related to location and land price. Models considering multiple factors have better skill in estimating warehouse rent than single-factor estimation. Additionally, tree models perform better than the linear model, with the best model (Random Forest) achieving a correlation coefficient of 0.57 on the test set. Deeper investigation of feature importance shows that distance from the city center plays the most important role in determining warehouse price in Beijing, followed by nearby real estate price and warehouse size.

Machine learning techniques to predict daily rainfall amount

Journal Of Big Data ◽

10.1186/s40537-021-00545-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Chalachew Muluken Liyew ◽

Haileyesus Amsaya Melese

Keyword(s):

Machine Learning ◽

Pearson Correlation ◽

Daily Rainfall ◽

Learning Model ◽

Gradient Boosting ◽

Correlation Technique ◽

Learning Techniques ◽

Machine Learning Model ◽

Extreme Gradient Boosting

Abstract. Predicting the amount of daily rainfall improves agricultural productivity and secures food and water supply to keep citizens healthy. To predict rainfall, several studies have been conducted using data mining and machine learning techniques on environmental datasets from different countries. An erratic rainfall distribution in the country affects the agriculture on which the country’s economy depends. Wise use of rainwater should be planned and practiced in the country to minimize the problems of drought and flooding. The main objective of this study is to identify the relevant atmospheric features that cause rainfall and to predict the intensity of daily rainfall using machine learning techniques. The Pearson correlation technique was used to select relevant environmental variables, which were used as input for the machine learning model. The dataset was collected from the local meteorological office at Bahir Dar City, Ethiopia, to measure the performance of three machine learning techniques (Multivariate Linear Regression, Random Forest, and Extreme Gradient Boosting). Root mean squared error and mean absolute error were used to measure the performance of the machine learning models. The results of the study revealed that the Extreme Gradient Boosting machine learning algorithm performed better than the others.
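Feature selection by Pearson correlation, as described above, can be sketched as a simple filter. The variable names, values and the 0.5 threshold below are hypothetical; the abstract does not state which variables or cut-off the authors used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(features, target, threshold):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


# Hypothetical daily observations and rainfall amounts (mm).
features = {
    "relative_humidity": [60, 70, 80, 90, 95],
    "wind_direction": [3, 1, 2, 1, 3],
}
rainfall = [0, 2, 5, 9, 12]

selected = select_features(features, rainfall, threshold=0.5)
```

A filter of this kind only captures linear association with the target, so a variable with a strong nonlinear effect could be dropped; the selected variables would then feed the regression models being compared.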

APPLICATION OF MACHINE LEARNING TECHNIQUES TO PREDICT NITROGEN LEVELS IN THE MISSISSIPPI RIVER BASIN

10.1130/abs/2018am-321752 ◽

2018 ◽

Author(s):

Srinivas Anand ◽

Adam S. Ward

Keyword(s):

Machine Learning ◽

River Basin ◽

Mississippi River ◽

Mississippi River Basin ◽

Nitrogen Levels ◽

Development and Application of a Genetic Algorithm for Variable Optimization and Predictive Modeling of Five-Year Mortality Using Questionnaire Data

Bioinformatics and Biology Insights ◽

10.4137/bbi.s29469 ◽

2015 ◽

Vol 9s3 ◽

pp. BBI.S29469 ◽

Cited By ~ 6

Author(s):

Lucas J. Adams ◽

Ghalib Bello ◽

Gerard G. Dumancas

Keyword(s):

Machine Learning ◽

Genetic Algorithm ◽

Predictive Modeling ◽

Gradient Boosting ◽

Questionnaire Data ◽

Special Equipment ◽

Learning Techniques ◽

Specific Outcome ◽

Performance Area

The problem of selecting important variables for predictive modeling of a specific outcome of interest using questionnaire data has rarely been addressed in clinical settings. In this study, we implemented a genetic algorithm (GA) technique to select optimal variables from questionnaire data for predicting five-year mortality. We examined 123 questions (variables) answered by 5,444 individuals in the National Health and Nutrition Examination Survey. The GA iterations selected the top 24 variables, including questions related to stroke, emphysema, and general health problems requiring the use of special equipment, for use in predictive modeling by various parametric and nonparametric machine learning techniques. Using these top 24 variables, gradient boosting yielded the nominally highest performance (area under curve [AUC] = 0.7654), although other techniques had lower but not significantly different AUCs. This study shows how GA in conjunction with various machine learning techniques could be used to examine questionnaire data to predict a binary outcome.
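A genetic algorithm for variable selection can be sketched as a search over on/off masks, one bit per candidate variable. The operators below (tournament selection, single-point crossover, bit-flip mutation, elitism) and all parameter values are assumptions for illustration, since the abstract does not describe the authors' GA configuration. The fitness function is a toy stand-in; in a real study it would presumably be a cross-validated model score.

```python
import random


def ga_select(n_vars, fitness, pop_size=30, generations=40,
              mutation_rate=0.05, seed=0):
    """Search for a boolean variable mask that maximizes `fitness`.

    Returns the best mask found and the best fitness seen per generation.
    Requires n_vars >= 2.
    """
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        next_pop = [best[:]]  # elitism: the best mask survives unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_vars)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy fitness (hypothetical): reward picking three "informative" variables,
# with a small penalty per selected variable to favour compact subsets.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(1 for i, on in enumerate(mask) if on and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask, history = ga_select(n_vars=10, fitness=toy_fitness)
```

Because the best mask is carried over unchanged each generation, the recorded best fitness never decreases; whether the search reaches the true optimum depends on the population size, mutation rate and number of generations.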