Kriging-Based Land-Use Regression Models That Use Machine Learning Algorithms to Estimate the Monthly BTEX Concentration

Author(s):  
Chin-Yu Hsu ◽  
Yu-Ting Zeng ◽  
Yu-Cheng Chen ◽  
Mu-Jean Chen ◽  
Shih-Chun Candice Lung ◽  
...  

This paper uses machine learning to refine a land-use regression (LUR) model and to estimate the spatial–temporal variation in BTEX concentrations in Kaohsiung, Taiwan. Using Taiwan Environmental Protection Agency (EPA) data on BTEX (benzene, toluene, ethylbenzene, and xylenes) concentrations from 2015 to 2018, together with local emission sources reflecting Asian cultural characteristics, a new LUR model is developed. The 2019 data were then used as external data to verify the reliability of the model. We used hybrid Kriging-land-use regression (Hybrid Kriging-LUR) models, geographically weighted regression (GWR), and two machine learning algorithms—random forest (RF) and extreme gradient boosting (XGBoost)—for model development. Initially, the proposed Hybrid Kriging-LUR models explained 37% to 52% of the variation in the individual BTEX species. Incorporating a machine learning algorithm (XGBoost) increased the explanatory power of the models to between 61% and 79% for each BTEX species. This study compared each combination of the Hybrid Kriging-LUR model with (i) GWR, (ii) RF, and (iii) the XGBoost algorithm for estimating the spatiotemporal variation in BTEX concentrations, and shows that the combination of Hybrid Kriging-LUR and the XGBoost algorithm gives better performance than the other integrated methods.
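To make the modeling workflow concrete, the sketch below (not the authors' code) fits an XGBoost regressor on a hypothetical table of land-use predictors plus a kriged benzene surface; the file name and column names are assumptions for illustration only.

```python
# Minimal sketch of a Hybrid Kriging-LUR + XGBoost regression.
# File name and column names are hypothetical, not from the paper.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

df = pd.read_csv("kaohsiung_btex_monthly.csv")            # assumed file
features = ["road_density", "industrial_area", "temple_density",
            "kriged_benzene"]                              # assumed predictors
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["benzene"], test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
model.fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```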

Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. The machine learning algorithms are trained on the training set, and the resulting models are evaluated against the testing set to assess their respective performances. We further tune the hyperparameters of the algorithms to achieve better results. The random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, yielding area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems.

Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
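The classification pipeline described above can be illustrated with a short sketch: feature selection to reduce dimensionality, a train/test split, and AUC evaluation for random forest and XGBoost. The dataset file and label column are assumed, and SelectKBest stands in for whatever selection procedure the authors actually used.

```python
# Hedged sketch of a metadata-based malware classification pipeline.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("encrypted_traffic_flows.csv")            # assumed dataset
X, y = df.drop(columns=["label"]), df["label"]              # label: 1 = malicious

# Dimensionality reduction: keep the k most informative metadata features.
selector = SelectKBest(mutual_info_classif, k=20)
X_sel = selector.fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.3, stratify=y, random_state=0)

for name, clf in [("Random forest", RandomForestClassifier(n_estimators=300)),
                  ("XGBoost", XGBClassifier(n_estimators=300, learning_rate=0.1))]:
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")
```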


2021 ◽  
pp. 1-29
Author(s):  
Fikrewold H. Bitew ◽  
Corey S. Sparks ◽  
Samuel H. Nyarko

Abstract Objective: Child undernutrition is a global public health problem with serious implications. In this study, we develop predictive models for the determinants of childhood stunting using various machine learning (ML) algorithms. Design: This study draws on data from the Ethiopian Demographic and Health Survey of 2016. Five machine learning algorithms, namely eXtreme gradient boosting (xgbTree), k-nearest neighbors (K-NN), random forest (RF), neural network (NNet), and the generalized linear model (GLM), were considered to predict the socio-demographic risk factors for undernutrition in Ethiopia. Setting: Households in Ethiopia. Participants: A total of 9,471 children below five years of age. Results: The descriptive results show substantial regional variations in child stunting, wasting, and underweight in Ethiopia. Among the five ML algorithms, the xgbTree algorithm shows better prediction ability than the GLM algorithm. The best-predicting algorithm (xgbTree) identifies diverse important predictors of undernutrition across the three outcomes, which include time to water source, anemia history, child age greater than 30 months, small birth size, and maternal underweight, among others. Conclusions: The xgbTree algorithm was a reasonably superior ML algorithm for predicting childhood undernutrition in Ethiopia compared to the other ML algorithms considered in this study. The findings support improvements in access to water supply, food security, and fertility regulation, among others, in the quest to considerably improve childhood nutrition in Ethiopia.
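The paper's xgbTree refers to caret's XGBoost wrapper in R; a rough Python analogue is sketched below, with assumed file and column names, showing cross-validated classification of stunting and inspection of variable importance.

```python
# Rough Python analogue of the xgbTree workflow; columns are hypothetical.
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

dhs = pd.read_csv("ethiopia_dhs_2016_children.csv")          # assumed file
predictors = ["time_to_water", "anemia_history", "child_age_months",
              "birth_size_small", "maternal_bmi"]             # assumed columns
X, y = dhs[predictors], dhs["stunted"]                        # 1 = stunted

model = XGBClassifier(n_estimators=400, max_depth=4, learning_rate=0.05)
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Variable importance, analogous to caret's varImp on the fitted model.
model.fit(X, y)
for name, score in zip(predictors, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```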


2020 ◽  
Vol 9 (9) ◽  
pp. 507
Author(s):  
Sanjiwana Arjasakusuma ◽  
Sandiaga Swahyu Kusuma ◽  
Stuart Phinn

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data with high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) were evaluated in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with tree (XGBtree and XGBdart) and linear (XGBlin) classifiers. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R2 = 0.53 and RMSE = 1.7 m (nRMSE of 18.4% and bias of 0.046 m) for BO-XGBdart and R2 = 0.51 and RMSE = 1.8 m (nRMSE of 15.8% and bias of −0.244 m) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection: it reduced the input data by roughly 95%, selecting the 29 most important of the initial 516 lidar-metric and hyperspectral variables.
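A hedged sketch of the BO-XGB idea follows, assuming the third-party boruta package and pre-extracted feature arrays (file names are placeholders): Boruta confirms the important lidar and hyperspectral variables, and an XGBoost regressor is then scored by R2 and RMSE.

```python
# Sketch only: Boruta feature selection followed by XGBoost height regression.
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

X = np.load("lidar_plus_hyperspectral_features.npy")   # assumed (n, 516) array
y = np.load("field_measured_height_m.npy")             # assumed (n,) heights

# Boruta ranks features against shadow (permuted) copies using a RF estimator.
boruta = BorutaPy(RandomForestRegressor(n_estimators=200),
                  n_estimators="auto", random_state=1)
boruta.fit(X, y)
X_sel = X[:, boruta.support_]                           # confirmed features only

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=1)
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05).fit(X_tr, y_tr)
pred = xgb.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"R^2 = {r2_score(y_te, pred):.2f}, RMSE = {rmse:.2f} m")
```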


2021 ◽  
Author(s):  
Mandana Modabbernia ◽  
Heather C Whalley ◽  
David Glahn ◽  
Paul M. Thompson ◽  
Rene S. Kahn ◽  
...  

Application of machine learning algorithms to structural magnetic resonance imaging (sMRI) data has yielded behaviorally meaningful estimates of the biological age of the brain (brain-age). The choice of machine learning approach for estimating brain-age in children and adolescents is important because age-related brain changes in these age groups are dynamic. However, the comparative performance of the many available machine learning algorithms has not been systematically appraised. To address this gap, the present study evaluated the accuracy (Mean Absolute Error; MAE) and computational efficiency of 21 machine learning algorithms using sMRI data from 2,105 typically developing individuals aged 5 to 22 years from five cohorts. The trained models were then tested on an independent holdout dataset comprising 4,078 pre-adolescents (aged 9-10 years). The algorithms encompassed parametric and nonparametric, Bayesian, linear and nonlinear, tree-based, and kernel-based models. Sensitivity analyses were performed for parcellation scheme, number of neuroimaging input features, number of cross-validation folds, and sample size. The best-performing algorithms were Extreme Gradient Boosting (MAE of 1.25 years for females and 1.57 years for males), Random Forest Regression (MAE of 1.23 years for females and 1.65 years for males), and Support Vector Regression with a Radial Basis Function Kernel (MAE of 1.47 years for females and 1.72 years for males), which had acceptable and comparable computational efficiency. The findings of the present study can be used as a guide for optimizing methodology when quantifying age-related changes during development.
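A minimal illustration of such a comparison (not the study's pipeline) is given below: cross-validated MAE for three of the evaluated regressors, computed on assumed sMRI-feature and age arrays.

```python
# Sketch: comparing brain-age regressors by cross-validated MAE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X = np.load("smri_features.npy")       # assumed (n_subjects, n_features) matrix
y = np.load("chronological_age.npy")   # assumed ages in years

models = {
    "XGBoost": XGBRegressor(n_estimators=400, learning_rate=0.05),
    "Random forest": RandomForestRegressor(n_estimators=400),
    "SVR (RBF)": SVR(kernel="rbf", C=10.0),
}
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.2f} years")
```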


Author(s):  
Hemant Raheja ◽  
Arun Goel ◽  
Mahesh Pal

Abstract The present paper evaluates the performance of three machine learning algorithms, namely the deep neural network (DNN), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost), in estimating groundwater quality indices over a study area in Haryana state (India). To investigate the applicability of these models, two water quality indices, the Entropy Water Quality Index (EWQI) and the Water Quality Index (WQI), are employed in the present study. Analysis of the results demonstrated that the DNN exhibited comparatively lower error values and performed better in the prediction of both indices, i.e. EWQI and WQI. Values of the correlation coefficient (CC = 0.989), root mean square error (RMSE = 0.037), Nash–Sutcliffe efficiency (NSE = 0.995), and index of agreement (d = 0.999) were obtained for EWQI, and CC = 0.975, RMSE = 0.055, NSE = 0.991, and d = 0.998 for WQI. From the variable importance of the input parameters, electrical conductivity (EC) was observed to be the most significant and pH the least significant parameter in predicting EWQI and WQI with these three models. It is envisaged that the results of the study can be used to reliably predict the EWQI and WQI of groundwater to decide its potability.
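Two of the goodness-of-fit measures quoted above, the Nash–Sutcliffe efficiency and Willmott's index of agreement, are not provided by scikit-learn; a small sketch of their standard formulas is given below, with dummy observed and simulated values for illustration only.

```python
# Standard formulas for NSE and Willmott's index of agreement (d).
import numpy as np

def nash_sutcliffe(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def index_of_agreement(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    denom = np.sum((np.abs(sim - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return 1.0 - np.sum((obs - sim) ** 2) / denom

# Dummy values, not the study's data:
obs = np.array([45.2, 60.1, 38.7, 52.4])
sim = np.array([44.0, 61.5, 40.2, 50.9])
print(nash_sutcliffe(obs, sim), index_of_agreement(obs, sim))
```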


Sensors ◽  
2018 ◽  
Vol 19 (1) ◽  
pp. 45 ◽  
Author(s):  
Huixiang Liu ◽  
Qing Li ◽  
Bin Yan ◽  
Lei Zhang ◽  
Yu Gu

In this study, a portable electronic nose (E-nose) prototype is developed using metal oxide semiconductor (MOS) sensors to detect odors of different wines. Odor detection facilitates the distinction of wines with different properties, including areas of production, vintage years, fermentation processes, and varietals. Four popular machine learning algorithms—extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and backpropagation neural network (BPNN)—were used to build identification models for different classification tasks. Experimental results show that BPNN achieved the best performance, with accuracies of 94% and 92.5% in identifying production areas and varietals, respectively; and SVM achieved the best performance in identifying vintages and fermentation processes, with accuracies of 67.3% and 60.5%, respectively. Results demonstrate the effectiveness of the developed E-nose, which could be used to distinguish different wines based on their properties following selection of an optimal algorithm.
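A minimal sketch of the classification comparison (not the prototype's software) follows, using scikit-learn's MLPClassifier as a stand-in for the backpropagation neural network and an RBF-kernel SVM; the sensor-response file and label column are assumptions.

```python
# Sketch: classifying wine properties from hypothetical MOS-sensor responses.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("enose_wine_responses.csv")        # assumed sensor dataset
X, y = df.drop(columns=["production_area"]), df["production_area"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

for name, clf in [
    ("BPNN", make_pipeline(StandardScaler(),
                           MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000))),
    ("SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))),
]:
    clf.fit(X_tr, y_tr)
    print(f"{name} accuracy: {clf.score(X_te, y_te):.3f}")
```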


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Fernando Timoteo Fernandes ◽  
Tiago Almeida de Oliveira ◽  
Cristiane Esteves Teixeira ◽  
Andre Filipe de Moraes Batista ◽  
Gabriel Dalla Costa ◽  
...  

Abstract: The new coronavirus disease (COVID-19) is a challenge for clinical decision-making and the effective allocation of healthcare resources. An accurate prognostic assessment is necessary to improve survival of patients, especially in developing countries. This study proposes to predict the risk of developing critical conditions in COVID-19 patients by training multipurpose algorithms. We followed a total of 1040 patients with a positive RT-PCR diagnosis for COVID-19 from a large hospital in São Paulo, Brazil, from March to June 2020, of which 288 (28%) presented a severe prognosis, i.e. Intensive Care Unit (ICU) admission, use of mechanical ventilation or death. We used routinely-collected laboratory, clinical and demographic data to train five machine learning algorithms (artificial neural networks, extra trees, random forests, catboost, and extreme gradient boosting). We used a random sample of 70% of patients to train the algorithms and 30% were left for performance assessment, simulating new unseen data. In order to assess if the algorithms could capture general severe prognostic patterns, each model was trained by combining two out of three outcomes to predict the other. All algorithms presented very high predictive performance (average AUROC of 0.92, sensitivity of 0.92, and specificity of 0.82). The three most important variables for the multipurpose algorithms were ratio of lymphocyte per C-reactive protein, C-reactive protein and Braden Scale. The results highlight the possibility that machine learning algorithms are able to predict unspecific negative COVID-19 outcomes from routinely-collected data.
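The training and evaluation protocol can be sketched as follows, under assumed column names (the lymphocyte-to-C-reactive-protein ratio, C-reactive protein, and Braden Scale noted above): a stratified 70/30 split, an XGBoost classifier for a composite severe-outcome label, and AUROC, sensitivity, and specificity on the held-out 30%.

```python
# Hedged sketch of the prognostic evaluation protocol; columns are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix
from xgboost import XGBClassifier

df = pd.read_csv("covid_admissions.csv")                  # assumed dataset
features = ["lymphocyte_crp_ratio", "c_reactive_protein", "braden_scale",
            "age", "sex"]                                  # assumed predictors
X, y = df[features], df["severe_outcome"]                  # 1 = ICU, ventilation or death

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
clf = XGBClassifier(n_estimators=300, learning_rate=0.1).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_te, (proba >= 0.5).astype(int)).ravel()
print(f"AUROC = {roc_auc_score(y_te, proba):.2f}, "
      f"sensitivity = {tp / (tp + fn):.2f}, specificity = {tn / (tn + fp):.2f}")
```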


2020 ◽  
Author(s):  
Fernando Timoteo Fernandes ◽  
Tiago Almeida de Oliveira ◽  
Cristiane Esteves Teixeira ◽  
Andre Filipe de Moraes Batista ◽  
Gabriel Dalla Costa ◽  
...  

Introduction: The new coronavirus disease (COVID-19) is a challenge for clinical decision-making and the effective allocation of healthcare resources. An accurate prognostic assessment is necessary to improve survival of patients, especially in developing countries. This study proposes to predict the risk of developing critical conditions in COVID-19 patients by training multipurpose algorithms. Methods: A total of 1,040 patients with a positive RT-PCR diagnosis for COVID-19 from a large hospital in São Paulo, Brazil, were followed from March to June 2020, of which 288 (28%) presented a severe prognosis, i.e. Intensive Care Unit (ICU) admission, use of mechanical ventilation or death. Routinely-collected laboratory, clinical and demographic data was used to train five machine learning algorithms (artificial neural networks, extra trees, random forests, catboost, and extreme gradient boosting). A random sample of 70% of patients was used to train the algorithms and 30% were left for performance assessment, simulating new unseen data. In order to assess if the algorithms could capture general severe prognostic patterns, each model was trained by combining two out of three outcomes to predict the other. Results: All algorithms presented very high predictive performance (average AUROC of 0.92, sensitivity of 0.92, and specificity of 0.82). The three most important variables for the multipurpose algorithms were ratio of lymphocyte per C-reactive protein, C-reactive protein and Braden Scale. Conclusion: The results highlight the possibility that machine learning algorithms are able to predict unspecific negative COVID-19 outcomes from routinely-collected data.


2020 ◽  
Vol 12 (14) ◽  
pp. 2234 ◽  
Author(s):  
Mostafa Emadi ◽  
Ruhollah Taghizadeh-Mehrjardi ◽  
Ali Cherati ◽  
Majid Danesh ◽  
Amir Mosavi ◽  
...  

Estimation of the soil organic carbon (SOC) content is of utmost importance in understanding the chemical, physical, and biological functions of the soil. This study applies support vector machines (SVM), artificial neural networks (ANN), regression trees, random forest (RF), extreme gradient boosting (XGBoost), and a conventional deep neural network (DNN) to advance prediction models of SOC. The models are trained with 1,879 composite surface soil samples and 105 auxiliary variables as predictors. A genetic algorithm is used as a feature selection approach to identify effective variables. The results indicate that precipitation is the most important predictor, driving 14.9% of SOC spatial variability, followed by the normalized difference vegetation index (12.5%), the day temperature index of the moderate resolution imaging spectroradiometer (10.6%), multiresolution valley bottom flatness (8.7%), and land use (8.2%). Based on 10-fold cross-validation, the DNN model emerged as the superior algorithm, with the lowest prediction error and uncertainty. In terms of accuracy, the DNN yielded a mean absolute error of 0.59%, a root mean squared error of 0.75%, a coefficient of determination of 0.65, and Lin's concordance correlation coefficient of 0.83. The SOC content was highest in the udic soil moisture regime class, with a mean value of 3.71%, followed by the aquic (2.45%) and xeric (2.10%) classes. Soils in dense forestlands had the highest SOC contents, whereas soils of younger geological age and alluvial fans had lower SOC. The proposed DNN (hidden layers = 7, size = 50) is a promising algorithm for handling large numbers of auxiliary data at the province scale; owing to its flexible structure and its ability to extract more information from the auxiliary data surrounding the sampled observations, it achieved high accuracy for the prediction of the SOC baseline map with minimal uncertainty.
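Lin's concordance correlation coefficient is not built into scikit-learn; the sketch below computes it directly and pairs it with an MLP regressor loosely matching the paper's DNN architecture (7 hidden layers of 50 units). Input file names are placeholders, not the study's data.

```python
# Sketch: SOC regression with an MLP (DNN-like) model and Lin's CCC.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lins_ccc(obs, pred):
    # Lin's concordance correlation coefficient: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    cov = np.mean((obs - obs.mean()) * (pred - pred.mean()))
    return 2 * cov / (obs.var() + pred.var() + (obs.mean() - pred.mean()) ** 2)

X = np.load("soc_auxiliary_covariates.npy")   # assumed (n_samples, 105) predictors
y = np.load("soc_percent.npy")                # assumed SOC content (%)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
dnn = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(50,) * 7, max_iter=2000))
dnn.fit(X_tr, y_tr)
print("Lin's CCC on the hold-out set:", round(lins_ccc(y_te, dnn.predict(X_te)), 2))
```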

