Predictive modelling of groundwater nitrate pollution at a regional scale using machine learning and feature selection

The establishment of the sources and driven-forces of groundwater nitrate pollution is of paramount importance, contributing to agro-environmental measures implementation and evaluation. High concentrations of nitrates in groundwater occur all around the world, in rich and less developed countries.In the case of Spain, 21.5% of the wells of the groundwater quality monitoring network showed mean concentrations above the quality standard (QS) of 50 mg/l. The objectives of this work were: i) to predict the current probability of having nitrate concentrations above the QS in Andalusian groundwater bodies (Spain) using past time features, being some of them obtained from satellite observations; ii) to assess the importance of features in the prediction; iii) to evaluate different machine learning approaches (ML) and feature selection techniques (FS).Several predictive models based on an ML algorithm, the Random Forest, were used, as well as, FS techniques. 321 nitrate samples and respective predictive features were obtained from different groundwater bodies. These predictive features were divided into three groups, regarding their focus: agricultural production (phenology); livestock pressure (excretion rates); and environmental settings (soil characteristics and texture, geomorphology, and local climate conditions). Models were trained with the features of a year [YEAR (t0)], and then applied to new features obtained for the next year &#8211; [YEAR(t0+1)], performing k-fold cross-validation. Additionally, a further prediction was carried out for a present time &#8211; [YEAR(t0+n)], validating with an independent test. This methodology examined the use of a model, trained with previous nitrates concentrations and predictive features, for the prediction of current nitrates concentrations based on present features. Our findings showed an improvement in the predictive performance when using a wrapper with sequential search for FS when compared to the use alone of the Random Forest algorithm. Phenology features, derived from remotely sensed variables, were the most explanative features, performing better than the use of static land-use maps or vegetation index images (e.g., NDVI). They also provided much more comprehensive information, and more importantly, employing only extrinsic features of groundwater bodies.

Download Full-text

Multilayer Soil Moisture Mapping at a Regional Scale from Multisource Data via a Machine Learning Method

Remote Sensing ◽

10.3390/rs11030284 ◽

2019 ◽

Vol 11 (3) ◽

pp. 284 ◽

Cited By ~ 1

Author(s):

Linglin Zeng ◽

Shun Hu ◽

Daxiang Xiang ◽

Xiang Zhang ◽

Deren Li ◽

...

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Regional Scale ◽

Remotely Sensed ◽

Temporal Variations ◽

Training Data ◽

Estimation Accuracy ◽

Learning Approaches ◽

Remotely Sensed Data ◽

Deep Soil

Soil moisture mapping at a regional scale is commonplace since these data are required in many applications, such as hydrological and agricultural analyses. The use of remotely sensed data for the estimation of deep soil moisture at a regional scale has received far less emphasis. The objective of this study was to map the 500-m, 8-day average and daily soil moisture at different soil depths in Oklahoma from remotely sensed and ground-measured data using the random forest (RF) method, which is one of the machine-learning approaches. In order to investigate the estimation accuracy of the RF method at both a spatial and a temporal scale, two independent soil moisture estimation experiments were conducted using data from 2010 to 2014: a year-to-year experiment (with a root mean square error (RMSE) ranging from 0.038 to 0.050 m3/m3) and a station-to-station experiment (with an RMSE ranging from 0.044 to 0.057 m3/m3). Then, the data requirements, importance factors, and spatial and temporal variations in estimation accuracy were discussed based on the results using the training data selected by iterated random sampling. The highly accurate estimations of both the surface and the deep soil moisture for the study area reveal the potential of RF methods when mapping soil moisture at a regional scale, especially when considering the high heterogeneity of land-cover types and topography in the study area.

Download Full-text

An Improved Machine Learning-Based Employees Attrition Prediction Framework with Emphasis on Feature Selection

Mathematics ◽

10.3390/math9111226 ◽

2021 ◽

Vol 9 (11) ◽

pp. 1226

Author(s):

Saeed Najafi-Zangeneh ◽

Naser Shams-Gharneh ◽

Ali Arjomandi-Nezhad ◽

Sarfaraz Hashemkhani Zolfani

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Standard Deviation ◽

Analytical Formula ◽

Feature Selection Method ◽

Selection Method ◽

Performance Measure ◽

Learning Approaches ◽

Training Costs ◽

Professional Employees

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.

Download Full-text

Beyond Cloud: A Fused Optic and SAR Based Solution to Monitor Crop Health

10.5194/egusphere-egu21-15589 ◽

2021 ◽

Author(s):

Brianna Pagán ◽

Adekunle Ajayi ◽

Mamadou Krouma ◽

Jyotsna Budideti ◽

Omar Tafsi

Keyword(s):

Machine Learning ◽

Satellite Imagery ◽

Cloud Cover ◽

Vegetation Index ◽

Normalized Difference Vegetation Index ◽

Learning Approaches ◽

Area Index ◽

Spatial And Temporal Scales ◽

Field Level ◽

Crop Health

The value of satellite imagery to monitor crop health in near-real time continues to exponentially grow as more missions are launched making data available at higher spatial and temporal scales. Yet cloud cover remains an issue for utilizing vegetation indexes (VIs) solely based on optic imagery, especially in certain regions and climates. Previous research has proven the ability to reconstruct VIs like the Normalized Difference Vegetation Index (NDVI) and Leaf Area Index (LAI) by leveraging synthetic aperture radar (SAR) datasets, which are not inhibited by cloud cover. Publicly available data from SAR missions like Sentinel-1 at relatively decent spatial resolutions present the opportunity for more affordable options for agriculture users to integrate satellite imagery in their day to day operations. Previous research has successfully reconstructed optic VIs (i.e. from Sentinel-2) with SAR data (i.e. from Sentinel-1) leveraging various machine learning approaches for a limited number of crop types. However, these efforts normally train on individual pixels rather than leveraging information at a field level.&#160;Here we present Beyond Cloud, a product which is the first to leverage computer vision and machine learning approaches in order to provide fused optic and SAR based crop health information. Field level learning is especially well-suited for inherently noisy SAR datasets. Several use cases are presented over agriculture fields located throughout the United Kingdom, France and Belgium, where cloud cover limits optic based solutions to as little as 2-3 images per growing season. Preliminary efforts for additional features to the product including automated crop and soil type detection are also discussed. Beyond Cloud can be accessed via a simple API which makes integration of the results easy for existing dashboards and smart-ag tools. Overall, these efforts promote the accessibility of satellite imagery for real agriculture end users.&#160;

Download Full-text

Predicting discontinuation of docetaxel treatment for metastatic castration-resistant prostate cancer (mCRPC) with random forest

F1000Research ◽

10.12688/f1000research.8353.1 ◽

2016 ◽

Vol 5 ◽

pp. 2673 ◽

Cited By ~ 1

Author(s):

Daniel Kristiyanto ◽

Kevin E. Anderson ◽

Ling-Hong Hung ◽

Ka Yee Yeung

Keyword(s):

Prostate Cancer ◽

Feature Selection ◽

Random Forest ◽

Developed Countries ◽

Hill Climbing ◽

Support Vector ◽

Castration Resistant Prostate Cancer ◽

K Nearest Neighbor ◽

Missing Data Imputation ◽

Docetaxel Treatment

Prostate cancer is the most common cancer among men in developed countries. Androgen deprivation therapy (ADT) is the standard treatment for prostate cancer. However, approximately one third of all patients with metastatic disease treated with ADT develop resistance to ADT. This condition is called metastatic castrate-resistant prostate cancer (mCRPC). Patients who do not respond to hormone therapy are often treated with a chemotherapy drug called docetaxel. Sub-challenge 2 of the Prostate Cancer DREAM Challenge aims to improve the prediction of whether a patient with mCRPC would discontinue docetaxel treatment due to adverse effects. Specifically, a dataset containing three distinct clinical studies of patients with mCRPC treated with docetaxel was provided. We applied the k-nearest neighbor method for missing data imputation, the hill climbing algorithm and random forest importance for feature selection, and the random forest algorithm for classification. We also empirically studied the performance of many classification algorithms, including support vector machines and neural networks. Additionally, we found using random forest importance for feature selection provided slightly better results than the more computationally expensive method of hill climbing.

Download Full-text

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detect malware like packet content analysis are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers in order to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyper parameters of the algorithms, in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.

Download Full-text

Comparative Analysis on Machine Learning and Deep Learning to Predict Post-Induction Hypotension

Sensors ◽

10.3390/s20164575 ◽

2020 ◽

Vol 20 (16) ◽

pp. 4575 ◽

Cited By ~ 1

Author(s):

Jihyun Lee ◽

Jiyoung Woo ◽

Ah Reum Kang ◽

Young-Seob Jeong ◽

Woohyun Jung ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Feature Selection ◽

Deep Learning ◽

Random Forest ◽

Tracheal Intubation ◽

Feature Engineering ◽

Learning Models ◽

Raw Data ◽

Vital Records

Hypotensive events in the initial stage of anesthesia can cause serious complications in the patients after surgery, which could be fatal. In this study, we intended to predict hypotension after tracheal intubation using machine learning and deep learning techniques after intubation one minute in advance. Meta learning models, such as random forest, extreme gradient boosting (Xgboost), and deep learning models, especially the convolutional neural network (CNN) model and the deep neural network (DNN), were trained to predict hypotension occurring between tracheal intubation and incision, using data from four minutes to one minute before tracheal intubation. Vital records and electronic health records (EHR) for 282 of 319 patients who underwent laparoscopic cholecystectomy from October 2018 to July 2019 were collected. Among the 282 patients, 151 developed post-induction hypotension. Our experiments had two scenarios: using raw vital records and feature engineering on vital records. The experiments on raw data showed that CNN had the best accuracy of 72.63%, followed by random forest (70.32%) and Xgboost (64.6%). The experiments on feature engineering showed that random forest combined with feature selection had the best accuracy of 74.89%, while CNN had a lower accuracy of 68.95% than that of the experiment on raw data. Our study is an extension of previous studies to detect hypotension before intubation with a one-minute advance. To improve accuracy, we built a model using state-of-art algorithms. We found that CNN had a good performance, but that random forest had a better performance when combined with feature selection. In addition, we found that the examination period (data period) is also important.

Download Full-text

Computational prediction of implantation outcome after embryo transfer

Health Informatics Journal ◽

10.1177/1460458219892138 ◽

2019 ◽

Vol 26 (3) ◽

pp. 1810-1826 ◽

Cited By ~ 3

Author(s):

Behnaz Raef ◽

Masoud Maleki ◽

Reza Ferdousi

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Prediction Model ◽

Embryo Transfer ◽

Area Under The Curve ◽

Computational Prediction ◽

Support Vector ◽

Human Menopausal Gonadotropin ◽

Optimum Number ◽

Learning Approaches

The aim of this study is to develop a computational prediction model for implantation outcome after an embryo transfer cycle. In this study, information of 500 patients and 1360 transferred embryos, including cleavage and blastocyst stages and fresh or frozen embryos, from April 2016 to February 2018, were collected. The dataset containing 82 attributes and a target label (indicating positive and negative implantation outcomes) was constructed. Six dominant machine learning approaches were examined based on their performance to predict embryo transfer outcomes. Also, feature selection procedures were used to identify effective predictive factors and recruited to determine the optimum number of features based on classifiers performance. The results revealed that random forest was the best classifier (accuracy = 90.40% and area under the curve = 93.74%) with optimum features based on a 10-fold cross-validation test. According to the Support Vector Machine-Feature Selection algorithm, the ideal numbers of features are 78. Follicle stimulating hormone/human menopausal gonadotropin dosage for ovarian stimulation was the most important predictive factor across all examined embryo transfer features. The proposed machine learning-based prediction model could predict embryo transfer outcome and implantation of embryos with high accuracy, before the start of an embryo transfer cycle.

Download Full-text

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

AbstractWe investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is a common approach to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 per cent points.

Download Full-text

FeatureSelect: a software for feature selection based on machine learning approaches

BMC Bioinformatics ◽

10.1186/s12859-019-2754-0 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

Yosef Masoudi-Sobhanzadeh ◽

Habib Motieghader ◽

Ali Masoudi-Nejad

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Approaches

Download Full-text

Which PHQ-9 Items Can Effectively Screen for Suicide? Machine Learning Approaches

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18073339 ◽

2021 ◽

Vol 18 (7) ◽

pp. 3339

Author(s):

Sunhae Kim ◽

Hye-Kyung Lee ◽

Kounseok Lee

Keyword(s):

Machine Learning ◽

Primary Care ◽

Suicidal Ideation ◽

Random Forest ◽

Suicide Ideation ◽

Area Under The Curve ◽

Learning Approaches ◽

K Nearest Neighbors ◽

Predictive Values ◽

Linear Discriminant

(1) Background: The Patient Health Questionnaire-9 (PHQ-9) is a tool that screens patients for depression in primary care settings. In this study, we evaluated the efficacy of PHQ-9 in evaluating suicidal ideation (2) Methods: A total of 8760 completed questionnaires collected from college students were analyzed. The PHQ-9 was scored in combination with and evaluated against four categories (PHQ-2, PHQ-8, PHQ-9, and PHQ-10). Suicidal ideations were evaluated using the Mini-International Neuropsychiatric Interview suicidality module. Analyses used suicide ideation as the dependent variable, and machine learning (ML) algorithms, k-nearest neighbors, linear discriminant analysis (LDA), and random forest. (3) Results: Random forest application using the nine items of the PHQ-9 revealed an excellent area under the curve with a value of 0.841, with 94.3% accuracy. The positive and negative predictive values were 84.95% (95% CI = 76.03–91.52) and 95.54% (95% CI = 94.42–96.48), respectively. (4) Conclusion: This study confirmed that ML algorithms using PHQ-9 in the primary care field are reliably accurate in screening individuals with suicidal ideation.

Download Full-text