The effect of sample size on different machine learning models for groundwater potential mapping in mountain bedrock aquifers

Mapping groundwater potential beneath sand sheets in remote arid and semi-arid regions at a very large regional scale is challenging and requires an accurate classifier. The Classification and Regression Trees (CART) model is a robust machine learning classifier used for groundwater potential mapping at this scale. Ten essential groundwater conditioning factors (GWCFs) were constructed from remote sensing data. The spatial relationship between these conditioning factors and the observed groundwater well locations was identified and optimized using the chi-square method. A total of 185 groundwater well locations were randomly divided into 129 (70%) for training the model and 56 (30%) for validation. The model was applied to groundwater potential mapping with the following optimal parameter values: 186 additive trees, a learning rate of 0.1, and a maximum tree size of five. Validation showed that the area under the curve (AUC) of the CART model was 0.920, which represents a predictive accuracy of 92%. The resulting map indicated that the Mondafan, Khujaymah and Wajid Mutaridah depressions and the southern gulf salt basin (SGSB), near the borders of Saudi Arabia, Oman and the United Arab Emirates (UAE), hold fresh fossil groundwater, as indicated by the observed lakes and recovered paleolakes. The proposed model and the new maps improve groundwater potential mapping at a very large regional scale using machine learning algorithms, which are rarely used in the literature for this purpose, and the approach can be applied to the Sahara and the Kalahari Desert.
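The 70/30 split and the AUC reported above can be illustrated with a minimal sketch. It is not the authors' workflow: the function names `split_wells` and `auc`, the random seed, and the toy scores are assumptions made for illustration, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation, which also requires non-well (negative) locations that the abstract does not describe.

```python
import random


def split_wells(n_wells, train_fraction=0.7, seed=0):
    """Randomly split well indices into training and validation lists.

    The seed and the floor rounding are assumptions; they reproduce the
    129/56 partition of 185 wells quoted in the abstract.
    """
    indices = list(range(n_wells))
    random.Random(seed).shuffle(indices)
    n_train = int(n_wells * train_fraction)  # floor of 129.5 gives 129
    return indices[:n_train], indices[n_train:]


def auc(labels, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive location
    receives a higher score than a randomly chosen negative one, with
    ties counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


train_idx, valid_idx = split_wells(185)
print(len(train_idx), len(valid_idx))

# Toy validation scores (hypothetical, not data from the study).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print(auc(toy_labels, toy_scores))
```

In practice the scores would come from the fitted classifier evaluated at the validation locations; the pairwise loop is quadratic and only meant for small validation sets such as the 56 wells here.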


The influence of training sample size on the accuracy of deep learning models for the prediction of soil properties with near-infrared spectroscopy data

SOIL ◽ 10.5194/soil-6-565-2020 ◽ 2020 ◽ Vol 6 (2) ◽ pp. 565-578

Author(s): Wartini Ng ◽ Budiman Minasny ◽ Wanderson de Sousa Mendes ◽ José Alexandre Melo Demattê

Keyword(s): Machine Learning ◽ Deep Learning ◽ Soil Properties ◽ Sample Size ◽ Training Sample ◽ Calibration Data ◽ Learning Models ◽ Data Set ◽ Calibration Data Set ◽ Machine Learning Models

Abstract. The number of samples used in the calibration data set affects the quality of predictive models generated for soil attributes using visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy. Recently, the convolutional neural network (CNN) has been regarded as a highly accurate model for predicting soil properties on a large database. However, it has not yet been ascertained how large the sample size should be for a CNN model to be effective. This paper investigates the effect of the training sample size on the accuracy of deep learning and machine learning models. It aims to estimate how many calibration samples are needed for CNN to improve soil property predictions compared with conventional machine learning models. In addition, this paper looks at a way to interpret CNN models, which are commonly labelled as a black box. It is hypothesised that the performance of machine learning models will increase with an increasing number of training samples but will plateau at a certain number, while the performance of CNN will keep improving. The performances of two machine learning models (partial least squares regression – PLSR; Cubist) are compared against the CNN model. A VIS–NIR–SWIR spectral library from Brazil, containing 4251 unique sites with an average of two to three samples per depth (a total of 12 044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration data set were then created to represent smaller calibration data sets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650. All three models (PLSR, Cubist and CNN) were generated for each subset size for the prediction of five soil properties, i.e. cation exchange capacity, organic carbon, sand, silt and clay content. The calibration subset sampling and modelling were repeated 10 times to provide a better representation of model performance. Learning curves showed that accuracy increased with an increasing number of training samples. At a lower number of samples (< 1000), PLSR and Cubist performed better than CNN. CNN outperformed the PLSR and Cubist models at sample sizes of 1500 and 1800, respectively. Deep learning can therefore be recommended as most efficient for spectra modelling at sample sizes above 2000. The accuracy of the PLSR and Cubist models seems to reach a plateau above sample sizes of 4200 and 5000, respectively, while the accuracy of CNN has not plateaued. A sensitivity analysis of the CNN model demonstrated its ability to determine important wavelength regions that affected the predictions of various soil attributes.
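The repeated-subsampling procedure behind the learning curves can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's pipeline: a one-variable least-squares line stands in for PLSR, Cubist and CNN, the data are synthetic, subsets are drawn from individual samples rather than unique sites, and all names (`fit_line`, `rmse`, `learning_curve`, `make_data`) are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted line on the given points."""
    slope, intercept = model
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


def learning_curve(calibration, validation, sizes, repeats=10, seed=0):
    """Mean validation RMSE for each calibration subset size.

    For every size, a random subset is drawn `repeats` times, a model is
    fitted on it, and the error on the fixed validation set is averaged.
    """
    rng = random.Random(seed)
    val_x, val_y = zip(*validation)
    curve = {}
    for size in sizes:
        errors = []
        for _ in range(repeats):
            xs, ys = zip(*rng.sample(calibration, size))
            errors.append(rmse(fit_line(xs, ys), val_x, val_y))
        curve[size] = sum(errors) / repeats
    return curve


def make_data(n, rng):
    """Synthetic points from y = 2x + 1 plus unit-variance Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)) for x in xs]


data_rng = random.Random(42)
calibration = make_data(2000, data_rng)
validation = make_data(1000, data_rng)
curve = learning_curve(calibration, validation, sizes=[5, 50, 500])
for size, error in sorted(curve.items()):
    print(size, round(error, 3))
```

A simple model like this one levels off near the noise floor once the subset is large enough, which mirrors the plateau reported for PLSR and Cubist; reproducing the CNN behaviour would require substituting the actual models and drawing subsets by site, as the abstract describes.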


Quadratic Discriminant Analysis Based Ensemble Machine Learning Models for Groundwater Potential Modeling and Mapping

Water Resources Management ◽ 10.1007/s11269-021-02957-6 ◽ 2021

Author(s): Duong Hai Ha ◽ Phong Tung Nguyen ◽ Romulus Costache ◽ Nadhir Al-Ansari ◽ Tran Van Phong ◽ ...

Keyword(s): Machine Learning ◽ Discriminant Analysis ◽ Groundwater Potential ◽ Quadratic Discriminant Analysis ◽ Learning Models ◽ Ensemble Machine Learning ◽ Machine Learning Models


Characterizing Groundwater Potential Using GIS-Based Machine Learning Model in Chihe River Basin, China

10.21203/rs.3.rs-1044219/v1 ◽ 2021

Author(s): Dejian Wang ◽ Jiazhong Qian ◽ Lei Ma ◽ Weidong Zhao ◽ Di Gao ◽ ...

Keyword(s): Machine Learning ◽ Water Resources ◽ River Basin ◽ Environmental Variables ◽ Groundwater Potential ◽ High Potential ◽ Slope Aspect ◽ Learning Models ◽ Topographic Wetness Index ◽ Machine Learning Models

Abstract. Mapping groundwater potential over space by combining environmental variables with machine learning models is of great significance for regional water resources management. Taking the Chihe River basin in Anhui province as an example, thirteen influencing factors were used to predict the spatial distribution of groundwater: elevation, slope, aspect, plan curvature, profile curvature, topographic wetness index (TWI), drainage density, distance to rivers, distance to faults, lithology, soil type, land use, and normalized difference vegetation index (NDVI). The groundwater resource potential of the region was predicted using GIS-based machine learning models, namely logistic regression (LR), deep neural network (DNN), and random forest (RF) models. The accuracy of the predictions was then evaluated by calculating the RMSE, MAE and R evaluation indexes. The results show no collinearity among the 13 environmental factors, so all of them can serve as environmental variables for evaluating regional groundwater potential. The machine learning models show that groundwater potential is concentrated in moderate to high potential areas, which accounted for 81.14% of the area in the LR model and for 90.36% and 87.55% in the DNN and RF models, respectively. According to these evaluation indexes, all three models have high prediction accuracy, with the LR model performing best. The good predictive capability of these machine learning techniques can provide a reliable scientific basis for spatial prediction of groundwater potential and management of water resources.
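The three evaluation indexes named above can be written out as a short sketch. It assumes that "R" denotes the Pearson correlation coefficient between observed and predicted values, which the abstract does not state explicitly; the function name `evaluation_indexes` and the toy values are hypothetical and are not results from the study.

```python
import math


def evaluation_indexes(observed, predicted):
    """Return RMSE, MAE and Pearson R for paired observations and predictions.

    R is assumed here to be the Pearson correlation coefficient.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)
    return rmse, mae, r


# Toy values for illustration only.
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.5, 2.5, 2.5, 4.5]
toy_rmse, toy_mae, toy_r = evaluation_indexes(toy_observed, toy_predicted)
print(toy_rmse, toy_mae, round(toy_r, 4))
```

Lower RMSE and MAE and an R closer to 1 indicate better agreement, so comparing the LR, DNN and RF models amounts to computing these three numbers for each model on the same validation points.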
