Estimation of effective calibration sample size using visible near infrared spectroscopy: deep learning vs machine learning
Abstract. The number of samples in the calibration dataset affects the quality of predictive models of soil attributes built from visible, near and shortwave infrared (VIS-NIR-SWIR) spectroscopy. Convolutional neural networks (CNN) have recently been regarded as highly accurate models for predicting soil properties from large databases; however, it has not yet been ascertained how large the sample size must be for a CNN model to be effective. This paper aims to estimate how many calibration samples are needed to improve the performance of soil property predictions with CNN. It is hypothesized that the larger the amount of data, the more accurate the CNN model. The performance of two commonly used machine learning models, partial least squares regression (PLSR) and Cubist, is compared against that of the CNN model. A VIS-NIR-SWIR spectral library from Brazil containing 4251 unique sites, with an average of 2–3 samples per site taken at different depths (12,044 samples in total), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration dataset were then created to represent smaller calibration datasets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650 samples. All three models (PLSR, Cubist and CNN) were fitted at each of these calibration sizes to predict five soil properties: cation exchange capacity, organic matter, sand, silt and clay content. The subset sampling and modelling were repeated ten times to give a more robust estimate of model performance. The comparisons of PLSR and Cubist against CNN gave similar results: CNN outperformed PLSR at a sample size of 1500 and Cubist at a sample size of 1800.
Deep learning can therefore be recommended as most efficient for spectral modelling when the sample size exceeds 2000. The accuracy of the PLSR and Cubist models appeared to plateau above sample sizes of 4200 and 5000, respectively. A sensitivity analysis was performed on the CNN model to identify the wavelength regions that most affected the predictions of the various soil attributes.
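The site-based subsampling design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`split_sites`, `draw_calibration_subsets`, `select_samples`), the `"site"` key on each sample record, the fixed seed, and the use of independent random draws without replacement are all assumptions made for illustration; the abstract does not state how sites were allocated or whether the subsets were nested. Only the site counts and the ten repetitions come from the text.

```python
import random

# Calibration sizes (number of unique sites) listed in the abstract.
SITE_SIZES = [125, 300, 500, 1000, 1500, 2000, 2500, 2700]


def split_sites(site_ids, n_calibration, rng):
    """Randomly split unique site IDs into calibration and validation lists.

    A random split is an assumption; the abstract only gives the set sizes.
    """
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_calibration], shuffled[n_calibration:]


def draw_calibration_subsets(calibration_sites, sizes, n_repeats, rng):
    """Yield (repeat, size, sites), drawing sites without replacement.

    Each draw is independent of the others (assumption: subsets not nested).
    """
    for repeat in range(n_repeats):
        for size in sizes:
            yield repeat, size, rng.sample(calibration_sites, size)


def select_samples(samples, chosen_sites):
    """Keep every sample (all depths) that belongs to a chosen site."""
    chosen = set(chosen_sites)
    return [s for s in samples if s["site"] in chosen]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed is an assumption, for repeatability
    calibration, validation = split_sites(range(4251), 3188, rng)
    for repeat, size, sites in draw_calibration_subsets(
        calibration, SITE_SIZES, 10, rng
    ):
        # Here one would fit PLSR, Cubist and CNN on the samples from `sites`
        # and score each model on the samples from `validation`.
        pass
```

Sampling whole sites rather than individual spectra keeps all depths of a site on the same side of the calibration/validation split, which is why the sizes are quoted in sites first and in approximate sample counts second.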