Effects of Training Set Size on Supervised Machine-Learning Land-Cover Classification of Large-Area High-Resolution Remotely Sensed Data

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, to be used, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, and minimal variations in overall accuracy between very large and small sample sets, as well as relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
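The kind of experiment described in this abstract, training one classifier on progressively smaller subsets and recording overall accuracy, could be sketched roughly as follows. This is a minimal illustration only: the synthetic two-class data, the pure-Python k-NN, the subset sizes, and all function names are assumptions made for the example and are not the study's data, software, or settings.

```python
import random

def make_samples(n, rng):
    """Draw n synthetic two-feature samples from two overlapping classes."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 2.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        samples.append((features, label))
    return samples

def knn_predict(train, point, k=5):
    """Majority vote among the k training samples nearest to point."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def overall_accuracy(train, test, k=5):
    """Fraction of test samples whose predicted label matches the truth."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)

def learning_curve(sizes, n_test=200, seed=0):
    """Overall accuracy for nested training subsets of the given sizes."""
    rng = random.Random(seed)
    pool = make_samples(max(sizes), rng)   # largest training set
    test = make_samples(n_test, rng)       # held-out evaluation samples
    return {n: overall_accuracy(pool[:n], test) for n in sizes}

if __name__ == "__main__":
    # Hypothetical subset sizes, chosen only to keep the example fast.
    for n, acc in learning_curve([40, 160, 640]).items():
        print(f"n = {n:4d}  overall accuracy = {acc:.3f}")
```

Using nested subsets of one pool keeps the comparison across sizes on the same underlying samples; repeating the run with several seeds would give a sense of the variability at each size.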

Multilayer Soil Moisture Mapping at a Regional Scale from Multisource Data via a Machine Learning Method

Remote Sensing ◽ 10.3390/rs11030284 ◽ 2019 ◽ Vol 11 (3) ◽ pp. 284 ◽ Cited By ~ 1

Author(s): Linglin Zeng ◽ Shun Hu ◽ Daxiang Xiang ◽ Xiang Zhang ◽ Deren Li ◽ ...

Keyword(s): Machine Learning ◽ Soil Moisture ◽ Regional Scale ◽ Remotely Sensed ◽ Temporal Variations ◽ Training Data ◽ Estimation Accuracy ◽ Learning Approaches ◽ Remotely Sensed Data ◽ Deep Soil

Soil moisture mapping at a regional scale is commonplace since these data are required in many applications, such as hydrological and agricultural analyses. The use of remotely sensed data for the estimation of deep soil moisture at a regional scale has received far less emphasis. The objective of this study was to map the 500-m, 8-day average and daily soil moisture at different soil depths in Oklahoma from remotely sensed and ground-measured data using the random forest (RF) method, a machine-learning approach. To investigate the estimation accuracy of the RF method at both spatial and temporal scales, two independent soil moisture estimation experiments were conducted using data from 2010 to 2014: a year-to-year experiment (with a root mean square error (RMSE) ranging from 0.038 to 0.050 m³/m³) and a station-to-station experiment (with an RMSE ranging from 0.044 to 0.057 m³/m³). Then, the data requirements, importance factors, and spatial and temporal variations in estimation accuracy were discussed based on the results using the training data selected by iterated random sampling. The highly accurate estimations of both the surface and the deep soil moisture for the study area reveal the potential of RF methods when mapping soil moisture at a regional scale, especially when considering the high heterogeneity of land-cover types and topography in the study area.
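For readers unfamiliar with the RMSE figures quoted in this abstract, the metric can be computed as in the short sketch below. The function name and the sample values are hypothetical and serve only to show the calculation; they are not data from the study.

```python
import math

def rmse(estimated, observed):
    """Root mean square error between paired estimates and observations."""
    if len(estimated) != len(observed) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical volumetric soil moisture values in m³/m³, for illustration only.
estimated = [0.21, 0.25, 0.30, 0.18]
observed = [0.20, 0.28, 0.27, 0.22]
print(f"RMSE = {rmse(estimated, observed):.3f} m³/m³")
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, which is worth keeping in mind when comparing the two experiments' ranges.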

Impact of Training Sample Size on the Effects of Regularization in a Convolutional Neural Network-based Dental X-ray Artifact Prediction Model

Journal of Undergraduate Life Sciences ◽ 10.33137/juls.v14i1.35883 ◽ 2020 ◽ Vol 14 (1) ◽ pp. 5

Author(s): Adam Adli ◽ Pascal Tyrrell

Keyword(s): Neural Network ◽ Machine Learning ◽ Convolutional Neural Network ◽ Sample Size ◽ Training Sample ◽ Training Data ◽ Classification Model ◽ Sample Sizes ◽ X Ray ◽ Training Sample Size

Introduction: Advances in computers have allowed for the practical application of increasingly advanced machine learning models to aid healthcare providers with diagnosis and inspection of medical images. Often, a lack of training data and computation time is a limiting factor in the development of an accurate machine learning model in the domain of medical imaging. As a possible solution, this study investigated whether L2 regularization moderates the overfitting that occurs as a result of small training sample sizes. Methods: This study employed transfer learning experiments on a dental x-ray binary classification model to explore L2 regularization with respect to training sample size in five common convolutional neural network architectures. Model testing performance was investigated, and technical implementation details, including computation times and hardware considerations, as well as performance factors and practical feasibility, were described. Results: The experimental results showed a trend that smaller training sample sizes benefitted more from regularization than larger training sample sizes. Further, the results showed that applying L2 regularization did not add significant computational overhead and that the extra rounds of training needed for L2 regularization were feasible when training sample sizes were relatively small. Conclusion: Overall, this study found that there is a window of opportunity in which the benefits of employing regularization can be most cost-effective relative to training sample size. It is recommended that training sample size should be carefully considered when forming expectations of achievable generalizability improvements that result from investing computational resources into model regularization.
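To make the idea of L2 regularization concrete, the sketch below uses a deliberately simple stand-in: a one-parameter linear model with no intercept, where the L2-penalized least-squares slope has a closed form. This is not the convolutional networks or transfer-learning setup of the study; the data, the penalty strength `lam`, and the function name are all assumptions for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)^2) + lam * w^2 (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical noisy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.3, 3.7, 6.4, 7.6, 10.5, 11.8, 14.3, 15.9]

for n in (2, 8):
    plain = ridge_slope(xs[:n], ys[:n], lam=0.0)
    shrunk = ridge_slope(xs[:n], ys[:n], lam=1.0)
    print(f"n = {n}: unregularized w = {plain:.3f}, L2-regularized w = {shrunk:.3f}")
```

In this toy model the same penalty pulls the slope toward zero by the factor sum(x*x) / (sum(x*x) + lam), so it changes the estimate more when few samples contribute to the sum. Whether that carries over to a deep network is an empirical question of the kind the study examines.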

Integrating Remote Sensing, Machine Learning, and Citizen Science in Dutch Archaeological Prospection

Remote Sensing ◽ 10.3390/rs11070794 ◽ 2019 ◽ Vol 11 (7) ◽ pp. 794 ◽ Cited By ~ 27

Author(s): Karsten Lambers ◽ Wouter Verschoof-van der Vaart ◽ Quentin Bourgeois

Keyword(s): Machine Learning ◽ Remote Sensing ◽ Object Detection ◽ Citizen Science ◽ Remotely Sensed ◽ Training Data ◽ Remotely Sensed Data ◽ Study Region ◽ Archaeological Prospection ◽ Archaeological Object

Although the history of automated archaeological object detection in remotely sensed data is short, progress and emerging trends are evident. Among them, the shift from rule-based approaches towards machine learning methods is, at the moment, the cause for high expectations, even though basic problems, such as the lack of suitable archaeological training data, are only beginning to be addressed. In a case study in the central Netherlands, we are currently developing novel methods for multi-class archaeological object detection in LiDAR data based on convolutional neural networks (CNNs). This research is embedded in a long-term investigation of the prehistoric landscape of our study region. Here we present an innovative integrated workflow that combines machine learning approaches to automated object detection in remotely sensed data with a two-tier citizen science project that allows us to generate and validate detections of hitherto unknown archaeological objects, thereby contributing to the creation of reliable, labeled archaeological training datasets. We motivate our methodological choices in the light of current trends in archaeological prospection, remote sensing, machine learning, and citizen science, and present the first results of the implementation of the workflow in our research area.

The Classification of Noise-Afflicted Remotely Sensed Data Using Three Machine-Learning Techniques: Effect of Different Levels and Types of Noise on Accuracy

ISPRS International Journal of Geo-Information ◽ 10.3390/ijgi7070274 ◽ 2018 ◽ Vol 7 (7) ◽ pp. 274 ◽ Cited By ~ 6

Author(s): Sornkitja Boonprong ◽ Chunxiang Cao ◽ Wei Chen ◽ Xiliang Ni ◽ Min Xu ◽ ...

Keyword(s): Machine Learning ◽ Satellite Image ◽ Back Propagation ◽ Speckle Noise ◽ Remotely Sensed ◽ Back Propagation Neural Network ◽ Supervised Machine Learning ◽ Machine Learning Techniques ◽ Support Vector ◽ Remotely Sensed Data

Remotely sensed data are often adversely affected by many types of noise, which influences the classification result. Supervised machine-learning (ML) classifiers such as random forest (RF), support vector machine (SVM), and back-propagation neural network (BPNN) are broadly reported to improve robustness against noise. However, only a few comparative studies that may help investigate this robustness have been reported. An important contribution, going beyond previous studies, is that we perform the analyses by employing the most well-known and broadly implemented packages of the three classifiers and control their settings to represent users’ actual applications. This facilitates an understanding of the extent to which the noise types and levels in remotely sensed data impact classification accuracy using ML classifiers. By using those implementations, we classified the land cover data from a satellite image that was separately afflicted by seven levels of zero-mean Gaussian, salt–pepper, and speckle noise. The modeling data and features were strictly controlled. Finally, we discussed how each noise type affects the accuracy obtained from each classifier and the robustness of the classifiers to noise in the data. This may enhance our understanding of the relationship between noise, supervised ML classifiers, and remotely sensed data.
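The three noise types named in this abstract are commonly modelled as additive, impulsive, and multiplicative perturbations respectively. The sketch below shows one conventional formulation of each on a flat list of pixel values. The abstract does not give the parameterization or the seven noise levels used, so the function names, parameters, and values here are assumptions.

```python
import random

def add_gaussian(pixels, sigma, rng):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def add_salt_pepper(pixels, fraction, rng, low=0.0, high=1.0):
    """Replace roughly `fraction` of pixels with the low or high extreme."""
    noisy = []
    for p in pixels:
        if rng.random() < fraction:
            noisy.append(low if rng.random() < 0.5 else high)
        else:
            noisy.append(p)
    return noisy

def add_speckle(pixels, sigma, rng):
    """Multiplicative noise: each pixel is scaled by (1 + zero-mean Gaussian)."""
    return [p * (1.0 + rng.gauss(0.0, sigma)) for p in pixels]

rng = random.Random(0)
band = [0.2, 0.4, 0.6, 0.8]  # hypothetical reflectance values in [0, 1]
print(add_gaussian(band, 0.05, rng))
print(add_salt_pepper(band, 0.25, rng))
print(add_speckle(band, 0.05, rng))
```

Note that this sketch does not clip results back into the valid range, so Gaussian and speckle outputs can fall slightly outside [0, 1]. Speckle scales with pixel brightness, which is why dark areas are affected less than bright ones.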

Object-Based Supervised Machine Learning Regional-Scale Land-Cover Classification Using High Resolution Remotely Sensed Data

10.33915/etd.3876 ◽ 2019

Author(s): Christopher A Ramezan

Keyword(s): Machine Learning ◽ High Resolution ◽ Land Cover ◽ Regional Scale ◽ Land Cover Classification ◽ Remotely Sensed ◽ Supervised Machine Learning ◽ Remotely Sensed Data ◽ Object Based

Training Data Distribution Significantly Impacts the Estimation of Tissue Microstructure with Machine Learning

10.1101/2021.04.13.439659 ◽ 2021

Author(s): Noemi G. Gyori ◽ Marco Palombo ◽ Christopher A. Clark ◽ Hui Zhang ◽ Daniel C. Alexander

Keyword(s): Machine Learning ◽ High Precision ◽ Model Fitting ◽ Training Data ◽ Supervised Machine Learning ◽ Parameter Estimates ◽ Traditional Model ◽ Similar Data ◽ Training Set ◽ Accuracy And Precision

Purpose: Supervised machine learning (ML) provides a compelling alternative to traditional model fitting for parameter mapping in quantitative MRI. The aim of this work is to demonstrate and quantify the effect of different training strategies on the accuracy and precision of parameter estimates when supervised ML is used for fitting. Methods: We fit a two-compartment biophysical model to diffusion measurements from in-vivo human brain, as well as simulated diffusion data, using both traditional model fitting and supervised ML. For supervised ML, we train several artificial neural networks, as well as random forest regressors, on different distributions of ground truth parameters. We compare the accuracy and precision of parameter estimates obtained from the different estimation approaches using synthetic test data. Results: When the distribution of parameter combinations in the training set matches those observed in similar data sets, we observe high precision, but inaccurate estimates for atypical parameter combinations. In contrast, when training data is sampled uniformly from the entire plausible parameter space, estimates tend to be more accurate for atypical parameter combinations but may have lower precision for typical parameter combinations. Conclusion: This work highlights the need to consider the choice of training data when deploying supervised ML for estimating microstructural metrics, as performance depends strongly on the training-set distribution. We show that high precision obtained using ML may mask strong bias, and visual assessment of the parameter maps is not sufficient for evaluating the quality of the estimates.
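The distinction this abstract draws between accuracy (bias) and precision (spread) can be illustrated with a few lines of code. The sketch below is a generic illustration with made-up numbers, not the study's evaluation procedure; the function name and both sets of estimates are hypothetical.

```python
import statistics

def bias_and_spread(estimates, truth):
    """Return (bias, standard deviation) of repeated estimates of one value."""
    bias = statistics.mean(estimates) - truth
    spread = statistics.stdev(estimates)
    return bias, spread

# Hypothetical repeated estimates of a parameter whose true value is 0.30.
truth = 0.30
tight_but_biased = [0.50, 0.51, 0.49, 0.50, 0.50]
loose_but_centered = [0.22, 0.38, 0.27, 0.35, 0.28]

for name, est in [("tight", tight_but_biased), ("loose", loose_but_centered)]:
    bias, spread = bias_and_spread(est, truth)
    print(f"{name}: bias = {bias:+.3f}, standard deviation = {spread:.3f}")
```

The first set would look clean on a parameter map because its values barely vary, yet it is far from the truth. Bias of this kind can only be measured where the ground truth is known, for example on synthetic test data.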

Sample size for ground and remotely sensed data

Remote Sensing of Environment ◽ 10.1016/0034-4257(86)90012-x ◽ 1986 ◽ Vol 20 (1) ◽ pp. 31-41 ◽ Cited By ~ 91

Author(s): P.J. Curran ◽ H.D. Williamson

Keyword(s): Sample Size ◽ Remotely Sensed ◽ Remotely Sensed Data

Improving 3-m Resolution Land Cover Mapping through Efficient Learning from an Imperfect 10-m Resolution Map

Remote Sensing ◽ 10.3390/rs12091418 ◽ 2020 ◽ Vol 12 (9) ◽ pp. 1418

Author(s): Runmin Dong ◽ Cong Li ◽ Haohuan Fu ◽ Jie Wang ◽ Weijia Li ◽ ...

Keyword(s): Land Cover ◽ Training Data ◽ Training Dataset ◽ Land Cover Mapping ◽ Remotely Sensed Data ◽ Large Area ◽ National Scale ◽ Substantial Progress ◽ Efficient Learning ◽ Land Cover Maps

Substantial progress has been made in the field of large-area land cover mapping as the spatial resolution of remotely sensed data increases. However, a significant amount of human effort is still required to label images for training and testing purposes, especially in high-resolution (e.g., 3-m) land cover mapping. In this research, we propose a solution that can produce 3-m resolution land cover maps on a national scale without human effort. First, using the public 10-m resolution land cover maps as an imperfect training dataset, we propose a deep-learning-based approach that can effectively transfer the existing knowledge. Then, we improve the efficiency of our method through a network pruning process for national-scale land cover mapping. Our proposed method can take the state-of-the-art 10-m resolution land cover maps (with an accuracy of 81.24% for China) as the training data, enable a transfer learning process that can produce 3-m resolution land cover maps, and further improve the overall accuracy (OA) to 86.34% for China. We present detailed results obtained over three megacities in China to demonstrate the effectiveness of our proposed approach for 3-m resolution large-area land cover mapping.
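Overall accuracy (OA), the metric quoted in this abstract, is conventionally the share of correctly classified samples in a confusion matrix. The sketch below shows that standard calculation. The matrix is hypothetical and the function name is an assumption; the abstract does not describe how the study's validation samples were drawn.

```python
def oa_from_confusion(confusion):
    """Overall accuracy: correctly classified samples divided by all samples.

    confusion[i][j] counts samples of reference class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

# Hypothetical 3-class confusion matrix, for illustration only.
matrix = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 39],
]
print(f"OA = {100 * oa_from_confusion(matrix):.2f}%")
```

OA weights every sample equally, so a dominant class can mask poor performance on rare classes; per-class accuracies read from the same matrix are a common complement.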

Realized and potential efficiency for post-stratified estimation in a national forest inventory

Canadian Journal of Forest Research ◽ 10.1139/cjfr-2020-0379 ◽ 2021

Author(s): James A. Westfall ◽ Andrew J. Lister ◽ John W. Coulston ◽ Ronald E. McRoberts

Keyword(s): Remotely Sensed ◽ National Forest Inventory ◽ Data Types ◽ Remotely Sensed Data ◽ Large Area ◽ Forest Inventories ◽ Temporal Misalignment ◽ Total Tree ◽ Potential Efficiency ◽ Post Stratification

Post-stratification is often used to increase the precision of estimates arising from large-area forest inventories with plots established at permanent locations. Remotely sensed data and associated spatial products are often used for developing the post-stratification, which offers a mechanism to increase precision for less cost than increasing the sample size. While important variance reductions have been shown from post-stratification, it remains unknown where observed gains lie along the continuum of possible gains. This information is needed to determine whether efforts to further improve post-stratification outcomes are warranted. In this study, two types of ‘optimal’ post-stratification were compared to typical production-based post-stratifications to estimate the magnitude of remaining gains possible. Although the ‘optimal’ post-stratifications were derived using methods inappropriate for operational usage, the results indicated that substantial further increases in precision for estimates of both forest area and total tree biomass could be obtained with better post-stratifications. The potential gains differed by the attribute being estimated, the population being studied, and the number of strata. Practitioners seeking to optimize post-stratification face challenges such as evaluation of numerous auxiliary data sources, temporal misalignment between plot observations and remotely sensed data acquisition, and spatial misalignment between plot locations and remotely sensed data due to positional errors in both data types.
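As a rough illustration of why post-stratification can increase precision, the sketch below computes a textbook post-stratified mean and its usual large-sample variance approximation, then compares that variance with the simple-random-sampling variance of the pooled plots. The strata, weights, and observations are hypothetical, the finite population correction is ignored, and this is not the estimator configuration used in the study.

```python
import statistics

def post_stratified_mean(strata, weights):
    """Weighted mean of per-stratum sample means.

    strata maps a stratum name to its plot observations; weights maps the
    same names to known population proportions W_h that sum to 1.
    """
    return sum(weights[h] * statistics.mean(obs) for h, obs in strata.items())

def post_stratified_variance(strata, weights):
    """Large-sample variance approximation, ignoring the finite population
    correction: (1/n) * sum(W_h * s_h^2) + (1/n^2) * sum((1 - W_h) * s_h^2)."""
    n = sum(len(obs) for obs in strata.values())
    s2 = {h: statistics.variance(obs) for h, obs in strata.items()}
    first = sum(weights[h] * s2[h] for h in strata) / n
    second = sum((1.0 - weights[h]) * s2[h] for h in strata) / n ** 2
    return first + second

# Hypothetical biomass observations (Mg/ha) from plots in two strata.
strata = {"forest": [120.0, 135.0, 110.0, 128.0], "nonforest": [5.0, 0.0, 8.0, 3.0]}
weights = {"forest": 0.4, "nonforest": 0.6}

all_plots = [v for obs in strata.values() for v in obs]
srs_variance = statistics.variance(all_plots) / len(all_plots)
ps_variance = post_stratified_variance(strata, weights)
print(f"post-stratified mean = {post_stratified_mean(strata, weights):.1f}")
print(f"relative efficiency = {srs_variance / ps_variance:.1f}")
```

The gain comes from replacing the overall variance with the within-stratum variances, so it is largest when the strata separate plots with very different values. That is why the quality of the remotely sensed stratification layer matters.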
