Slope Failure Prediction Using Random Forest Machine Learning and LiDAR in an Eroded Folded Mountain Belt

The probabilistic mapping of landslide occurrence at a high spatial resolution and over a large geographic extent is explored using random forests (RF) machine learning; light detection and ranging (LiDAR)-derived terrain variables; additional variables relating to lithology, soils, distance to roads and streams and cost distance to roads and streams; and training data interpreted from high spatial resolution LiDAR-derivatives. Using a large training set and all predictor variables, an area under the receiver operating characteristic (ROC) curve (AUC) of 0.946 is obtained. Our findings highlight the value of a large training dataset, the incorporation of a variety of terrain variables and the use of variable window sizes to characterize the landscape at different spatial scales. We also document important variables for mapping slope failures. Our results suggest that feature selection is not required to improve the RF modeling results and that incorporating multiple models using different pseudo absence samples is not necessary. From our findings and based on a review of prior studies, we make recommendations for high spatial resolution, large-area slope failure probabilistic mapping.

Download Full-text

Using machine learning to examine street green space types at a high spatial resolution: application in Los Angeles County on socioeconomic disparities in exposure

The Science of The Total Environment ◽

10.1016/j.scitotenv.2021.147653 ◽

2021 ◽

pp. 147653

Author(s):

Yi Sun ◽

Xingzhi Wang ◽

Jiayin Zhu ◽

Liangjian Chen ◽

Yuhang Jia ◽

...

Keyword(s):

Machine Learning ◽

Los Angeles ◽

Spatial Resolution ◽

Los Angeles County ◽

Green Space ◽

High Spatial Resolution ◽

Socioeconomic Disparities

Download Full-text

Classification of Very-High-Spatial-Resolution Aerial Images Based on Multiscale Features with Limited Semantic Information

Remote Sensing ◽

10.3390/rs13030364 ◽

2021 ◽

Vol 13 (3) ◽

pp. 364

Author(s):

Han Gao ◽

Jinhui Guo ◽

Peng Guo ◽

Xiuwan Chen

Keyword(s):

Deep Learning ◽

Land Cover ◽

Spatial Resolution ◽

Large Scale ◽

High Spatial Resolution ◽

Training Data ◽

Aerial Images ◽

Rural Landscapes ◽

Feature Representations ◽

Object Based

Recently, deep learning has become the most innovative trend for a variety of high-spatial-resolution remote sensing imaging applications. However, large-scale land cover classification via traditional convolutional neural networks (CNNs) with sliding windows is computationally expensive and produces coarse results. Additionally, although such supervised learning approaches have performed well, collecting and annotating datasets for every task are extremely laborious, especially for those fully supervised cases where the pixel-level ground-truth labels are dense. In this work, we propose a new object-oriented deep learning framework that leverages residual networks with different depths to learn adjacent feature representations by embedding a multibranch architecture in the deep learning pipeline. The idea is to exploit limited training data at different neighboring scales to make a tradeoff between weak semantics and strong feature representations for operational land cover mapping tasks. We draw from established geographic object-based image analysis (GEOBIA) as an auxiliary module to reduce the computational burden of spatial reasoning and optimize the classification boundaries. We evaluated the proposed approach on two subdecimeter-resolution datasets involving both urban and rural landscapes. It presented better classification accuracy (88.9%) compared to traditional object-based deep learning methods and achieves an excellent inference time (11.3 s/ha).

Download Full-text

Plant drought impact detection using ultra-high spatial resolution hyperspectral images and machine learning

International Journal of Applied Earth Observation and Geoinformation ◽

10.1016/j.jag.2021.102364 ◽

2021 ◽

Vol 102 ◽

pp. 102364

Author(s):

Phuong D. Dao ◽

Yuhong He ◽

Cameron Proctor

Keyword(s):

Machine Learning ◽

Spatial Resolution ◽

High Spatial Resolution ◽

Hyperspectral Images ◽

Drought Impact ◽

Impact Detection

Download Full-text

Effects of Training Set Size on Supervised Machine-Learning Land-Cover Classification of Large-Area High-Resolution Remotely Sensed Data

Remote Sensing ◽

10.3390/rs13030368 ◽

2021 ◽

Vol 13 (3) ◽

pp. 368

Author(s):

Christopher A. Ramezan ◽

Timothy A. Warner ◽

Aaron E. Maxwell ◽

Bradley S. Price

Keyword(s):

Machine Learning ◽

Sample Size ◽

Remotely Sensed ◽

Training Data ◽

Supervised Machine Learning ◽

Sample Sizes ◽

Remotely Sensed Data ◽

Large Area ◽

Training Set ◽

Set Size

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU, and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU however required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically less than RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, and minimal variations in overall accuracy between very large and small sample sets, as well as relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.

Download Full-text

Improving 3-m Resolution Land Cover Mapping through Efficient Learning from an Imperfect 10-m Resolution Map

Remote Sensing ◽

10.3390/rs12091418 ◽

2020 ◽

Vol 12 (9) ◽

pp. 1418

Author(s):

Runmin Dong ◽

Cong Li ◽

Haohuan Fu ◽

Jie Wang ◽

Weijia Li ◽

...

Keyword(s):

Land Cover ◽

Training Data ◽

Training Dataset ◽

Land Cover Mapping ◽

Remotely Sensed Data ◽

Large Area ◽

National Scale ◽

Substantial Progress ◽

Efficient Learning ◽

Land Cover Maps

Substantial progress has been made in the field of large-area land cover mapping as the spatial resolution of remotely sensed data increases. However, a significant amount of human power is still required to label images for training and testing purposes, especially in high-resolution (e.g., 3-m) land cover mapping. In this research, we propose a solution that can produce 3-m resolution land cover maps on a national scale without human efforts being involved. First, using the public 10-m resolution land cover maps as an imperfect training dataset, we propose a deep learning based approach that can effectively transfer the existing knowledge. Then, we improve the efficiency of our method through a network pruning process for national-scale land cover mapping. Our proposed method can take the state-of-the-art 10-m resolution land cover maps (with an accuracy of 81.24% for China) as the training data, enable a transferred learning process that can produce 3-m resolution land cover maps, and further improve the overall accuracy (OA) to 86.34% for China. We present detailed results obtained over three mega cities in China, to demonstrate the effectiveness of our proposed approach for 3-m resolution large-area land cover mapping.

Download Full-text

Curator: A No-Code Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery

10.5194/egusphere-egu21-6853 ◽

2021 ◽

Author(s):

Rudy Venguswamy ◽

Mike Levy ◽

Anirudh Koul ◽

Satyarth Praveen ◽

Tarun Narayanan ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Open Source ◽

Forest Fires ◽

Seed Set ◽

Training Data ◽

Training Dataset ◽

Reference Image ◽

Query Image ◽

Real World Datasets

Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack.&#160;We present a no-code open-source tool, Curator, whose goal is to minimize the amount of human manual image labeling needed to achieve a state of the art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervision training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them.&#160;In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (like SimCLR) on a random subset of the dataset (that conforms to researchers&#8217; specified &#8220;training budget.&#8221;). Since real-world datasets are often imbalanced leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset. Then, images with equidistant embeddings are sampled. This iterative training and resampling strategy improves both balanced training data and models every iteration. In step 2, researchers supply an example image of interest, and the output embeddings generated from this image are used to find other images with embeddings near the reference image&#8217;s embedding in euclidean space (hence similar looking images to the query image). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier to identify more candidate images for human inspection with active learning. Each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images that the model is most uncertain about (p &#8776; 0.5).Curator is released as an open-source package built on PyTorch-Lightning. The pipeline uses GPU-based transforms from the NVIDIA-Dali package for augmentation, leading to a 5-10x speed up in self-supervised training and is run from the command line.By iteratively training a self-supervised model and a classifier in tandem with human manual annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets which were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires, atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers&#8217; data curation efforts.

Download Full-text

A machine learning framework to determine geolocations from metagenomic profiling

Biology Direct ◽

10.1186/s13062-020-00278-z ◽

2020 ◽

Vol 15 (1) ◽

Cited By ~ 1

Author(s):

Lihong Huang ◽

Canqiang Xu ◽

Wenxian Yang ◽

Rongshan Yu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Geographic Origin ◽

Training Data ◽

Metagenomic Data ◽

Training Dataset ◽

Kriging Interpolation ◽

Learning Framework ◽

Testing Data ◽

Microbial Samples

Abstract Background Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomics profiling of microbial samples. Results Our method was applied to the multi-source microbiome data from MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets.The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities that are not sampled before for the testing data, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples. Conclusion Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.

Download Full-text

The application of object based analysis of high spatial resolution imagery for mapping large coral reef systems in the West Pacific at geomorphic and benthic community spatial scales

2010 IEEE International Geoscience and Remote Sensing Symposium ◽

10.1109/igarss.2010.5654126 ◽

2010 ◽

Cited By ~ 10

Author(s):

Chris Roelfsema ◽

Stuart Phinn ◽

Stacy Jupiter ◽

James Comley ◽

Marie Beger ◽

...

Keyword(s):

Coral Reef ◽

Spatial Resolution ◽

Benthic Community ◽

High Spatial Resolution ◽

Spatial Scales ◽

The West ◽

West Pacific ◽

Object Based ◽

Large Coral ◽

High Spatial Resolution Imagery

Download Full-text

Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification

Remote Sensing ◽

10.3390/rs11020185 ◽

2019 ◽

Vol 11 (2) ◽

pp. 185 ◽

Cited By ~ 20

Author(s):

Christopher A. Ramezan ◽

Timothy A. Warner ◽

Aaron E. Maxwell

Keyword(s):

Machine Learning ◽

Monte Carlo ◽

Spatial Resolution ◽

Cross Validation ◽

High Spatial Resolution ◽

Regional Scale ◽

Sample Selection ◽

Selection Methods ◽

Machine Learning Classification ◽

Leave One Out

High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land covers over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, sample selection methods for acquiring training and validation data for machine learning, and cross-validation techniques for tuning classifier parameters are rarely investigated, particularly on large, high spatial resolution datasets. This work, therefore, examines four sample selection methods—simple random, proportional stratified random, disproportional stratified random, and deliberative sampling—as well as three cross-validation tuning approaches—k-fold, leave-one-out, and Monte Carlo methods. In addition, the effect on the accuracy of localizing sample selections to a small geographic subset of the entire area, an approach that is sometimes used to reduce costs associated with training data collection, is investigated. These methods are investigated in the context of support vector machines (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agricultural Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters, covering a 2,609 km2 regional-scale area in northeastern West Virginia, USA. Stratified-statistical-based sampling methods were found to generate the highest classification accuracy. Using a small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy for the different cross-validation tuning methods. The processing time for Monte Carlo and leave-one-out cross-validation were high, especially with large training sets. For this reason, k-fold cross-validation appears to be a good choice. Classifications trained with samples collected deliberately (i.e., not randomly) were less accurate than classifiers trained from statistical-based samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, samples for training should be selected randomly; deliberative samples should be avoided.

Download Full-text