A generalizable and accessible approach to machine learning with global satellite imagery

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Esther Rolf ◽  
Jonathan Proctor ◽  
Tamma Carleton ◽  
Ian Bolliger ◽  
Vaishaal Shankar ◽  
...  

Abstract. Combining satellite imagery with machine learning (SIML) has the potential to address global challenges by remotely estimating socioeconomic and environmental conditions in data-poor regions, yet the resource requirements of SIML limit its accessibility and use. We show that a single encoding of satellite imagery can generalize across diverse prediction tasks (e.g., forest cover, house price, road length). Our method achieves accuracy competitive with deep neural networks at orders of magnitude lower computational cost, scales globally, delivers label super-resolution predictions, and facilitates characterizations of uncertainty. Since image encodings are shared across tasks, they can be centrally computed and distributed to unlimited researchers, who need only fit a linear regression to their own ground truth data in order to achieve state-of-the-art SIML performance.
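The workflow described in the abstract (precomputed, task-agnostic image encodings plus a per-task linear model) can be illustrated with a minimal sketch. The feature matrix X below is a synthetic stand-in for the centrally distributed encodings, and y for one task's ground truth labels; the regularized regression is one reasonable reading of "fit a linear regression".

```python
# Minimal sketch of the shared-encoding workflow: X stands in for the
# distributed image encodings, y for one task's ground truth labels.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_locations, k_features = 5_000, 1_024
X = rng.standard_normal((n_locations, k_features))       # shared encodings
w = rng.standard_normal(k_features) / np.sqrt(k_features)
y = X @ w + 0.5 * rng.standard_normal(n_locations)       # e.g., forest cover

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A downstream task needs only a (regularized) linear regression on top of
# the same fixed encodings.
model = RidgeCV(alphas=np.logspace(-4, 4, 9)).fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```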

2020 ◽  
Vol 12 (21) ◽  
pp. 3489
Author(s):  
Evangelos Alevizos

Mapping shallow bathymetry by means of optical remote sensing has been a challenging task of growing interest in recent years. In particular, many studies apply earlier empirical models to the latest multispectral satellite imagery (e.g., Sentinel-2, Landsat 8). However, in these studies, the accuracy of the resulting bathymetry is (a) limited in deeper waters (>15 m) and/or (b) influenced by seafloor-type albedo. This study further explores the capabilities of hyperspectral satellite imagery (Hyperion), which provides several spectral bands in the visible spectrum, along with existing reference bathymetry. Bathymetry predictors are created by applying the semi-empirical band-ratio approach to the hyperspectral imagery. These predictors are then fed to machine learning regression algorithms to predict bathymetry. Algorithm performance is further compared with bathymetry predictions from multiple linear regression analysis. Following the initial predictions, the residual bathymetry values are interpolated using the Ordinary Kriging method. The predicted bathymetry from all three algorithms, along with their associated residual grids, is then used as predictors in a second processing stage. Validation results show that the second processing stage improves the root-mean-square error of the predicted bathymetry by ≈1 m, even in deeper water (up to 25 m). This approach is suggested as suitable for (a) contributing wide-scale, high-resolution shallow bathymetry toward the goals of the Seabed 2030 program and (b) serving as a coarse-resolution alternative to time-consuming single-beam sonar or costly airborne bathymetric laser surveying.
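A hedged sketch of the two-stage scheme is below, with synthetic data standing in for Hyperion bands and reference depths. The Ordinary Kriging step is stood in by a Gaussian process on the residuals, a closely related spatial interpolator; band names and all constants are assumptions.

```python
# Two-stage bathymetry sketch: band-ratio predictors -> regression ->
# spatial interpolation of residuals -> second-stage regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
n = 500
blue, green = rng.random(n) + 0.1, rng.random(n) + 0.1   # reflectance bands
coords = rng.random((n, 2))                              # easting/northing
depth = rng.uniform(0, 25, n)                            # reference depths (m)

# Stage 1: semi-empirical band-ratio predictor (Stumpf-style log ratio).
ratio = np.log(1000 * blue) / np.log(1000 * green)
X1 = np.column_stack([ratio, blue, green])
stage1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X1, depth)
residuals = depth - stage1.predict(X1)

# Interpolate the residuals spatially (kriging stand-in).
residual_grid = GaussianProcessRegressor().fit(coords, residuals).predict(coords)

# Stage 2: first-stage predictions plus interpolated residuals as predictors.
X2 = np.column_stack([stage1.predict(X1), residual_grid])
stage2 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X2, depth)
rmse = np.sqrt(np.mean((stage2.predict(X2) - depth) ** 2))
print(f"in-sample RMSE: {rmse:.2f} m")
```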


2020 ◽  
Author(s):  
Jingbai Li ◽  
Patrick Reiser ◽  
André Eberhard ◽  
Pascal Friederich ◽  
Steven Lopez

Photochemical reactions are increasingly used to construct complex molecular architectures under mild and straightforward reaction conditions. Computational techniques are increasingly important for understanding the reactivities and chemoselectivities of photochemical isomerization reactions because they offer molecular bonding information along the excited-state photodynamics. These photodynamics simulations are resource-intensive and are typically limited to 1–10 picoseconds and 1,000 trajectories due to high computational cost. Most organic photochemical reactions have excited-state lifetimes exceeding 1 picosecond, which places them beyond the reach of such studies. Westermayr et al. demonstrated that a machine learning approach could significantly lengthen photodynamics simulation times for a model system, the methylenimmonium cation (CH₂NH₂⁺).

We have developed a Python-based code, Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics (PyRAI²MD), to accomplish the unprecedented 10 ns cis-trans photodynamics of trans-hexafluoro-2-butene (CF₃–CH=CH–CF₃) in 3.5 days. The same simulation would take approximately 58 years with ground-truth multiconfigurational dynamics. We proposed an innovative scheme combining Wigner sampling, geometrical interpolations, and short-time quantum chemical trajectories to effectively sample the initial data, facilitating adaptive sampling to generate an informative and data-efficient training set of 6,232 data points. Our neural networks achieved chemical accuracy (mean absolute error of 0.032 eV). Our 4,814 trajectories reproduced the S₁ half-life (60.5 fs) and the photochemical product ratio (trans:cis = 2.3:1), and autonomously discovered a pathway towards a carbene. The neural networks have also shown the capability of generalizing the full potential energy surface from chemically incomplete data (trans → cis but not cis → trans pathways), which may enable future automated photochemical reaction discoveries.
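The sampling scheme outlined in the abstract (initial Wigner-like sampling, surrogate training, and adaptive selection of geometries where the model is uncertain) can be caricatured in a few lines. Everything below is a toy: a 1D stand-in "potential" replaces the multiconfigurational calculations, a bootstrap polynomial ensemble replaces the neural networks, and a fixed grid replaces the MD trajectories; only the structure of the loop reflects the described approach.

```python
# Toy adaptive-sampling loop, schematic only (not PyRAI2MD itself).
import numpy as np

rng = np.random.default_rng(2)

def ab_initio_energy(x):
    # Stand-in for an expensive multiconfigurational calculation.
    return np.sin(3 * x) + 0.5 * x**2

def train_ensemble(xs, ys, n_models=4, degree=7):
    # Toy surrogate ensemble: polynomial fits on bootstrap resamples.
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(xs), len(xs))
        models.append(np.polynomial.Polynomial.fit(xs[idx], ys[idx], degree))
    return models

# Initial data: a small Wigner-like sample around the starting geometry.
xs = rng.normal(0.0, 0.3, 16)
ys = ab_initio_energy(xs)

for round_ in range(20):
    models = train_ensemble(xs, ys)
    grid = np.linspace(-2.0, 2.0, 400)            # geometries visited by "MD"
    preds = np.stack([m(grid) for m in models])
    uncertain = grid[preds.std(axis=0) > 0.032]   # ensemble disagreement flag
    if uncertain.size == 0:
        break                                     # surface adequately covered
    picks = rng.choice(uncertain, size=min(4, uncertain.size), replace=False)
    xs = np.concatenate([xs, picks])
    ys = np.concatenate([ys, ab_initio_energy(picks)])

print(f"rounds used: {round_ + 1}, training points: {len(xs)}")
```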


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in a virtually infinite number of ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: (1) managing the quality of ground truth data, (2) generating training data for the machine learning model using labeled documents, and (3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without them. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
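The inputs the abstract names (a target schema plus labeled documents, from which training data are generated) can be made concrete with a small illustrative sketch. The field names, candidate patterns, and matching rule below are hypothetical simplifications of challenge (2), not Glean's actual pipeline.

```python
# Illustrative sketch: derive (field, candidate, label) training triples
# from a target schema and a labeled document. All names are hypothetical.
from dataclasses import dataclass
import re

TARGET_SCHEMA = {"invoice_date": "date", "total_amount": "currency"}

@dataclass
class LabeledDoc:
    text: str
    labels: dict  # ground truth: field name -> value string

CANDIDATE_PATTERNS = {
    "date": r"\d{2}/\d{2}/\d{4}",
    "currency": r"\$\d[\d,]*\.\d{2}",
}

def training_triples(doc: LabeledDoc):
    # Every span matching a field's type is a candidate; it is a positive
    # example iff it equals the labeled ground-truth value.
    for field, ftype in TARGET_SCHEMA.items():
        for match in re.finditer(CANDIDATE_PATTERNS[ftype], doc.text):
            yield field, match.group(), match.group() == doc.labels.get(field)

doc = LabeledDoc(
    text="Invoice 01/15/2020 ... Due 02/14/2020 ... Total: $1,250.00",
    labels={"invoice_date": "01/15/2020", "total_amount": "$1,250.00"},
)
for triple in training_triples(doc):
    print(triple)
```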


Author(s):  
Siyu Liao ◽  
Bo Yuan

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have emerged as a powerful technique in various machine learning applications. However, the large model sizes of DNNs place high demands on computational resources and weight storage, thereby limiting the practical deployment of DNNs. To overcome these limitations, this paper proposes imposing circulant structure on the construction of convolutional layers, leading to circulant convolutional layers (CircConvs) and circulant CNNs. The circulant models can be either trained from scratch or re-trained from a pre-trained non-circulant model, making the approach flexible across training environments. Through extensive experiments, this strong structure-imposing approach is shown to substantially reduce the number of parameters in convolutional layers and to enable significant savings in computational cost by using fast multiplication of the circulant tensor.
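Why circulant structure cuts both parameters and compute can be seen in a minimal sketch: an n × n circulant matrix is fully determined by a single length-n vector, and its matrix-vector product reduces to elementwise multiplication in the Fourier domain (O(n log n) rather than O(n²)). The layers in the paper use more elaborate circulant tensors, but the principle is the same.

```python
# Circulant matrix-vector product via FFT: n stored weights instead of n^2,
# and O(n log n) multiplication instead of O(n^2).
import numpy as np
from scipy.linalg import circulant

n = 8
c = np.random.default_rng(3).standard_normal(n)  # the only stored weights
x = np.random.default_rng(4).standard_normal(n)

dense = circulant(c) @ x                                 # explicit n x n product
fast = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real  # same result via FFT

assert np.allclose(dense, fast)
```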


Monica M. Cole (Bedford College, London, U. K.). In contributing to a discussion of the use of multispectral satellite imagery in the exploration for petroleum and minerals covered by Mr Peters I wish to emphasize four points, some of which are relevant also to statements made by Dr Curran in his presentation. The first point is that remotely sensed imagery is a tool and its interpretation a technique to be used as appropriate and integrated with other techniques in mineral exploration. Mr Peters has reviewed the potential of multispectral satellite imagery and emphasized its value in initial reconnaissance studies notably for the identification of geological structures and lithologies. I would emphasize also its value at more advanced stages of exploration when reinterpretation of imagery at large scales and with reference to ground truth data can yield valuable information.

My second point, which follows naturally from the first, is that effective interpretation of remotely sensed imagery requires an appreciation of the geographical environment as well as the geological environment. It is reflectances from the components of the geographical environment that produce the colours and tones seen on the colour composites generated from Landsat imagery. Except in arid areas largely devoid of plant cover, in natural terrain reflectances from vegetation dominate over those from soils and bedrock. Their contribution increases with increasing density of cover. The reflectances from different types of vegetation and from individual plant species, however, vary greatly, depending on the geometry of the canopy, the colour of foliage, the size, shape, angle, etc., of leaves, and the turgidity, water content and nutrient status of leaf cells. It is the differences in vegetation cover producing differing reflectances that permit the discrimination of lithologies and identification of structures on colour composites generated from Landsat imagery.

In some areas, however, any or all of relict laterite, superficial cover, former and ephemeral drainage systems, and other physiographic features that are the legacies of geomorphological processes, complicate relations. These need to be understood for effective evaluation of imagery for geological purposes. In this context there is no substitute for field investigations, which are essential for the acquisition of ground truth data needed for effective evaluation of imagery.


Author(s):  
S. S. Ray

Abstract. Crop classification and recognition is a very important application of remote sensing. In the last few years, machine learning classification techniques have been emerging for crop classification. Google Earth Engine (GEE) is a platform for exploring multiple satellite datasets with different advanced classification techniques without even downloading the satellite data. The main objective of this study is to explore the ability of different machine learning classification techniques, such as Random Forest (RF), Classification And Regression Trees (CART), and Support Vector Machine (SVM), for crop classification. High-resolution optical data, Sentinel-2 MSI (10 m), were used for classification of the major crops in the Indian Agricultural Research Institute (IARI) farm for the Rabi season 2016. Around 100 crop fields (~400 hectares) in IARI were analysed. Smartphone-based ground truth data were collected. The best cloud-free Sentinel-2 MSI image (5 Feb 2016) was selected for classification by automatic filtering on the cloud-cover percentage property in GEE. Polygons based on the ground truth data were used as the training feature space for crop classification with the machine learning techniques. Post-classification accuracy assessment was done through the generation of the confusion matrix (producer and user accuracy), the kappa coefficient, and the F value. The study found that with GEE as a cloud platform, satellite data access, filtering, and pre-processing could be done very efficiently. In terms of overall classification accuracy and kappa coefficient, the Random Forest (93.3%, 0.9178) and CART (73.4%, 0.6755) classifiers performed better than the SVM (74.3%, 0.6867) classifier. For validation, data from the Field Operation Service Unit (FOSU) division of IARI were used, and encouraging results were obtained.
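A hedged sketch of this workflow in the Earth Engine Python API is below. The asset IDs, date window, band list, and the training/validation FeatureCollections (assumed to carry a numeric 'crop' property) are placeholders, not the study's actual assets.

```python
# Hedged sketch of the GEE crop-classification workflow described above.
import ee
ee.Initialize()

# Hypothetical ground-truth assets (polygons for training, points for validation).
training_polygons = ee.FeatureCollection('users/example/iari_training_polygons')
validation_points = ee.FeatureCollection('users/example/iari_validation_points')

# Pick the least cloudy Sentinel-2 image over the farm in early February 2016.
image = (ee.ImageCollection('COPERNICUS/S2')
         .filterBounds(training_polygons)
         .filterDate('2016-02-01', '2016-02-10')
         .sort('CLOUDY_PIXEL_PERCENTAGE')
         .first())
bands = ['B2', 'B3', 'B4', 'B8']  # 10 m visible and NIR bands

# Sample the training polygons and train a Random Forest classifier.
training = image.select(bands).sampleRegions(
    collection=training_polygons, properties=['crop'], scale=10)
classifier = ee.Classifier.smileRandomForest(100).train(
    features=training, classProperty='crop', inputProperties=bands)
classified = image.select(bands).classify(classifier)

# Accuracy assessment: confusion matrix, overall accuracy, kappa.
validated = classified.sampleRegions(
    collection=validation_points, properties=['crop'], scale=10)
matrix = validated.errorMatrix('crop', 'classification')
print(matrix.accuracy().getInfo(), matrix.kappa().getInfo())
```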


2018 ◽  
Author(s):  
Christian Damgaard

Abstract. To fit population ecological models, e.g. plant competition models, to new drone-aided image data, we need to develop statistical models that take into account the new type of measurement uncertainty introduced by machine-learning algorithms, and that quantify its importance for statistical inference and ecological prediction. Here, it is proposed to quantify the uncertainty and bias of image-predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate of the species identification process be minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method for statistically modeling known sources of uncertainty in machine-learning predictions may be relevant to other applied scientific disciplines.
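One way to read "a hierarchical statistical model linked to pin-point ground truth" is sketched below, with simulated data. The priors, the additive bias/noise observation model for the machine-learning predictions, and all constants are assumptions for illustration, not the author's model.

```python
# Illustrative hierarchical model (not the author's specification): latent
# per-plot cover generates both pin-point counts and image-based predictions,
# with the machine-learning error entering as explicit bias and noise terms.
import numpy as np
import arviz as az
import pymc as pm

rng = np.random.default_rng(5)
n_plots, n_pins = 40, 25
true_cover = rng.beta(2, 5, n_plots)
pin_hits = rng.binomial(n_pins, true_cover)                    # pin-point data
image_pred = np.clip(true_cover + rng.normal(0, 0.08, n_plots), 0.01, 0.99)

with pm.Model():
    cover = pm.Beta('cover', 2, 2, shape=n_plots)              # latent abundance
    pm.Binomial('pins', n=n_pins, p=cover, observed=pin_hits)  # ground truth
    bias = pm.Normal('bias', 0.0, 0.2)                         # systematic ML error
    sigma = pm.HalfNormal('sigma', 0.2)                        # ML noise
    pm.Normal('image', mu=cover + bias, sigma=sigma, observed=image_pred)
    trace = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(az.summary(trace, var_names=['bias', 'sigma']))
```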


2021 ◽  
Author(s):  
Jiali Wang ◽  
Zhengchun Liu ◽  
Ian Foster ◽  
Won Chang ◽  
Rajkumar Kettimuthu ◽  
...  

Abstract. This study develops a neural network-based approach for emulating high-resolution modeled precipitation data with comparable statistical properties at greatly reduced computational cost. The key idea is to use a combination of low- and high-resolution simulations to train a neural network to map from the former to the latter. Specifically, we define two types of CNNs, one that stacks variables directly and one that encodes each variable before stacking, and we train each CNN type both with a conventional loss function, such as mean square error (MSE), and with a conditional generative adversarial network (CGAN), for a total of four CNN variants. We compare the high-resolution precipitation produced by the four new CNNs with precipitation from the original high-resolution simulations, a bilinear interpolator, and a state-of-the-art CNN-based super-resolution (SR) technique. Results show that the SR technique produces results similar to those of the bilinear interpolator, with smoother spatial and temporal distributions and smaller variability and extremes than the high-resolution simulations. While the new CNNs trained with MSE generate better results over some regions than the interpolator and the SR technique do, their predictions are still not as close to the ground truth. The CNNs trained with CGAN generate more realistic and physically plausible results, better capturing not only variability in time and space but also extremes such as intense and long-lasting storms. Once trained (training takes 4 hours on one GPU), the proposed CNN-based downscaling approach can downscale 30 years of precipitation from 50 km to 12 km resolution in 14 minutes, whereas conventional dynamical downscaling would take about one month on 600 CPU cores to generate simulations at 12 km resolution over the contiguous United States.
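A minimal PyTorch sketch of the simpler variant (variables stacked directly as channels, trained with an MSE loss to map coarse fields to fine fields) is below. The channel count, 4x upscale factor, network depth, and random tensors are stand-ins for the real model and simulation data; the CGAN variants would add a discriminator and an adversarial loss on top of this.

```python
# Sketch of a coarse-to-fine CNN emulator: stacked input variables as
# channels, upsampling inside the network, MSE training objective.
import torch
import torch.nn as nn

class Downscaler(nn.Module):
    def __init__(self, in_vars=4, scale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_vars, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1),      # high-res precipitation field
        )

    def forward(self, x):
        return self.net(x)

model = Downscaler()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
coarse = torch.randn(8, 4, 32, 32)    # batch of stacked low-res variables
fine = torch.randn(8, 1, 128, 128)    # matching high-res precipitation

for step in range(5):                 # a few illustrative training steps
    loss = nn.functional.mse_loss(model(coarse), fine)
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, float(loss))
```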


2020 ◽  
Vol 12 (3) ◽  
pp. 355 ◽  
Author(s):  
Nam Thang Ha ◽  
Merilyn Manley-Harris ◽  
Tien Dat Pham ◽  
Ian Hawes

Seagrass has been acknowledged as a productive blue carbon ecosystem that is in significant decline across much of the world. A first step toward conservation is the mapping and monitoring of extant seagrass meadows. Several methods are currently in use, but mapping the resource from satellite images using machine learning is not widely applied, despite its successful use in various comparable applications. This research aimed to develop a novel approach for seagrass monitoring using state-of-the-art machine learning with data from Sentinel-2 imagery. We used Tauranga Harbor, New Zealand, a validation site for which extensive ground truth data are available, to compare ensemble machine learning methods involving random forests (RF), rotation forests (RoF), and canonical correlation forests (CCF) with the more traditional maximum likelihood classifier (MLC) technique. Using a group of validation metrics including F1, precision, recall, accuracy, and the McNemar test, our results indicated that the machine learning techniques outperformed the MLC, with RoF the best performer (F1 scores of 0.75 and 0.91 for sparse and dense seagrass meadows, respectively). To our knowledge, our study is the first comparison of these ensemble-based methods for seagrass mapping, and the approach promises to enhance the accuracy of seagrass monitoring.
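The validation-metric comparison can be sketched as below, with synthetic data. Rotation forests and canonical correlation forests are not available in scikit-learn, so two stock ensembles stand in for the compared classifiers; the McNemar test is computed from the 2x2 table of correct/incorrect predictions.

```python
# Sketch of comparing two classifiers with F1/precision/recall and McNemar.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from statsmodels.stats.contingency_tables import mcnemar

X, y = make_classification(n_samples=2000, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred_a = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
pred_b = ExtraTreesClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print(classification_report(y_te, pred_a))   # per-class F1, precision, recall

# McNemar's test on the 2x2 table of correct/incorrect predictions.
a_ok, b_ok = pred_a == y_te, pred_b == y_te
table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
print(mcnemar(table, exact=True))
```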


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ranjit Mahato ◽  
Gibji Nimasow ◽  
Oyi Dai Nimasow ◽  
Dhoni Bushi

Abstract. The Sonitpur and Udalguri districts of Assam possess rich tropical forests with equally important faunal species. The Nameri National Park, Sonai-Rupai Wildlife Sanctuary, and other Reserved Forests are areas of attraction for tourists and wildlife lovers. However, these protected areas are reportedly facing the problem of encroachment and large-scale deforestation. Therefore, this study attempts to estimate forest cover change in the area by integrating remotely sensed data from 1990, 2000, 2010, and 2020 with a Geographic Information System. The Maximum Likelihood algorithm-based supervised classification shows acceptable agreement between the classified images and the ground truth data, with an overall accuracy of about 96% and a Kappa coefficient of 0.95. The results reveal a forest cover loss of 7.47% from 1990 to 2000 and 7.11% from 2000 to 2010. However, there was a slight gain of 2.34% in forest cover from 2010 to 2020. The net change from forest to non-forest was 195.17 km2 over the last forty years. The forest transition map shows a declining trend in forest remaining forest until 2010 and a slight increase thereafter. There was a considerable decline in forest-to-non-forest conversion (from 11.94% to 3.50%) between the 2000–2010 and 2010–2020 periods. Further, a perceptible gain was also observed in non-forest-to-forest conversion during the last four decades. The overlay analysis of the forest cover maps shows an area of 460.76 km2 (28.89%) as forest (unchanged), 764.21 km2 (47.91%) as non-forest (unchanged), 282.67 km2 (17.72%) as deforestation, and 87.50 km2 (5.48%) as afforestation. The study found hotspots of deforestation in the areas closest to the National Park, Wildlife Sanctuary, and Reserved Forests, due to encroachment for human habitation, agriculture, and timber/fuelwood extraction. Therefore, the study suggests an early declaration of these protected areas as Eco-Sensitive Zones to control the increasing trend of deforestation.
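The overlay (change-detection) step can be sketched as a cross-tabulation of two classified rasters, as below. The file paths, the class coding (1 = forest, 0 = non-forest), and the 30 m pixel size are assumptions for illustration.

```python
# Sketch of the overlay analysis: cross-tabulate two classified rasters
# into unchanged/deforestation/afforestation areas.
import numpy as np
import rasterio

with rasterio.open('classified_1990.tif') as src:   # hypothetical paths
    c1990 = src.read(1)
with rasterio.open('classified_2020.tif') as src:
    c2020 = src.read(1)

pixel_km2 = (30 * 30) / 1e6                         # assumed 30 m pixels

transitions = {
    'forest (unchanged)':     (c1990 == 1) & (c2020 == 1),
    'non-forest (unchanged)': (c1990 == 0) & (c2020 == 0),
    'deforestation':          (c1990 == 1) & (c2020 == 0),
    'afforestation':          (c1990 == 0) & (c2020 == 1),
}
total = c1990.size * pixel_km2
for name, mask in transitions.items():
    area = mask.sum() * pixel_km2
    print(f"{name}: {area:.2f} km2 ({100 * area / total:.2f}%)")
```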

