Applying machine learning and differential evolution optimization for soil texture predictions at national scale (Germany)

Author(s):  
Anika Gebauer ◽  
Ali Sakhaee ◽  
Axel Don ◽  
Mareike Ließ

In order to assess the carbon and water storage capacity of agricultural soils at the national scale (Germany), spatially continuous, high-resolution soil information on the particle size distribution is an essential requirement. Machine learning models are good at approximating complex, composite non-linear functions. They can be trained on point data to relate soil properties (response variable) to approximations of the soil-forming factors (predictors). The resulting models can then be used for spatial soil property predictions.

We developed models for topsoil texture regionalization using two powerful algorithms: the boosted regression trees machine learning algorithm, and the differential evolution algorithm applied for parameter tuning. Texture data (clay, silt, sand) originated from two sources: (1) the new soil database of the German Agricultural Soil Inventory (BZE), and (2) the well-known, publicly available database of the European Land Use/Cover Area frame Survey (LUCAS). The BZE texture data result from an eight-kilometer sampling raster (2991 sampling points); the LUCAS data from soils under agricultural use in Germany comprise 1377 sampling points. The predictor datasets included DEM-based topography variables, information on the geographic position, and legacy maps of soil systematic units. In a first step, a nested five-fold cross-validation approach was used to tune and train models on the BZE data. In a second step, the amount of training data was increased by adding two-thirds of the LUCAS data. Model performance was evaluated (1) by cross-validation (R²_CV) and (2) by using the remaining LUCAS data as an independent external test set (R²_external).

Models trained on the BZE data were able to predict the nation-wide spatial distribution of clay, silt, and sand (R²_CV = 0.57 – 0.76; R²_external = 0.68 – 0.83). Model performance was further enhanced by adding the LUCAS data to the training dataset.
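
As a rough illustration of the tuning setup described above (a boosted regression tree whose hyperparameters are searched by differential evolution against a cross-validation objective), the following Python sketch uses scikit-learn and SciPy on dummy data; the predictor set, parameter bounds, and library choices are assumptions rather than the authors' configuration.

```python
# Illustrative sketch: tuning a boosted regression tree with differential
# evolution, scored by cross-validated R^2. Bounds, predictor names, and the
# dummy data are placeholders, not the authors' setup.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))          # stand-in for DEM / position / legacy-map predictors
y = X[:, 0] * 3 + rng.normal(size=300)  # stand-in for clay, silt, or sand content

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

def neg_r2(params):
    """Objective for differential evolution: negative mean cross-validated R^2."""
    learning_rate, max_depth, subsample = params
    model = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=learning_rate,
        max_depth=int(round(max_depth)),
        subsample=subsample,
        random_state=0,
    )
    return -cross_val_score(model, X, y, cv=inner_cv, scoring="r2").mean()

# Search bounds and population size are kept small for the sketch.
bounds = [(0.005, 0.2), (1, 8), (0.5, 1.0)]
result = differential_evolution(neg_r2, bounds, maxiter=5, popsize=8,
                                seed=0, polish=False)
print("best hyperparameters:", result.x, "CV R^2:", -result.fun)
```

Setting `polish=False` skips the gradient-based refinement step, which is a reasonable choice for a noisy, non-smooth cross-validation objective.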

2020 ◽  
Vol 12 (9) ◽  
pp. 1418
Author(s):  
Runmin Dong ◽  
Cong Li ◽  
Haohuan Fu ◽  
Jie Wang ◽  
Weijia Li ◽  
...  

Substantial progress has been made in large-area land cover mapping as the spatial resolution of remotely sensed data increases. However, a significant amount of human labor is still required to label images for training and testing, especially in high-resolution (e.g., 3-m) land cover mapping. In this research, we propose a solution that can produce 3-m resolution land cover maps at a national scale without any human labeling effort. First, using public 10-m resolution land cover maps as an imperfect training dataset, we propose a deep learning based approach that can effectively transfer the existing knowledge. Then, we improve the efficiency of our method through a network pruning process for national-scale land cover mapping. Our proposed method takes the state-of-the-art 10-m resolution land cover maps (with an accuracy of 81.24% for China) as the training data, enables a transfer learning process that produces 3-m resolution land cover maps, and further improves the overall accuracy (OA) to 86.34% for China. We present detailed results obtained over three megacities in China to demonstrate the effectiveness of our proposed approach for 3-m resolution large-area land cover mapping.
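
To make the two-stage idea concrete, here is a hedged PyTorch sketch of (a) training a small fully convolutional network on per-pixel labels resampled from a coarser land cover map and (b) pruning its convolution weights afterwards; the architecture, band count, class count, and pruning amount are illustrative assumptions, not the authors' network.

```python
# Illustrative sketch (not the authors' network): training a tiny FCN on labels
# taken from a coarser land cover map, then pruning it to reduce inference cost.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

NUM_CLASSES = 10  # placeholder number of land cover classes

class TinyFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),   # 4-band imagery assumed
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(64, NUM_CLASSES, 1)  # per-pixel class scores

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyFCN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One toy training step on "imperfect" labels drawn from the coarser map.
images = torch.randn(2, 4, 64, 64)                          # 3-m image patches (dummy)
coarse_labels = torch.randint(0, NUM_CLASSES, (2, 64, 64))  # upsampled 10-m labels (dummy)
loss = criterion(model(images), coarse_labels)
loss.backward(); optimizer.step()

# Magnitude pruning of 30% of convolution weights, then making it permanent.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")
```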


2021 ◽  
Author(s):  
Lianteng Song ◽  
Zhonghua Liu ◽  
Chaoliu Li ◽  
Congqian Ning ◽  
...  

Geomechanical properties are essential for safe drilling, successful completion, and exploration of both conventional and unconventional reservoirs, e.g., deep shale gas and shale oil. Typically, these properties can be calculated from sonic logs. However, in shale reservoirs it is time-consuming and challenging to obtain reliable logging data due to borehole complexity and lack of information, which often results in log deficiency and a high recovery cost for incomplete datasets. In this work, we propose the bidirectional long short-term memory (BiLSTM) network, a supervised neural network algorithm widely used in sequential data-based prediction, to estimate geomechanical parameters. Prediction from log data can be conducted from two different aspects: (1) single-well prediction, where the log data from a single well are divided into training and testing data for cross-validation; (2) cross-well prediction, where a group of wells from the same geographical region is likewise divided into training and testing sets for cross-validation. The logs used in this work were collected from 11 wells in the Jimusaer Shale and include gamma ray, bulk density, and resistivity, among others. We employed five machine learning algorithms for comparison, among which BiLSTM showed the best performance, with an R-squared of more than 90% and an RMSE of less than 10. The predicted results can be directly used to calculate geomechanical properties, whose accuracy is also improved in contrast to conventional methods.
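
For readers unfamiliar with the architecture, the sketch below shows a minimal bidirectional LSTM regressor over depth-ordered log windows in PyTorch; the number of input curves, window length, and layer sizes are assumptions and do not reproduce the authors' model.

```python
# Minimal BiLSTM regression sketch: depth-ordered log samples as a sequence,
# a handful of input curves (e.g., GR, density, resistivity), one target curve.
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    def __init__(self, n_inputs=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # both directions concatenated

    def forward(self, x):                      # x: (batch, depth_steps, n_inputs)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)      # (batch, depth_steps)

model = BiLSTMRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy batch: 8 depth windows of 256 samples with 4 input logs each.
x = torch.randn(8, 256, 4)
y_true = torch.randn(8, 256)                   # target log (e.g., a sonic curve)
loss = loss_fn(model(x), y_true)
loss.backward(); optimizer.step()
print("training MSE:", loss.item())
```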


2021 ◽  
Author(s):  
Rudy Venguswamy ◽  
Mike Levy ◽  
Anirudh Koul ◽  
Satyarth Praveen ◽  
Tarun Narayanan ◽  
...  

Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires: a tedious, multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack.

We present a no-code open-source tool, Curator, whose goal is to minimize the amount of manual image labeling needed to achieve a state-of-the-art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervised training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, and (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them.

In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (such as SimCLR) on a random subset of the dataset (one that conforms to the researchers' specified "training budget"). Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset, and images with equidistant embeddings are then sampled. This iterative training and resampling strategy improves both the balance of the training data and the model at every iteration. In step 2, researchers supply an example image of interest, and the embeddings generated from this image are used to find other images whose embeddings lie near the reference image's embedding in Euclidean space (hence, images that look similar to the query image). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier that identifies further candidate images for human inspection with active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images the model is most uncertain about (p ≈ 0.5).

Curator is released as an open-source package built on PyTorch Lightning. The pipeline uses GPU-based transforms from the NVIDIA DALI package for augmentation, leading to a 5-10x speed-up in self-supervised training, and is run from the command line.

By iteratively training a self-supervised model and a classifier in tandem with manual human annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets that were previously untrainable with self-supervision algorithms alone. In applications such as detecting wildfires or atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers' data curation efforts.
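
The search-by-example and uncertainty-sampling steps (2 and 3) can be sketched as follows, assuming the self-supervised embeddings from step 1 are already available; the classifier, neighbor count, and batch size are illustrative, not Curator's actual internals.

```python
# Sketch of search-by-example (step 2) and uncertainty sampling (step 3),
# starting from precomputed embeddings. Shapes and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 128))   # one embedding per unlabeled image

# Step 2: find images whose embeddings are close (Euclidean) to a query image.
query = embeddings[123:124]                   # researcher-supplied example of interest
nn = NearestNeighbors(n_neighbors=50, metric="euclidean").fit(embeddings)
_, seed_candidates = nn.kneighbors(query)     # indices to review and label manually

# Step 3: train on the labeled seed set, then pick the most uncertain images
# (predicted probability nearest 0.5) for the next round of human labeling.
seed_idx = seed_candidates[0]
seed_labels = rng.integers(0, 2, size=seed_idx.size)   # stand-in for human labels
clf = LogisticRegression(max_iter=1000).fit(embeddings[seed_idx], seed_labels)

proba = clf.predict_proba(embeddings)[:, 1]
uncertainty = np.abs(proba - 0.5)
next_batch = np.argsort(uncertainty)[:100]    # send these to annotators next
print("next images to label:", next_batch[:10])
```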


2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Lihong Huang ◽  
Canqiang Xu ◽  
Wenxian Yang ◽  
Rongshan Yu

Background: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples' geolocations. In this paper, we present a machine learning framework to determine geolocations from metagenomic profiling of microbial samples.

Results: Our method was applied to the multi-source microbiome data from the MetaSUB (Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set; prediction performance was evaluated on the validation set. Using logistic regression with L2 regularization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits of the training and validation datasets. The testing data consist of samples from cities that do not occur in the training data. To predict these "mystery" cities that were not sampled before, we first defined biological coordinates for the sampled cities based on the similarity of the microbial samples from them. We then performed an affine transform on the map such that the distance between cities measures their biological difference rather than their geographical distance. After that, we derived the probability that a given testing sample comes from an unsampled city based on its predicted probabilities for the sampled cities, using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities of origin of testing samples.

Conclusion: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict geolocations for samples from locations that are not in the training dataset.
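
A minimal sketch of the repeated random-split evaluation described above, assuming features have already been extracted from the abundance profiles; the feature matrix, class count, and regularization strength below are placeholders.

```python
# Sketch: L2-regularized logistic regression scored over 100 random
# train/validation splits. Data are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random(size=(400, 200))         # stand-in for microbiome abundance features
y = rng.integers(0, 16, size=400)       # stand-in for city-of-origin labels

accuracies = []
for split in range(100):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=split)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=2000)
    clf.fit(X_tr, y_tr)
    accuracies.append(clf.score(X_va, y_va))

print("mean validation accuracy over 100 splits:", np.mean(accuracies))
```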


Author(s):  
Brett J. Borghetti ◽  
Joseph J. Giametta ◽  
Christina F. Rusnock

Objective: We aimed to predict operator workload from neurological data using statistical learning methods to fit neurological-to-state-assessment models. Background: Adaptive systems require real-time mental workload assessment to perform dynamic task allocations or operator augmentation as workload issues arise. Neuroergonomic measures have great potential for informing adaptive systems, and we combine these measures with models of task demand as well as information about critical events and performance to clarify the inherent ambiguity of interpretation. Method: We use machine learning algorithms on electroencephalogram (EEG) input to infer operator workload based upon Improved Performance Research Integration Tool workload model estimates. Results: Cross-participant models predict workload of other participants, statistically distinguishing between 62% of the workload changes. Machine learning models trained from Monte Carlo resampled workload profiles can be used in place of deterministic workload profiles for cross-participant modeling without incurring a significant decrease in machine learning model performance, suggesting that stochastic models can be used when limited training data are available. Conclusion: We employed a novel temporary scaffold of simulation-generated workload profile truth data during the model-fitting process. A continuous workload profile serves as the target to train our statistical machine learning models. Once trained, the workload profile scaffolding is removed and the trained model is used directly on neurophysiological data in future operator state assessments. Application: These modeling techniques demonstrate how to use neuroergonomic methods to develop operator state assessments, which can be employed in adaptive systems.
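
As an illustration of cross-participant modeling, the sketch below scores a classifier with leave-one-participant-out cross-validation on dummy EEG-derived features; the feature set, labels, and classifier are assumptions, not the study's models.

```python
# Sketch of cross-participant evaluation: train on some participants'
# EEG-derived features, test on a held-out participant (LeaveOneGroupOut).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 40))            # EEG band-power features (dummy)
y = rng.integers(0, 2, size=600)          # workload increase vs. decrease (dummy)
groups = np.repeat(np.arange(10), 60)     # 10 participants, 60 epochs each

scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                         X, y, groups=groups, cv=LeaveOneGroupOut())
print("per-participant accuracy:", np.round(scores, 2))
```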


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Hojjat Salehinejad ◽  
Jumpei Kitamura ◽  
Noah Ditkofsky ◽  
Amy Lin ◽  
Aditya Bharatha ◽  
...  

Machine learning (ML) holds great promise in transforming healthcare. While published studies have shown the utility of ML models in interpreting medical imaging examinations, these are often evaluated under laboratory settings. The importance of real-world evaluation is best illustrated by case studies that have documented successes and failures in the translation of these models into clinical environments. A key prerequisite for the clinical adoption of these technologies is demonstrating generalizable ML model performance under real-world circumstances. The purpose of this study was to demonstrate that ML model generalizability is achievable in medical imaging, with the detection of intracranial hemorrhage (ICH) on non-contrast computed tomography (CT) scans serving as the use case. An ML model was trained using 21,784 scans from the RSNA Intracranial Hemorrhage CT dataset, while generalizability was evaluated using an external validation dataset obtained from our busy trauma and neurosurgical center. This real-world external validation dataset consisted of every unenhanced head CT scan (n = 5965) performed in our emergency department in 2019, without exclusion. The model demonstrated an AUC of 98.4%, sensitivity of 98.8%, and specificity of 98.0% on the test dataset. On external validation, the model demonstrated an AUC of 95.4%, sensitivity of 91.3%, and specificity of 94.1%. Evaluating the ML model on a real-world external validation dataset that is temporally and geographically distinct from the training dataset indicates that ML generalizability is achievable in medical imaging applications.
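
For reference, the external-validation metrics reported above (AUC, sensitivity, specificity) can be computed from per-scan probabilities roughly as follows; the synthetic scores, labels, and the 0.5 operating threshold are assumptions.

```python
# Sketch: AUC, sensitivity, and specificity from per-scan ICH probabilities.
# Predictions and labels here are synthetic.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=1000)                              # ICH present / absent
y_prob = np.clip(y_true * 0.8 + rng.normal(0.1, 0.2, 1000), 0, 1)   # model scores

auc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, y_prob >= 0.5).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"AUC={auc:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```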


2014 ◽  
Vol 2014 (1) ◽  
pp. 660-672
Author(s):  
Zachary Nixon

For significant oil spills in remote areas with complex shoreline geometry, apportioning Shoreline Cleanup Assessment Technique (SCAT) survey effort is a complicated and difficult task. Aerial surveys are often used to select shoreline areas for ground survey after an initial prioritization based upon anecdotal reports or trajectory models, but aerial observers may have difficulty locating cryptic surface shoreline oiling in vegetated or other complex environments. In dynamic beach environments, stranded shoreline oiling may be rapidly buried, making aerial observation difficult. A machine learning based model is presented for estimating shoreline oiling probabilities from satellite-derived surface oil analysis products, wind summary data, and shoreline habitat type and geometry data. These inputs are increasingly available at spatial and temporal scales sufficient for tactical use, enabling model predictions to be generated within hours after satellite remote sensing products become available. The model was constructed using SCAT data from the Deepwater Horizon oil spill, satellite-derived surface oil analysis products generated during the spill by NOAA's National Environmental Satellite, Data, and Information Service (NESDIS) from a variety of satellite platforms of opportunity, and available shoreline geometry, character, and other preexisting data. The model involves the generation of a set of spatial indices of the relative over-water proximity of surface oil slicks, based upon the satellite-derived analysis products. The model then uses boosted regression trees (BRT), a flexible and relatively recently developed modeling methodology, to generate calibrated estimates of the probability of subsequent shoreline oiling based upon these indices, wind climatological data over the time period of interest, and other shoreline data. The model can be implemented via data preparation in any Geographic Information System (GIS) software coupled with the open-source statistical computing language R. The model is entirely probabilistic and makes no attempt to reproduce the physics of oil moving through the environment, as trajectory models do. It is best used in concert with such models to make estimates at different spatial scales, or when time and data requirements make implementation of fine-scale trajectory modeling impractical for tactical use. The details of model development and implementation, and assessments of model performance and limitations, are presented.
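
A hedged sketch of the probability-estimation step is given below using a boosted-tree classifier in Python; the study itself was implemented in R, and the proximity index, wind summary, and shoreline-type predictors shown here are placeholder names.

```python
# Illustrative sketch of per-segment oiling probability from a boosted-tree
# classifier. Predictors and labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
n_segments = 2000
X = np.column_stack([
    rng.random(n_segments),               # over-water proximity index to surface oil
    rng.random(n_segments),               # onshore wind frequency over the period
    rng.integers(0, 5, n_segments),       # coded shoreline habitat type
])
y = rng.integers(0, 2, n_segments)        # SCAT-observed oiling (yes/no)

brt = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=3)
brt.fit(X, y)
p_oiled = brt.predict_proba(X)[:, 1]      # estimated oiling probability per segment
print("mean predicted oiling probability:", p_oiled.mean().round(3))
```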


2020 ◽  
Author(s):  
Ali Sakhaee ◽  
Anika Gebauer ◽  
Mareike Ließ ◽  
Axel Don

Soil organic carbon (SOC) plays a crucial role in agricultural ecosystems. However, its abundance is spatially variable at different scales. In recent years, machine learning (ML) algorithms have become an important tool in the spatial prediction of SOC at regional to continental scales. Particularly in agricultural landscapes, the prediction of SOC is a challenging task.

In this study, we evaluate the capability of two ML algorithms (random forest and boosted regression trees) for topsoil (0 to 30 cm) SOC prediction in soils under agricultural use at the national scale for Germany. To build the models, 50 environmental covariates representing topography, climate, land use, and soil properties were selected. The SOC data came from the German Agricultural Soil Inventory (2947 sampling points). A nested 5-fold cross-validation was used for model tuning and evaluation, with hyperparameter tuning for both ML algorithms done by differential evolution optimization.

This approach allows exploring an extensive set of field data in combination with state-of-the-art pedometric tools. With a strict validation scheme, the geospatial model performance was assessed. Current results indicate that the spatial SOC variation is only to a minor extent predictable with the considered covariate data (<30% explained variance). This may partly be explained by a non-steady state of the SOC content in agricultural soils with respect to its environmental drivers. We discuss the challenges of geospatial modelling and the value of ML algorithms in pedometrics.
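
A brief sketch of the model-comparison step, estimating explained variance (R²) for random forest and boosted regression trees under cross-validation; the covariates and SOC response are synthetic, and the differential evolution tuning is omitted here.

```python
# Sketch: cross-validated R^2 for random forest vs. boosted regression trees
# on dummy covariate data (hyperparameter tuning not shown).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 50))            # 50 environmental covariates (dummy)
y = 0.4 * X[:, 0] + rng.normal(size=500)  # weakly explained SOC content (dummy)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("RF", RandomForestRegressor(n_estimators=300, random_state=0)),
                    ("BRT", GradientBoostingRegressor(random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.2f}")
```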


Science ◽  
2021 ◽  
Vol 371 (6535) ◽  
pp. eabe8628
Author(s):  
Marshall Burke ◽  
Anne Driscoll ◽  
David B. Lobell ◽  
Stefano Ermon

Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and improving resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of model performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight research directions for the field.


Author(s):  
Abhishek O. Tibrewal

Background: Current breast cancer (BC) recurrence models do not account for treatment modalities, one of the strongest prognostic factors. This analysis was conducted to apply a machine learning (ML) algorithm to identify BC patients at a higher recurrence risk.

Methods: The analysis is based on a downloadable BC Wisconsin dataset containing nine independent (socio-demographic, tumor-related, and treatment-related) variables and one dependent (recurrence) variable. Using the training dataset (a 70% sample), a multivariate logistic regression (LR) model was developed from the variables identified in univariate analysis (p<0.2). Model performance was assessed on the test dataset (the remaining 30%) using standard statistical measures. A nomogram was developed using the variables identified by the model (p<0.05), and its cut-off score categorized BC patients into high and low recurrence risk.

Results: 277 patients (81 with recurrence) were included. In univariate analysis, tumor size (p=0.002), number of invasive nodes (p<0.001), node capsule (p<0.001), degree of malignancy (p<0.001), and irradiation (p<0.001) were associated with recurrence. After balancing, both groups included 243 patients. Using the training dataset (n=342), invasive nodes (p<0.05), degree of malignancy (p<0.05), and irradiation (p=0.0009) were significant in the multivariate model. The model's accuracy and area under the curve (AUC) were 74% (66-81%) and 0.74 (0.67-0.81), respectively, in the test dataset (n=144). The nomogram's cut-off score of 55 has an AUC of 0.73 (0.66-0.80) for recurrence prediction, indicating fair discriminating ability.

Conclusions: The developed nomogram can be a valuable tool in guiding appropriate treatment based on recurrence risk. ML and data mining methods can be the future of the clinical decision process.
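
The screening-plus-multivariable-model workflow can be sketched as follows with a synthetic stand-in for the dataset; the variable names, effect sizes, and thresholds below illustrate the described procedure rather than reproduce the study's data.

```python
# Sketch: univariate screening (p < 0.2), multivariable logistic regression on
# a 70/30 split, and AUC on the test set. Data are synthetic placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 277
df = pd.DataFrame({
    "tumor_size": rng.integers(1, 6, n),
    "inv_nodes": rng.integers(0, 10, n),
    "deg_malig": rng.integers(1, 4, n),
    "irradiation": rng.integers(0, 2, n),
})
# Synthetic recurrence outcome loosely driven by nodes, malignancy, irradiation.
linpred = -3.0 + 0.3 * df["inv_nodes"] + 0.8 * df["deg_malig"] - 0.7 * df["irradiation"]
df["recurrence"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

# Univariate screening: keep predictors with p < 0.2 in single-variable logits.
candidates = []
for col in ["tumor_size", "inv_nodes", "deg_malig", "irradiation"]:
    fit = sm.Logit(df["recurrence"], sm.add_constant(df[col])).fit(disp=0)
    if fit.pvalues[col] < 0.2:
        candidates.append(col)

# 70/30 split, multivariable model on retained predictors, AUC on the test set.
train, test = train_test_split(df, test_size=0.3, stratify=df["recurrence"],
                               random_state=0)
model = sm.Logit(train["recurrence"], sm.add_constant(train[candidates])).fit(disp=0)
test_prob = model.predict(sm.add_constant(test[candidates]))
print("retained predictors:", candidates)
print("test AUC:", round(roc_auc_score(test["recurrence"], test_prob), 2))
```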

