Using Remote Sensing and Machine Learning to Locate Groundwater Discharge to Salmon-Bearing Streams

2021 ◽  
Vol 14 (1) ◽  
pp. 63
Author(s):  
Mary E. Gerlach ◽  
Kai C. Rains ◽  
Edgar J. Guerrón-Orejuela ◽  
William J. Kleindl ◽  
Joni Downs ◽  
...  

We hypothesized that topographic features alone could be used to locate groundwater discharge, but only where diagnostic topographic signatures could first be identified through limited field observations and geologic data. We built a geodatabase from geologic and topographic data, with the geologic data covering only ~40% of the study area and topographic data derived from airborne LiDAR covering the entire study area. We identified two types of groundwater discharge: shallow hillslope groundwater discharge, commonly manifested as diffuse seeps, and aquifer-outcrop groundwater discharge, commonly manifested as springs. We developed multistep manual procedures that allowed us to accurately predict the locations of both types of groundwater discharge in 93% of cases, though only where geologic data were available. However, field verification suggested that both types of groundwater discharge could be identified by specific combinations of topographic variables alone. We then applied maximum entropy modeling, a machine learning technique, to predict the prevalence of both types of groundwater discharge using six topographic variables: profile curvature range (permutation importance 43.2%), followed by distance to flowlines, elevation, topographic roughness index, flow-weighted slope, and planform curvature (permutation importance 20.8%, 18.5%, 15.2%, 1.8%, and 0.5%, respectively). The AUC values for the model were 0.95 for training data and 0.91 for testing data, indicating outstanding model performance.
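For readers who want to reproduce this style of evaluation, the sketch below approximates the workflow with scikit-learn: logistic regression stands in for MaxEnt (the two are closely related), and the variable names, synthetic data, and split are placeholders rather than the authors' pipeline.

```python
# Sketch: MaxEnt-style presence/background modeling approximated with
# scikit-learn, reporting permutation importance and train/test AUC.
# Hypothetical data: X holds the six topographic variables per cell,
# y = 1 for observed discharge points, 0 for background points.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score

features = ["profile_curvature_range", "dist_to_flowlines", "elevation",
            "roughness_index", "flow_weighted_slope", "planform_curvature"]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))          # placeholder for real raster samples
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC on training and held-out data (the paper reports 0.95 / 0.91)
print("train AUC:", roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))
print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Permutation importance, analogous to MaxEnt's permutation metric
imp = permutation_importance(model, X_te, y_te, n_repeats=30,
                             random_state=0, scoring="roc_auc")
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```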

2021 ◽  
Author(s):  
Haibin Di ◽  
Chakib Kada Kloucha ◽  
Cen Li ◽  
Aria Abubakar ◽  
Zhun Li ◽  
...  

Abstract: Delineating seismic stratigraphic features and depositional facies is important for successful reservoir mapping and identification in the subsurface. Robust seismic stratigraphy interpretation is confronted with two major challenges. The first is to automate the process as much as possible, particularly given the increasing size of seismic data and the complexity of target stratigraphies; the second is to efficiently incorporate available structures into stratigraphy model building. Machine learning, particularly the convolutional neural network (CNN), has been introduced to assist seismic stratigraphy interpretation through supervised learning. However, the small amount of available expert labels greatly restricts the performance of such supervised CNNs. Moreover, most of the existing CNN implementations are based on amplitude alone and fail to use necessary structural information, such as faults, to constrain the machine learning. To resolve both challenges, this paper presents a semi-supervised learning workflow for fault-guided seismic stratigraphy interpretation, which consists of two components. The first component is seismic feature engineering (SFE), which aims at learning the provided seismic and fault data through an unsupervised convolutional autoencoder (CAE); the second is stratigraphy model building (SMB), which aims at building an optimal mapping function between the features extracted by the SFE CAE and the target stratigraphic labels provided by an experienced interpreter, through a supervised CNN. The two components are connected by embedding the encoder of the SFE CAE into the SMB CNN, which forces the SMB learning to rely on features common to the entire study area rather than those present only in the limited training data; correspondingly, the risk of overfitting is greatly reduced. More innovatively, the fault constraint is introduced by customizing the SMB CNN with two output branches, one to match the target stratigraphies and the other to reconstruct the input fault, so that the faults continue to contribute throughout the SMB learning process. The performance of this fault-guided seismic stratigraphy interpretation is validated by application to a real seismic dataset: the machine prediction not only matches the manual interpretation accurately but also clearly illustrates the depositional process in the study area.
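A minimal sketch of the two-branch idea follows, assuming a Keras implementation; the layer widths, input shape, and loss weights are illustrative, not the authors' architecture.

```python
# Sketch: a shared encoder feeding two output branches, one predicting
# stratigraphy classes and one reconstructing the input fault image.
# All sizes are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

n_classes = 6  # hypothetical number of stratigraphic units

inp = layers.Input(shape=(128, 128, 2))   # channels: amplitude + fault
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
latent = layers.MaxPooling2D()(x)          # shared (CAE-pretrained) encoder

# Branch 1: stratigraphy prediction
s = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(latent)
s = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(s)
strat = layers.Conv2D(n_classes, 1, activation="softmax", name="strat")(s)

# Branch 2: fault reconstruction, keeping faults in the training loss
f = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(latent)
f = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(f)
fault = layers.Conv2D(1, 1, activation="sigmoid", name="fault")(f)

model = Model(inp, [strat, fault])
model.compile(optimizer="adam",
              loss={"strat": "sparse_categorical_crossentropy",
                    "fault": "binary_crossentropy"},
              loss_weights={"strat": 1.0, "fault": 0.5})
```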


Author(s):  
Brett J. Borghetti ◽  
Joseph J. Giametta ◽  
Christina F. Rusnock

Objective: We aimed to predict operator workload from neurological data, using statistical learning methods to fit neurological-to-state-assessment models. Background: Adaptive systems require real-time mental workload assessment to perform dynamic task allocations or operator augmentation as workload issues arise. Neuroergonomic measures have great potential for informing adaptive systems, and we combine these measures with models of task demand as well as information about critical events and performance to clarify the inherent ambiguity of interpretation. Method: We use machine learning algorithms on electroencephalogram (EEG) input to infer operator workload based upon Improved Performance Research Integration Tool (IMPRINT) workload model estimates. Results: Cross-participant models predict the workload of other participants, statistically distinguishing 62% of workload changes. Machine learning models trained from Monte Carlo resampled workload profiles can be used in place of deterministic workload profiles for cross-participant modeling without a significant decrease in model performance, suggesting that stochastic models can be used when limited training data are available. Conclusion: We employed a novel temporary scaffold of simulation-generated workload profile truth data during the model-fitting process. A continuous workload profile serves as the target for training our statistical machine learning models. Once trained, the workload profile scaffolding is removed and the trained model is used directly on neurophysiological data in future operator state assessments. Application: These modeling techniques demonstrate how neuroergonomic methods can be used to develop operator state assessments, which can be employed in adaptive systems.
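As an illustration of the cross-participant evaluation idea, the sketch below runs leave-one-participant-out classification on hypothetical EEG band-power features; the feature layout, labels, and participant count are invented placeholders, not the paper's data.

```python
# Sketch: leave-one-participant-out workload classification from EEG
# band-power features. All data here are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_windows, n_features = 1200, 32               # e.g., 4 bands x 8 channels
X = rng.normal(size=(n_windows, n_features))   # band power per EEG window
y = rng.integers(0, 2, size=n_windows)         # low/high workload per window
groups = rng.integers(0, 10, size=n_windows)   # participant ID per window

# Cross-participant evaluation: train on 9 participants, test on the 10th
logo = LeaveOneGroupOut()
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, groups=groups, cv=logo)
print("per-participant accuracy:", scores.round(2))
```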


Science ◽  
2021 ◽  
Vol 371 (6535) ◽  
pp. eabe8628
Author(s):  
Marshall Burke ◽  
Anne Driscoll ◽  
David B. Lobell ◽  
Stefano Ermon

Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and improving resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of model performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight research directions for the field.


2021 ◽  
Author(s):  
Prageeth R. Wijewardhane ◽  
Krupal P. Jethava ◽  
Jonathan A Fine ◽  
Gaurav Chopra

The Programmed Cell Death Protein 1/Programmed Death-Ligand 1 (PD-1/PD-L1) interaction is an immune checkpoint utilized by cancer cells to enhance immune suppression. There is a pressing need for small-molecule drugs that are fast acting, cost-effective, and readily bioavailable compared to antibodies. Unfortunately, synthesizing and validating large libraries of small molecules to inhibit the PD-1/PD-L1 interaction in a blind manner is both time-consuming and expensive. To improve this drug discovery pipeline, we have developed a machine learning methodology trained on patent data to identify, synthesize, and validate PD-1/PD-L1 small molecule inhibitors. Our model incorporates two features: docking scores representing the energy of binding (E) as a global feature, and sub-graph features of molecular topology, extracted through a graph neural network (GNN), as local features. This interaction energy-based Graph Neural Network (EGNN) model outperforms traditional machine learning methods and a simple GNN, with an F1 score of 0.9524 and a Cohen's kappa of 0.8861 on the held-out test set, suggesting that the topology of the small molecule, the structural interaction in the binding pocket, and the chemical diversity of the training data are all important considerations for enhancing model performance. A bootstrapped EGNN model was used to select compounds, predicted to have high and low potency against the PD-1/PD-L1 interaction, for synthesis and experimental validation. The potent inhibitor, (4-((3-(2,3-dihydrobenzo[b][1,4]dioxin-6-yl)-2-methylbenzyl)oxy)-2,6-dimethoxybenzyl)-D-serine, is a hybrid of two known bioactive scaffolds, with an IC50 of 339.9 nM that compares favorably with the known bioactive compound. We conclude that our bootstrapped EGNN model will be useful for identifying target-specific, high-potency molecules designed by scaffold hopping, a well-known medicinal chemistry technique.
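A minimal sketch of the energy-plus-graph fusion follows, assuming PyTorch with torch_geometric; the layer widths and pooling choice are illustrative, not the published EGNN.

```python
# Sketch of the EGNN idea: fuse a GNN embedding of the molecular graph
# (local features) with a docking score (global energy feature E).
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class EGNNSketch(nn.Module):
    def __init__(self, n_atom_feats: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(n_atom_feats, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        # +1 input for the docking score appended after pooling
        self.head = nn.Sequential(nn.Linear(hidden + 1, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, edge_index, batch, docking_score):
        h = torch.relu(self.conv1(x, edge_index))   # message passing over bonds
        h = torch.relu(self.conv2(h, edge_index))
        g = global_mean_pool(h, batch)              # per-molecule embedding
        g = torch.cat([g, docking_score.view(-1, 1)], dim=1)
        return torch.sigmoid(self.head(g))          # P(active)
```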


Author(s):  
Prageeth R. Wijewardhane ◽  
Krupal P. Jethava ◽  
Jonathan A Fine ◽  
Gaurav Chopra

The Programmed Cell Death Protein 1/Programmed Death-Ligand 1 (PD-1/PD-L1) interaction is an immune checkpoint utilized by cancer cells to enhance immune suppression. There is a pressing need for small-molecule drugs that are fast acting, inexpensive, and readily bioavailable compared to antibodies. Unfortunately, synthesizing and validating large libraries of small molecules to inhibit the PD-1/PD-L1 interaction in a blind manner is both time-consuming and expensive. To improve this drug discovery pipeline, we have developed a machine learning methodology trained on patent data to identify, synthesize, and validate PD-1/PD-L1 small molecule inhibitors. Our model incorporates two features: docking scores representing the energy of binding (E) as a global feature, and sub-graph features extracted through a graph neural network (GNN) as local features. This Energy-Graph Neural Network (EGNN) model outperforms traditional machine learning methods as well as a simple GNN, with an average F1 score of 0.997 (±0.004), suggesting that the topology of the small molecule, the structural interaction in the binding pocket, and the chemical diversity of the training data are all important considerations for enhancing model performance. A bootstrapped EGNN model was used to select compounds, predicted to have high and low potency against the PD-1/PD-L1 interaction, for synthesis and experimental validation. The new potent inhibitor, (4-((3-(2,3-dihydrobenzo[b][1,4]dioxin-6-yl)-2-methylbenzyl)oxy)-2,6-dimethoxybenzyl)-D-serine, is a hybrid of two known bioactive scaffolds and has an IC50 of 339.9 nM that compares favorably with the known bioactive compound. We conclude that our EGNN model can identify active molecules designed by scaffold hopping, a well-known medicinal chemistry technique, and will be useful for identifying new potent small molecule inhibitors for specific targets.


2021 ◽  
Author(s):  
Jian Hu ◽  
Haochang Shou

Objective: The daily use of wearable sensor devices to track real-time movements during wake and sleep has created opportunities for automatic sleep quantification from such data. Existing algorithms for classifying sleep stages often require large training datasets and multiple input signals, including heart rate and respiratory data. We aimed to examine the capability of classifying sleep stages using features derived from accelerometry alone, with the aid of advanced recurrent neural networks. Materials and Methods: We analyzed a publicly available dataset with accelerometry data in 5-s epochs and polysomnography assessments. We developed long short-term memory (LSTM) models that take the 3-axis accelerations, angles, and temperatures from concurrent and historic observation windows to predict wake, REM, and non-REM sleep. Leave-one-subject-out experiments were conducted to compare the model's performance with conventional non-sequential machine learning models using metrics such as multiclass training and testing accuracy, weighted precision, F1 score, and area under the curve (AUC). Results: Our sequential analysis framework outperforms traditional non-sequential models on all evaluation metrics. We achieved an average of 65% and a maximum of 81% validation accuracy for classifying three sleep labels, even with a relatively small training sample of clinical visitors. Two additional derived variables, local variability and range, were shown to strongly improve model performance. Discussion: The results indicate that it is crucial to account for deep temporal dependency and to assess the local variability of the features. A post-hoc analysis of individual model performance across subjects' demographic characteristics also suggests the need to include pathological samples in the training data in order to develop robust machine learning models capable of capturing both normal and anomalous sleep patterns in the population.
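A minimal sketch of such an LSTM classifier, assuming Keras, is shown below; the window length and per-epoch features are hypothetical stand-ins for the paper's configuration.

```python
# Sketch: an LSTM classifying wake / non-REM / REM from sequences of
# per-epoch accelerometer features. Shapes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models

seq_len, n_feats = 120, 5   # e.g., 120 epochs of (x, y, z, angle, temperature)

model = models.Sequential([
    layers.Input(shape=(seq_len, n_feats)),
    layers.LSTM(64, return_sequences=True),   # temporal dependency across epochs
    layers.LSTM(32),
    layers.Dense(3, activation="softmax"),    # wake, non-REM, REM
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```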


Diagnostics ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. 2288
Author(s):  
Kaixiang Su ◽  
Jiao Wu ◽  
Dongxiao Gu ◽  
Shanlin Yang ◽  
Shuyuan Deng ◽  
...  

Increasingly, machine learning methods have been applied to aid in diagnosis, with good results. However, some complex models can confuse physicians because they are difficult to understand, and data differences across diagnostic tasks and institutions can cause fluctuations in model performance. To address this challenge, we combined the Deep Ensemble Model (DEM) and the Tree-structured Parzen Estimator (TPE) and propose an adaptive deep ensemble learning method (TPE-DEM) for dynamically evolving diagnostic task scenarios. Unlike previous research that focuses on achieving better performance with a fixed-structure model, our proposed model uses TPE to efficiently aggregate simple models that are more easily understood by physicians and require less training data. In addition, our proposed model can choose the optimal number of layers and the type and number of base learners to achieve the best performance in different diagnostic task scenarios, based on the data distribution and characteristics of the current diagnostic task. We tested our model on one dataset constructed with a partner hospital and on five UCI public datasets with different characteristics and volumes, covering various diagnostic tasks. Our performance evaluation shows that the proposed model outperforms other baseline models across datasets. Our study provides a novel approach to building simple and understandable machine learning models for tasks with variable datasets and feature sets, and the findings have important implications for the application of machine learning models in computer-aided diagnosis.
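The sketch below illustrates the general idea of TPE-driven model configuration using the hyperopt library on a public dataset; the search space and base learners are placeholders, not the TPE-DEM implementation.

```python
# Sketch: a Tree-structured Parzen Estimator (via hyperopt) choosing the
# base-learner type and size, in the spirit of TPE-driven ensembles.
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in diagnostic dataset

space = {
    "learner": hp.choice("learner", ["rf", "gb"]),          # base learner type
    "n_estimators": hp.choice("n_estimators", [50, 100, 200]),
    "max_depth": hp.choice("max_depth", [2, 4, 8]),
}

def objective(cfg):
    cls = (RandomForestClassifier if cfg["learner"] == "rf"
           else GradientBoostingClassifier)
    model = cls(n_estimators=cfg["n_estimators"], max_depth=cfg["max_depth"])
    # TPE minimizes, so return 1 - cross-validated accuracy
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=30, trials=Trials())
print("best configuration (choice indices):", best)
```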


2021 ◽  
Author(s):  
Le-Yi Wang ◽  
Zhe-Min Tan

Tropical cyclones (TCs) are among the most destructive weather phenomena on Earth, and their structure and intensity are strongly modulated by the TC boundary layer. Mesoscale models used for TC research and prediction must rely on boundary layer parameterizations because of their low spatial resolution. These boundary layer schemes were mostly developed from field experiments under moderate wind speeds; they often underestimate the influence of shear-driven rolls and turbulence, so significant biases are unavoidable when they are applied under extreme conditions such as the TC boundary layer. In this study, a novel machine learning model, a one-dimensional convolutional neural network (1D-CNN), is proposed to tackle the TC boundary layer parameterization dilemma. The 1D-CNN uses about half as many learnable parameters as a fully connected neural network while achieving a steady improvement over it. TC large eddy simulation outputs, whose calculated turbulent fluxes show strong skewness, are used as 1D-CNN training data; this skewness is alleviated in order to reduce model bias. In an offline TC boundary layer test, the proposed 1D-CNN performs significantly better than popular schemes now used in TC simulations. Model performance across different scales is essential to the final application. We find that high-resolution data contain the information of low-resolution data, but not vice versa, and that performance on the extreme data is key to performance on the whole dataset. Training the model on the highest-resolution non-extreme data plus extreme data of different resolutions secures robust performance across scales.
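A minimal sketch of a 1D-CNN for this kind of vertical-profile-to-flux mapping, assuming Keras, is given below; the level count and input variables are hypothetical.

```python
# Sketch: a 1D-CNN mapping vertical profiles of boundary-layer state
# variables to a turbulent flux profile. Sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models

n_levels, n_vars = 60, 4   # e.g., u, v, theta, q on 60 vertical levels

model = models.Sequential([
    layers.Input(shape=(n_levels, n_vars)),
    layers.Conv1D(32, 5, padding="same", activation="relu"),  # local vertical structure
    layers.Conv1D(32, 5, padding="same", activation="relu"),
    layers.Conv1D(1, 1),                  # predicted flux at each level
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```

Compared with a fully connected network on the flattened profile, the shared convolution kernels roughly halve the learnable parameters, which is the design point the abstract emphasizes.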


Water ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 2951 ◽  
Author(s):  
Assefa M. Melesse ◽  
Khabat Khosravi ◽  
John P. Tiefenbacher ◽  
Salim Heddam ◽  
Sungwon Kim ◽  
...  

Electrical conductivity (EC), one of the most widely used indices for water quality assessment, was predicted to estimate the salinity of the Babol-Rood River, the greatest source of irrigation water in northern Iran. This study uses two individual algorithms, M5 Prime (M5P) and random forest (RF), and eight novel hybrid algorithms (bagging-M5P, bagging-RF, random subspace (RS)-M5P, RS-RF, random committee (RC)-M5P, RC-RF, additive regression (AR)-M5P, and AR-RF) to predict EC. Thirty-six years of observations collected by the Mazandaran Regional Water Authority were randomly divided into two sets: 70%, from 1980 to 2008, was used for model training, and 30%, from 2009 to 2016, was used as testing data to validate the models. Several water quality variables, namely pH, HCO3−, Cl−, SO42−, Na+, Mg2+, Ca2+, river discharge (Q), and total dissolved solids (TDS), were the modeling inputs. Using EC and the correlation coefficients (CC) of the water quality variables, a set of nine input combinations was established. TDS, the most effective input variable, had the highest correlation with EC (r = 0.91) and was also determined to be the most important input variable among the input combinations. All models were trained and each model's predictive power was evaluated with the testing data. Several quantitative criteria and visual comparisons were used to evaluate modeling capability. The results indicate that, in most cases, hybrid algorithms enhance the predictive power of the individual algorithms. The AR algorithm enhanced both M5P and RF predictions more than bagging, RS, and RC did. M5P performed better than RF. Further, AR-M5P outperformed all other algorithms (R2 = 0.995, RMSE = 8.90 μS/cm, MAE = 6.20 μS/cm, NSE = 0.994, and PBIAS = −0.042). Hybridization of machine learning methods significantly improved model performance in capturing maximum salinity values, which is essential for water resource management.
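To make the AR hybrid concrete, the sketch below implements additive regression (stage-wise residual fitting) with random forest base learners in scikit-learn; the shrinkage, stage count, and synthetic data are illustrative, and RF stands in for M5P, which is a Weka model.

```python
# Sketch: additive regression (stage-wise residual fitting) with random
# forest base learners, mirroring the AR-RF hybrid in spirit.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def additive_regression_fit(X, y, n_stages=5, shrinkage=0.5):
    """Fit a sequence of RF models, each on the previous stage's residuals."""
    models, residual = [], y.astype(float).copy()
    for _ in range(n_stages):
        m = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, residual)
        residual = residual - shrinkage * m.predict(X)
        models.append(m)
    return models

def additive_regression_predict(models, X, shrinkage=0.5):
    # Final prediction is the shrunken sum of all stage predictions
    return shrinkage * np.sum([m.predict(X) for m in models], axis=0)

# Toy usage with synthetic inputs shaped like the paper's nine variables
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
models = additive_regression_fit(X, y)
pred = additive_regression_predict(models, X)
print("train RMSE:", np.sqrt(np.mean((y - pred) ** 2)))
```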


2020 ◽  
Vol 50 (1) ◽  
pp. 49-69
Author(s):  
James E. Saal ◽  
Anton O. Oliynyk ◽  
Bryce Meredig

The rapidly growing interest in machine learning (ML) for materials discovery has resulted in a large body of published work. However, only a small fraction of these publications includes confirmation of ML predictions, either via experiment or via physics-based simulations. In this review, we first identify the core components common to materials informatics discovery pipelines, such as training data, choice of ML algorithm, and measurement of model performance. Then we discuss some prominent examples of validated ML-driven materials discovery across a wide variety of materials classes, with special attention to methodological considerations and advances. Across these case studies, we identify several common themes, such as the use of domain knowledge to inform ML models.

