On the data synergy effect of large-sample multi-physics catchment modeling with machine learning

Watersheds in the world are often perceived as being unique from each other, requiring customized study for each basin. Models uniquely built for each watershed, in general, cannot be leveraged for other watersheds. It is also a customary practice in hydrology and related geoscientific disciplines to divide the whole domain into multiple regimes and study each region separately, in an approach sometimes called regionalization or stratification. However, in the era of big-data machine learning, models can learn across regions and identify commonalities and differences. In this presentation, we first show that machine learning can derive highly functional continental-scale models for streamflow, evapotranspiration, and water quality variables. Next, through two hydrologic examples (soil moisture and streamflow), we argue that unification can often significantly outperform stratification, and systematically examine an effect we call data synergy, where the results of the DL models improved when data were pooled together from characteristically different regions and variables. In fact, the performance of the DL models benefited from some diversity in training data even with similar data quantity. However, allowing heterogeneous training data makes eligible much larger training datasets, which is an inherent advantage of DL. We also share our recent developments in advancing hydrologic deep learning and machine learning driven parameterization.

Download Full-text

Cohort-based Federated Learning Services for Industrial Collaboration on the Edge

10.36227/techrxiv.14852361.v1 ◽

2021 ◽

Author(s):

Thomas Hiessl

Keyword(s):

Machine Learning ◽

Software Architecture ◽

Model Performance ◽

Training Data ◽

Skewed Data ◽

Similar Data ◽

Privacy Concerns ◽

Industrial Manufacturing ◽

Share Data ◽

Collaborative Training

<div>Machine Learning (ML) is increasingly applied in industrial manufacturing, but often performance is limited due to insufficient training data. While ML models can benefit from collaboration, due to privacy concerns, individual manufacturers cannot share data directly. Federated Learning (FL) enables collaborative training of ML models without revealing raw data. However, current FL approaches fail to take the characteristics and requirements of industrial clients into account. In this work, we propose a FL system consisting of a process description and a software architecture to provide \acrfull{flaas} to industrial clients deployed to edge devices. Our approach deals with skewed data by organizing clients into cohorts with similar data distributions. We evaluated the system on two industrial datasets. We show how the FLaaS approach provides FL to client processes by considering their requests submitted to the Industrial Federated Learning (IFL) Services API. Experiments on both industrial datasets and different FL algorithms show that the proposed cohort building can increase the ML model performance notably.</div>

Download Full-text

Cohort-based Federated Learning Services for Industrial Collaboration on the Edge

10.36227/techrxiv.14852361 ◽

2021 ◽

Author(s):

Thomas Hiessl

Keyword(s):

Machine Learning ◽

Software Architecture ◽

Model Performance ◽

Training Data ◽

Skewed Data ◽

Similar Data ◽

Privacy Concerns ◽

Industrial Manufacturing ◽

Share Data ◽

Collaborative Training

Download Full-text

A trajectory-based loss function to learn missing terms in bifurcating dynamical systems

Scientific Reports ◽

10.1038/s41598-021-99609-x ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Rahel Vortmeyer-Kley ◽

Pascal Nieters ◽

Gordon Pipa

Keyword(s):

Machine Learning ◽

Dynamical Systems ◽

State Space ◽

Loss Function ◽

Dynamical Behavior ◽

Period Doubling ◽

Training Data ◽

Dynamical System Theory ◽

Recent Developments ◽

Period Doubling Bifurcations

AbstractMissing terms in dynamical systems are a challenging problem for modeling. Recent developments in the combination of machine learning and dynamical system theory open possibilities for a solution. We show how physics-informed differential equations and machine learning—combined in the Universal Differential Equation (UDE) framework by Rackauckas et al.—can be modified to discover missing terms in systems that undergo sudden fundamental changes in their dynamical behavior called bifurcations. With this we enable the application of the UDE approach to a wider class of problems which are common in many real world applications. The choice of the loss function, which compares the training data trajectory in state space and the current estimated solution trajectory of the UDE to optimize the solution, plays a crucial role within this approach. The Mean Square Error as loss function contains the risk of a reconstruction which completely misses the dynamical behavior of the training data. By contrast, our suggested trajectory-based loss function which optimizes two largely independent components, the length and angle of state space vectors of the training data, performs reliable well in examples of systems from neuroscience, chemistry and biology showing Saddle-Node, Pitchfork, Hopf and Period-doubling bifurcations.

Download Full-text

Training Data Distribution Significantly Impacts the Estimation of Tissue Microstructure with Machine Learning

10.1101/2021.04.13.439659 ◽

2021 ◽

Author(s):

Noemi G. Gyori ◽

Marco Palombo ◽

Christopher A. Clark ◽

Hui Zhang ◽

Daniel C. Alexander

Keyword(s):

Machine Learning ◽

High Precision ◽

Model Fitting ◽

Training Data ◽

Supervised Machine Learning ◽

Parameter Estimates ◽

Traditional Model ◽

Similar Data ◽

Training Set ◽

Accuracy And Precision

AbstractPurposeSupervised machine learning (ML) provides a compelling alternative to traditional model fitting for parameter mapping in quantitative MRI. The aim of this work is to demonstrate and quantify the effect of different training strategies on the accuracy and precision of parameter estimates when supervised ML is used for fitting.MethodsWe fit a two-compartment biophysical model to diffusion measurements from in-vivo human brain, as well as simulated diffusion data, using both traditional model fitting and supervised ML. For supervised ML, we train several artificial neural networks, as well as random forest regressors, on different distributions of ground truth parameters. We compare the accuracy and precision of parameter estimates obtained from the different estimation approaches using synthetic test data.ResultsWhen the distribution of parameter combinations in the training set matches those observed in similar data sets, we observe high precision, but inaccurate estimates for atypical parameter combinations. In contrast, when training data is sampled uniformly from the entire plausible parameter space, estimates tend to be more accurate for atypical parameter combinations but may have lower precision for typical parameter combinations.ConclusionThis work highlights the need to consider the choice of training data when deploying supervised ML for estimating microstructural metrics, as performance depends strongly on the training-set distribution. We show that high precision obtained using ML may mask strong bias, and visual assessment of the parameter maps is not sufficient for evaluating the quality of the estimates.

Download Full-text

Scalable Approach to High Coverages on Oxides via Iterative Training of a Machine-Learning Algorithm

10.26434/chemrxiv.10288514.v1 ◽

2019 ◽

Author(s):

Andrew Medford ◽

Shengchun Yang ◽

Fuzhu Liu

Keyword(s):

Machine Learning ◽

Chemical Potential ◽

Learning Algorithm ◽

Absolute Error ◽

Low Energy ◽

Training Data ◽

High Coverage ◽

Metal Compounds ◽

Adsorption Energies ◽

The Stability

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.

Download Full-text

Optimization of Diabetes Training DATA using Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i2.283286 ◽

2018 ◽

Vol 6 (2) ◽

pp. 283-286

Author(s):

M. Samba Siva Rao ◽

◽

M.Yaswanth . ◽

K. Raghavendra Swamy ◽

◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data

Download Full-text

Virtual Screening Meets Deep Learning

Current Computer - Aided Drug Design ◽

10.2174/1573409914666181018141602 ◽

2018 ◽

Vol 15 (1) ◽

pp. 6-28 ◽

Cited By ~ 6

Author(s):

Javier Pérez-Sianes ◽

Horacio Pérez-Sánchez ◽

Fernando Díaz

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Deep Learning ◽

Virtual Screening ◽

Great Increase ◽

New Drugs ◽

Learning Approach ◽

Screening Strategies ◽

Computer Aided ◽

Recent Developments

Background: Automated compound testing is currently the de facto standard method for drug screening, but it has not brought the great increase in the number of new drugs that was expected. Computer- aided compounds search, known as Virtual Screening, has shown the benefits to this field as a complement or even alternative to the robotic drug discovery. There are different methods and approaches to address this problem and most of them are often included in one of the main screening strategies. Machine learning, however, has established itself as a virtual screening methodology in its own right and it may grow in popularity with the new trends on artificial intelligence. Objective: This paper will attempt to provide a comprehensive and structured review that collects the most important proposals made so far in this area of research. Particular attention is given to some recent developments carried out in the machine learning field: the deep learning approach, which is pointed out as a future key player in the virtual screening landscape.

Download Full-text

Comparative Analysis of Machine Learning Techniques Using Predictive Modeling

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904164539 ◽

2020 ◽

Vol 13 ◽

Author(s):

Ritu Khandelwal ◽

Hemlata Goyal ◽

Rajveer Singh Shekhawat

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Data Science ◽

Training Data ◽

Machine Learning Techniques ◽

Future Trends ◽

Data Set ◽

Learning Stage ◽

Learning Techniques ◽

Different Types

Introduction: Machine learning is an intelligent technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal focuses on findings to get valuable insights on available data. The large part of Indian Cinema is Bollywood which is a multi-million dollar industry. This paper attempts to predict whether the upcoming Bollywood Movie would be Blockbuster, Superhit, Hit, Average or Flop. For this Machine Learning techniques (classification and prediction) will be applied. To make classifier or prediction model first step is the learning stage in which we need to give the training data set to train the model by applying some technique or algorithm and after that different rules are generated which helps to make a model and predict future trends in different types of organizations. Methods: All the techniques related to classification and Prediction such as Support Vector Machine(SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, Adaboost, and KNN will be applied and try to find out efficient and effective results. All these functionalities can be applied with GUI Based workflows available with various categories such as data, Visualize, Model, and Evaluate. Result: To make classifier or prediction model first step is learning stage in which we need to give the training data set to train the model by applying some technique or algorithm and after that different rules are generated which helps to make a model and predict future trends in different types of organizations Conclusion: This paper focuses on Comparative Analysis that would be performed based on different parameters such as Accuracy, Confusion Matrix to identify the best possible model for predicting the movie Success. By using Advertisement Propaganda, they can plan for the best time to release the movie according to the predicted success rate to gain higher benefits. Discussion: Data Mining is the process of discovering different patterns from large data sets and from that various relationships are also discovered to solve various problems that come in business and helps to predict the forthcoming trends. This Prediction can help Production Houses for Advertisement Propaganda and also they can plan their costs and by assuring these factors they can make the movie more profitable.

Download Full-text

Multilayer Soil Moisture Mapping at a Regional Scale from Multisource Data via a Machine Learning Method

Remote Sensing ◽

10.3390/rs11030284 ◽

2019 ◽

Vol 11 (3) ◽

pp. 284 ◽

Cited By ~ 1

Author(s):

Linglin Zeng ◽

Shun Hu ◽

Daxiang Xiang ◽

Xiang Zhang ◽

Deren Li ◽

...

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Regional Scale ◽

Remotely Sensed ◽

Temporal Variations ◽

Training Data ◽

Estimation Accuracy ◽

Learning Approaches ◽

Remotely Sensed Data ◽

Deep Soil

Soil moisture mapping at a regional scale is commonplace since these data are required in many applications, such as hydrological and agricultural analyses. The use of remotely sensed data for the estimation of deep soil moisture at a regional scale has received far less emphasis. The objective of this study was to map the 500-m, 8-day average and daily soil moisture at different soil depths in Oklahoma from remotely sensed and ground-measured data using the random forest (RF) method, which is one of the machine-learning approaches. In order to investigate the estimation accuracy of the RF method at both a spatial and a temporal scale, two independent soil moisture estimation experiments were conducted using data from 2010 to 2014: a year-to-year experiment (with a root mean square error (RMSE) ranging from 0.038 to 0.050 m3/m3) and a station-to-station experiment (with an RMSE ranging from 0.044 to 0.057 m3/m3). Then, the data requirements, importance factors, and spatial and temporal variations in estimation accuracy were discussed based on the results using the training data selected by iterated random sampling. The highly accurate estimations of both the surface and the deep soil moisture for the study area reveal the potential of RF methods when mapping soil moisture at a regional scale, especially when considering the high heterogeneity of land-cover types and topography in the study area.

Download Full-text

Building Damage Detection from Post-Event Aerial Imagery Using Single Shot Multibox Detector

Applied Sciences ◽

10.3390/app9061128 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1128 ◽

Cited By ~ 12

Author(s):

Yundong Li ◽

Wei Hu ◽

Han Dong ◽

Xueyan Zhang

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Hurricane Sandy ◽

Training Data ◽

Aerial Images ◽

Detection Methods ◽

Single Shot ◽

Data Set ◽

Augmentation Strategies ◽

Post Disaster

Using aerial cameras, satellite remote sensing or unmanned aerial vehicles (UAV) equipped with cameras can facilitate search and rescue tasks after disasters. The traditional manual interpretation of huge aerial images is inefficient and could be replaced by machine learning-based methods combined with image processing techniques. Given the development of machine learning, researchers find that convolutional neural networks can effectively extract features from images. Some target detection methods based on deep learning, such as the single-shot multibox detector (SSD) algorithm, can achieve better results than traditional methods. However, the impressive performance of machine learning-based methods results from the numerous labeled samples. Given the complexity of post-disaster scenarios, obtaining many samples in the aftermath of disasters is difficult. To address this issue, a damaged building assessment method using SSD with pretraining and data augmentation is proposed in the current study and highlights the following aspects. (1) Objects can be detected and classified into undamaged buildings, damaged buildings, and ruins. (2) A convolution auto-encoder (CAE) that consists of VGG16 is constructed and trained using unlabeled post-disaster images. As a transfer learning strategy, the weights of the SSD model are initialized using the weights of the CAE counterpart. (3) Data augmentation strategies, such as image mirroring, rotation, Gaussian blur, and Gaussian noise processing, are utilized to augment the training data set. As a case study, aerial images of Hurricane Sandy in 2012 were maximized to validate the proposed method’s effectiveness. Experiments show that the pretraining strategy can improve of 10% in terms of overall accuracy compared with the SSD trained from scratch. These experiments also demonstrate that using data augmentation strategies can improve mAP and mF1 by 72% and 20%, respectively. Finally, the experiment is further verified by another dataset of Hurricane Irma, and it is concluded that the paper method is feasible.

Download Full-text