Efficient surrogate modeling methods for large-scale Earth system models based on machine learning techniques


2019 ◽  
Vol 12 (5) ◽  
pp. 1791-1807 ◽  
Author(s):  
Dan Lu ◽  
Daniel Ricciuto

Abstract. Improving predictive understanding of Earth system variability and change requires data–model integration. Efficient data–model integration for complex models requires surrogate modeling to reduce model evaluation time. However, building a surrogate of a large-scale Earth system model (ESM) with many output variables is computationally intensive because it involves a large number of expensive ESM simulations. In this effort, we propose an efficient surrogate method capable of using a few ESM runs to build an accurate and fast-to-evaluate surrogate system of model outputs over large spatial and temporal domains. We first use singular value decomposition to reduce the output dimensions and then use Bayesian optimization techniques to generate an accurate neural network surrogate model based on limited ESM simulation samples. Our machine-learning-based surrogate methods can build and evaluate a large surrogate system of many variables quickly. Thus, whenever the quantities of interest change, such as a different objective function, a new site, or a longer simulation time, we can simply extract the information of interest from the surrogate system without rebuilding new surrogates, which significantly reduces computational efforts. We apply the proposed method to a regional ecosystem model to approximate the relationship between eight model parameters and 42 660 carbon flux outputs. Results indicate that using only 20 model simulations, we can build an accurate surrogate system of the 42 660 variables, wherein the consistency between the surrogate prediction and actual model simulation is 0.93 and the mean squared error is 0.02. This highly accurate and fast-to-evaluate surrogate system will greatly enhance the computational efficiency of data–model integration to improve predictions and advance our understanding of the Earth system.
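The workflow above reduces well to a short sketch. The following is a minimal illustration, not the authors' code: it assumes a precomputed ensemble of parameter samples and model outputs, reduces the output dimension with a truncated SVD, and fits a neural network to the retained coefficients (the paper tunes the network with Bayesian optimization; here the architecture is fixed for brevity, and all data are stand-ins).

```python
# Minimal sketch of the SVD + neural-network surrogate workflow, assuming a
# precomputed ensemble: X holds parameter samples (n_runs x 8) and Y the
# corresponding model outputs (n_runs x 42660). All names and data are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 8))           # 20 ESM runs, 8 parameters (stand-in data)
Y = rng.normal(size=(20, 42660))        # carbon flux outputs per run (stand-in data)

# 1. Reduce output dimension with a truncated SVD of the centered outputs.
Y_mean = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - Y_mean, full_matrices=False)
k = 5                                   # retained singular components
Z = U[:, :k] * s[:k]                    # low-dimensional output coefficients

# 2. Fit a neural network mapping parameters to the reduced coefficients.
#    (The paper tunes the architecture with Bayesian optimization; fixed here.)
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
nn.fit(X, Z)

# 3. Evaluate the surrogate: predict coefficients, then project back to the
#    full 42660-dimensional output space.
def surrogate(x_new):
    z = nn.predict(np.atleast_2d(x_new))
    return z @ Vt[:k] + Y_mean

y_pred = surrogate(rng.uniform(size=8))
print(y_pred.shape)                     # (1, 42660)
```

Because the surrogate returns the full output field, any quantity of interest can be extracted from `surrogate(x_new)` without refitting, which is the property the abstract highlights.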


Data analytics is a rapidly evolving arena in today's technological landscape. Big data, IoT, and machine learning are multidisciplinary fields that pave the way for large-scale data analytics. Data, collected from various sources through online activity, is the basic ingredient in all types of analytical tasks. Data divulged in these day-to-day activities contains personal information about individuals, and these sensitive details may be disclosed when data is shared with data analysts or researchers for future analysis. In order to respect the privacy of the individuals involved, the data must be protected to avoid any intentional harm. Differential privacy is a framework that allows controlled machine learning practices for quality analytics. With differential privacy, the outcome of any analytical task is essentially unaffected by the presence or absence of a single individual or a small group of individuals. However, privacy protection diminishes the usefulness of data for analysis. Privacy-preserving analytics therefore requires algorithmic techniques that can handle privacy, data quality, and efficiency simultaneously. Since one of these cannot be improved without degrading the others, a solution that balances the three attributes is considered acceptable. This paper proposes different optimization techniques for shallow and deep learners: an evolutionary approach for shallow learning and a Bayesian method for private deep learning. The results show that the Bayesian-optimized private deep learning model gives a quantifiable trade-off between privacy, utility, and performance.
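As a concrete illustration of the guarantee described above, the sketch below applies the classic Laplace mechanism to a counting query. The dataset, query, and epsilon values are illustrative assumptions, not the paper's setup, but the privacy–utility trade-off it demonstrates is the one the abstract balances.

```python
# A minimal illustration of differential privacy: the Laplace mechanism adds
# noise scaled to a query's sensitivity, so one individual's presence barely
# shifts the released statistic. Smaller epsilon = stronger privacy, noisier answer.
import numpy as np

def laplace_mechanism(data, query, sensitivity, epsilon, rng):
    """Release query(data) with Laplace noise of scale (sensitivity / epsilon)."""
    return query(data) + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1000)          # stand-in dataset

# Counting query: sensitivity 1 (adding or removing one person changes it by <= 1).
count_over_60 = lambda d: np.sum(d > 60)

for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(ages, count_over_60, sensitivity=1.0,
                              epsilon=epsilon, rng=rng)
    print(f"eps={epsilon:5.1f}  true={count_over_60(ages)}  private={noisy:.1f}")
```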


2021 ◽  
Author(s):  
Murilo Cruz Lopes ◽  
Marília de Matos Amorim ◽  
Valéria Souza Freitas ◽  
Rodrigo Tripodi Calumby

There is a high incidence of oral cancer in Brazil, with 150,000 new cases estimated for 2020-2022. In most cases the disease is diagnosed at an advanced stage and is related to many risk factors. The Registro Hospitalar de Câncer (RHC), managed by the Instituto Nacional de Câncer (INCA), is a nationwide database that integrates cancer registries from several hospitals in Brazil. The RHC is mostly an administrative database but also includes clinical, socioeconomic, and hospitalization data for each patient with a cancer diagnosis in the country. For these patients, prognostication is always a difficult task that demands multi-dimensional analysis. Exploiting large-scale data with machine intelligence approaches therefore emerges as a promising tool for computer-aided decision support in death-risk estimation. Given the importance of this context, some works have reported high prognostication effectiveness, but with extremely limited data collections, weak validation protocols, or only simple robustness analyses. Hence, this work describes a detailed workflow and experimental analysis for oral cancer patient survival prediction based on careful data curation and strict validation procedures. By exploiting multiple machine learning algorithms and optimization techniques, the proposed approach achieved promising survival prediction effectiveness, with F1 and AUC-ROC over 0.78 and 0.80, respectively. Moreover, a detailed analysis has shown that the minimization of different types of prediction errors was achieved by different models, which highlights the importance of rigour in this kind of validation.
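The validation pattern the study stresses can be sketched as follows, with synthetic stand-in data in place of the curated RHC records: several classifiers are compared under stratified cross-validation and scored on both F1 and AUC-ROC, the two metrics reported above. Feature choices and models here are assumptions for illustration only.

```python
# Hypothetical sketch of strict, cross-validated model comparison for
# death-risk prediction. Data is synthetic; RHC features are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.7, 0.3],
                           random_state=0)   # stand-in for curated patient records

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("random_forest", RandomForestClassifier(random_state=0))]:
    scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "roc_auc"])
    print(f"{name:14s} F1={scores['test_f1'].mean():.3f} "
          f"AUC={scores['test_roc_auc'].mean():.3f}")
```

Reporting both metrics per model matches the study's observation that different models minimize different types of prediction error.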


2017 ◽  
Author(s):  
Md Nazmus Sadat ◽  
Xiaoqian Jiang ◽  
Md Momin Al Aziz ◽  
Shuang Wang ◽  
Noman Mohammed

BACKGROUND Machine learning is an effective data-driven tool that is being widely used to extract valuable patterns and insights from data. Specifically, predictive machine learning models are very important in health care for clinical data analysis. The machine learning algorithms that generate predictive models often require pooling data from different sources to discover statistical patterns or correlations among different attributes of the input data. The primary challenge is to fulfill one major objective: preserving the privacy of individuals while discovering knowledge from data. OBJECTIVE Our objective was to develop a hybrid cryptographic framework for performing regression analysis over distributed data in a secure and efficient way. METHODS Existing secure computation schemes are not suitable for processing the large-scale data that are used in cutting-edge machine learning applications. We designed, developed, and evaluated a hybrid cryptographic framework that can securely perform regression analysis, a fundamental machine learning algorithm, using somewhat homomorphic encryption and a newly introduced secure hardware component, Intel Software Guard Extensions (Intel SGX), to ensure both privacy and efficiency at the same time. RESULTS Experimental results demonstrate that our proposed method provides a better trade-off between security and efficiency than solely secure hardware-based methods. Moreover, there is no approximation error: the computed model parameters are identical to the plaintext results. CONCLUSIONS To the best of our knowledge, this kind of secure computation model, a hybrid cryptographic framework leveraging both somewhat homomorphic encryption and Intel SGX, has not been proposed or evaluated to date. Our proposed framework ensures data security and computational efficiency at the same time.
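The secure protocol itself is beyond a short sketch, but the "no approximation error" claim has a simple plaintext reference point: the exact closed-form regression solution that such a hybrid scheme reproduces. The sketch below shows that plaintext computation with stand-in data; the comment about accumulating the aggregates under encryption describes the general pattern of such schemes, not the paper's exact design.

```python
# Plaintext reference for exact regression: ordinary least squares via the
# normal equations, the closed form a non-iterative secure scheme can match
# exactly. A secure variant would compute X^T X and X^T y under encryption
# (or inside an enclave) and reveal only the final aggregates.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # pooled feature matrix (stand-in)
beta_true = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=500)

# Normal equations: beta = (X^T X)^{-1} X^T y
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)
print(np.round(beta_hat, 3))                  # recovers beta_true, no approximation
```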


2020 ◽  
Author(s):  
Jin Soo Lim ◽  
Jonathan Vandermause ◽  
Matthijs A. van Spronsen ◽  
Albert Musaelian ◽  
Christopher R. O’Connor ◽  
...  

Restructuring of interfaces plays a crucial role in materials science and heterogeneous catalysis. Bimetallic systems, in particular, often adopt compositions and morphologies at surfaces that differ markedly from the bulk. For the first time, we reveal a detailed atomistic picture of the long-timescale restructuring of Pd deposited on Ag, using microscopy, spectroscopy, and novel simulation methods. Encapsulation of Pd by Ag always precedes layer-by-layer dissolution of Pd, resulting in significant Ag migration out of the surface and extensive vacancy pits. These metastable structures are of vital catalytic importance, as Ag-encapsulated Pd remains much more accessible to reactants than bulk-dissolved Pd. The underlying mechanisms are uncovered by performing fast and large-scale machine-learning molecular dynamics, followed by our newly developed method for complete characterization of atomic surface restructuring events. Our approach is broadly applicable to other multimetallic systems of interest and enables the previously impractical mechanistic investigation of restructuring dynamics.
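As a generic illustration of why a machine-learned force field enables fast, large-scale molecular dynamics, the sketch below runs a standard velocity-Verlet integrator with a cheap surrogate force call in place of an expensive ab initio calculation. The harmonic "force field" and all parameters are stand-ins, not the authors' method.

```python
# Generic ML-MD illustration: velocity-Verlet integration where each step
# needs only one cheap surrogate force evaluation. The harmonic model below
# is a stand-in for a trained ML force field.
import numpy as np

def ml_forces(positions):
    """Stand-in for an ML force field: F = -dE/dx for E = |x|^2."""
    return -2.0 * positions

def velocity_verlet(positions, velocities, dt, n_steps, mass=1.0):
    forces = ml_forces(positions)
    for _ in range(n_steps):
        positions = positions + velocities * dt + 0.5 * (forces / mass) * dt**2
        new_forces = ml_forces(positions)      # one surrogate call per step
        velocities = velocities + 0.5 * (forces + new_forces) / mass * dt
        forces = new_forces
    return positions, velocities

rng = np.random.default_rng(0)
pos, vel = velocity_verlet(rng.normal(size=(64, 3)),
                           rng.normal(scale=0.1, size=(64, 3)),
                           dt=0.01, n_steps=1000)
print(pos.shape)  # (64, 3)
```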


2021 ◽  
Author(s):  
Norberto Sánchez-Cruz ◽  
Jose L. Medina-Franco

Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight epigenetic drugs approved for the treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. These data represent a large number of structure-activity relationships that have not been exploited thus far for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26,318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different designs, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated, showing mean precision values of up to 0.952 for the epigenetic target prediction task. Our results indicate that the models reported herein have considerable potential to identify small molecules with epigenetic activity. Therefore, our models were implemented as a freely accessible and easy-to-use web application.
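A minimal sketch of the fingerprint-based modeling described above, assuming illustrative SMILES strings and activity labels in place of the study's 26,318-compound dataset: Morgan fingerprints computed with RDKit feed a per-target activity classifier.

```python
# Hypothetical fingerprint-based activity model for a single epigenetic
# target. SMILES, labels, and the classifier choice are illustrative.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
labels = [0, 1, 1, 0]                      # toy active/inactive flags

def fingerprint(smi, n_bits=2048):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(fp)

X = np.array([fingerprint(s) for s in smiles])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.predict([fingerprint("CCCO")]))  # predicted activity for a new molecule
```

Target profiling in the study's sense amounts to repeating this per-target fit across all 55 targets and reporting the predicted activity profile for each query molecule.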

