Speeding up reactive transport simulations: statistical surrogates and caching of simulation results in lookup tables

Author(s):  
Marco De Lucia ◽  
Robert Engelmann ◽  
Michael Kühn ◽  
Alexander Lindemann ◽  
Max Lübke ◽  
...  

A successful strategy for speeding up coupled reactive transport simulations, at the price of an acceptable loss of accuracy, is to compute geochemistry, which represents the bottleneck of these simulations, through data-driven surrogates instead of 'full physics' equation-based models [1]. A surrogate is a multivariate regressor trained on a set of pre-calculated geochemical simulations, or potentially even at runtime during the coupled simulations. Many algorithms and implementations are available from the thriving machine learning community: tree-based regressors such as random forests or XGBoost, artificial neural networks, Gaussian processes, and support vector machines, to name a few. Given the 'black-box' nature of surrogates, however, they generally disregard physical constraints such as mass and charge balance, which are of paramount importance for coupled transport simulations. A runtime check of the balance errors in the surrogate outcomes is therefore necessary: predictions exceeding a given tolerance must be rejected and the full physics chemical simulations run instead. The practical speedup of this strategy is thus a tradeoff between careful training of the surrogate and runtime efficiency.

In this contribution we demonstrate that the use of surrogates can lead to a dramatic decrease in required computing time, with speedup factors on the order of 10, or even 100 in the most favorable cases. Large-scale simulations with some 10^6 grid elements thus become feasible on common workstations, without requiring computation on HPC clusters [2].

Furthermore, we showcase our implementation of Distributed Hash Tables (DHTs) that cache geochemical simulation results for reuse in subsequent time steps. The computational advantage here stems from the fact that query and retrieval from lookup tables is much faster than both full physics geochemical simulations and surrogate predictions. Another advantage of this algorithm is that virtually no loss of accuracy is introduced into the simulations. Enabling the caching of geochemical simulations through DHTs speeds up large-scale reactive transport simulations by up to a factor of four, even when computing on several hundred cores.

These algorithmic developments are demonstrated in comparison with published reactive transport benchmarks and on a real-life scenario of CO₂ storage.

[1] Jatnieks, J., De Lucia, M., Dransch, D., Sips, M. (2016): Data-driven surrogate model approach for improving the performance of reactive transport simulations. Energy Procedia 97, pp. 447-453. DOI: 10.1016/j.egypro.2016.10.047

[2] De Lucia, M., Kempka, T., Jatnieks, J., Kühn, M. (2017): Integrating surrogate models into subsurface simulation framework allows computation of complex reactive transport scenarios. Energy Procedia 125, pp. 580-587. DOI: 10.1016/j.egypro.2017.08.200
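As a concrete illustration of the surrogate-plus-cache scheme described in the abstract above, here is a minimal, hedged sketch in Python: a random-forest surrogate with a runtime mass-balance check that falls back to the full-physics solver, plus a plain dictionary standing in for the Distributed Hash Table. The solver stub, tolerance, and quantization scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def full_physics_chemistry(x):
    """Dummy stand-in for an equation-based geochemical solver:
    redistributes mass among species while conserving the total."""
    w = np.exp(-x)
    return x.sum() * w / w.sum()

# Surrogate trained on pre-calculated geochemical simulations.
X_train = np.random.rand(1000, 4)        # e.g. concentrations of 4 species
y_train = np.array([full_physics_chemistry(x) for x in X_train])
surrogate = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

cache = {}   # stands in for the Distributed Hash Table

def react(x, mass_tol=1e-3):
    key = tuple(np.round(x, 6))              # quantize to make cache hits likely
    if key in cache:                         # lookup table: the fastest path
        return cache[key]
    y = surrogate.predict(x.reshape(1, -1))[0]
    if abs(y.sum() - x.sum()) > mass_tol:    # runtime mass-balance check
        y = full_physics_chemistry(x)        # reject prediction, fall back
    cache[key] = y
    return y

print(react(np.array([0.2, 0.4, 0.1, 0.3])))
```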

2021 ◽  
Vol 8 ◽  
Author(s):  
Radu Mariescu-Istodor ◽  
Pasi Fränti

The scalability of traveling salesperson problem (TSP) algorithms for handling large-scale problem instances has long been an open problem. We arranged a so-called Santa Claus challenge and invited people to submit their algorithms to solve a TSP instance of more than one million nodes given only one hour of computing time. In this article, we analyze the results and show which design choices are decisive in providing the best solution to the problem under the given constraints. There were three valid submissions, all based on local search, including k-opt up to k = 5. The most important design choice turned out to be the localization of the operator using a neighborhood graph. The divide-and-merge strategy suffers a 2% loss of quality. However, via parallelization, the result can be obtained in less than 2 min, which can make a key difference in real-life applications.
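The decisive design choice named above, localizing the improvement operator with a neighborhood graph, can be sketched as follows for plain 2-opt (the submissions used k-opt up to k = 5); the k-d tree, candidate count, and tour handling are illustrative assumptions.

```python
# 2-opt restricted to a k-nearest-neighbour candidate graph (sketch).
import numpy as np
from scipy.spatial import cKDTree

def local_two_opt(points, tour, k=8):
    tree = cKDTree(points)                    # source of the neighborhood graph
    pos = np.empty(len(tour), dtype=int)      # city -> position in tour
    pos[tour] = np.arange(len(tour))
    improved = True
    while improved:
        improved = False
        for i, a in enumerate(tour):
            b = tour[(i + 1) % len(tour)]
            # candidate edges only among the k nearest neighbours of a
            for c in tree.query(points[a], k=k + 1)[1][1:]:
                j = pos[c]
                d = tour[(j + 1) % len(tour)]
                if len({int(a), int(b), int(c), int(d)}) < 4:
                    continue
                old = np.linalg.norm(points[a] - points[b]) + np.linalg.norm(points[c] - points[d])
                new = np.linalg.norm(points[a] - points[c]) + np.linalg.norm(points[b] - points[d])
                if new < old - 1e-9:          # improving move: reverse the segment
                    lo, hi = (i, j) if i < j else (j, i)
                    tour[lo + 1:hi + 1] = tour[lo + 1:hi + 1][::-1]
                    pos[tour] = np.arange(len(tour))
                    improved = True
                    break
    return tour

pts = np.random.rand(500, 2)
print(local_two_opt(pts, np.arange(500))[:10])
```

Restricting candidates to the neighborhood graph turns each scan from O(n) to O(k) per city, which is what makes million-node instances tractable within an hour.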


2015 ◽  
Vol 11 (1) ◽  
pp. 66-83 ◽  
Author(s):  
Yong Hu ◽  
Xiangzhou Zhang ◽  
Bin Feng ◽  
Kang Xie ◽  
Mei Liu

Among all investors in the Chinese stock market, more than 95% are non-professional individuals. These individual investors are in great need of mobile apps that can provide professional, handy trading analysis and decision support anywhere. However, financial data is challenging to analyze because of its large-scale, non-linear, and noisy characteristics in a varying stock environment. This paper develops a Mobile Data-Driven Stock Trading System (iTrade), a mobile app system based on a client-server architecture and various data mining techniques. iTrade is characterized by 1) a data-driven intelligent learning model, which can provide deeper insight than empirical technical analysis, 2) a concept drift adaptation process, which facilitates model adaptation to market structure changes, and 3) a rigorous benchmark analysis, including the buy-and-hold strategy and the strategies of three world-famous master investors (e.g., Warren E. Buffett). Technologies used in iTrade include the Least Absolute Shrinkage and Selection Operator (Lasso) algorithm, Support Vector Machine (SVM), and risk-adjusted portfolio optimization. An application case of iTrade is presented, based on a seven-year (2005-2011) back-test. Evaluation results indicate that iTrade can achieve a much higher cumulative return than the benchmark (Shanghai Composite Index). To the best of our knowledge, this is the first study and mobile app system that emphasizes and investigates the concept drift phenomenon in the stock market, as well as the performance comparison between a data-driven intelligent model and the strategies of master investors.
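A minimal sketch of the modelling core named above, Lasso-based feature selection feeding an SVM, using scikit-learn; the synthetic indicator data and all parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))        # e.g. 40 technical indicators per day
y = (X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.05)),   # Lasso shrinks noisy indicators away
    SVC(kernel="rbf", C=1.0),             # SVM issues the up/down signal
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```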


Author(s):  
Yunsheng Song ◽  
Fangyi Li ◽  
Jianyu Liu ◽  
Juao Zhang

Support vector regression is an important algorithm in machine learning, widely used in real life for its good performance in tasks such as house price forecasting, disease prediction, and weather forecasting. However, it cannot efficiently process large-scale data, because its training process has high time complexity. Data partitioning, an important approach to the large-scale learning problem, has mainly focused on classification tasks: classifiers are trained over the subsets produced by the partition and then combined into a final classifier. Meanwhile, most existing methods rarely study the influence of the data partition on regressor performance, so it is difficult to preserve generalization ability. To address this problem, we derive an estimate of the difference in the objective function before and after the data partition. Mini-Batch K-Means clustering is adopted to largely reduce this difference, and an improved algorithm is proposed. The proposed algorithm consists of a training stage and a prediction stage. In the training stage, it uses Mini-Batch K-Means clustering to divide the input space into disjoint sub-regions of equal sample size, then trains a regressor on each sub-region using the support vector regression algorithm. In the prediction stage, only the regressor of the sub-region containing an unlabeled instance provides its predicted label. Experimental results on real datasets illustrate that the proposed algorithm attains generalization ability similar to the original algorithm while requiring less execution time than other acceleration algorithms.
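The training and prediction stages described above can be sketched directly in scikit-learn; note that plain Mini-Batch K-Means does not enforce the equal-sample-size sub-regions of the paper, a simplification of this sketch.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVR

class ClusteredSVR:
    """One SVR per sub-region found by Mini-Batch K-Means (sketch)."""
    def __init__(self, n_clusters=8):
        self.km = MiniBatchKMeans(n_clusters=n_clusters)
        self.models = {}

    def fit(self, X, y):
        labels = self.km.fit_predict(X)        # partition the input space
        for c in np.unique(labels):
            m = labels == c
            self.models[c] = SVR(kernel="rbf").fit(X[m], y[m])
        return self

    def predict(self, X):
        labels = self.km.predict(X)            # route each point to its sub-region
        out = np.empty(len(X))
        for c, model in self.models.items():
            m = labels == c
            if m.any():
                out[m] = model.predict(X[m])
        return out

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
print(ClusteredSVR().fit(X, y).predict(X[:5]))
```

Because SVR training is super-linear in the number of samples, fitting eight regressors on n/8 samples each costs substantially less than fitting one regressor on all n samples.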


Author(s):  
Dian Puspita Hapsari ◽  
Imam Utoyo ◽  
Santi Wulan Purnami

Data classification faces several problems, one of which is that large amounts of data increase computing time. SVM is a reliable classifier for linear or non-linear data, but for large-scale data it runs into computational time constraints. The fractional gradient descent method is an unconstrained optimization algorithm for training support vector machine classifiers, whose underlying problem is convex. Compared to the classic integer-order model, a model built with fractional calculus has a significant advantage in accelerating computing time. This research investigates the current state of this new optimization method based on fractional derivatives and how it can be implemented in the classifier algorithm. The SVM classifier with fractional gradient descent optimization reaches its convergence point approximately 50 iterations sooner than SVM-SGD. The model update steps are smaller in the fractional case because the multiplier value is less than 1, i.e., a fraction. The SVM-Fractional SGD algorithm proves to be an effective method for rainfall forecast decisions.
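The abstract does not give the exact fractional formulation, so the following is only a hedged sketch: stochastic sub-gradient training of a linear soft-margin SVM in which each update is damped by a Caputo-style factor |w|^(1-α)/Γ(2-α), reducing to ordinary SGD at α = 1. All names and parameter values are assumptions.

```python
import numpy as np
from math import gamma

def svm_fractional_sgd(X, y, alpha=0.8, lr=0.01, lam=0.01, epochs=50):
    """Linear soft-margin SVM via fractional-order SGD; y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            # fractional multiplier is below 1 for small |w|, damping the update
            frac = (np.abs(w) + 1e-8) ** (1.0 - alpha) / gamma(2.0 - alpha)
            w -= lr * frac * grad
    return w

X = np.random.randn(300, 5)
y = np.sign(X[:, 0] + X[:, 1])
print(svm_fractional_sgd(X, y))
```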


Author(s):  
Sawan Kumar ◽  
Varsha Sreenivasan ◽  
Partha Talukdar ◽  
Franco Pestilli ◽  
Devarajan Sridharan

Diffusion imaging and tractography enable mapping structural connections in the human brain in vivo. Linear Fascicle Evaluation (LiFE) is a state-of-the-art approach for pruning spurious connections in the estimated structural connectome by optimizing its fit to the measured diffusion data. Yet LiFE imposes heavy demands on computing time, precluding its use in analyses of large connectome databases. Here, we introduce a GPU-based implementation of LiFE that achieves 50-100x speedups over conventional CPU-based implementations for connectome sizes of up to several million fibers. Briefly, the algorithm accelerates generalized matrix multiplications on a compressed tensor through efficient GPU kernels, while ensuring favorable memory access patterns. Leveraging these speedups, we advance LiFE's algorithm by imposing a regularization constraint on the estimated fiber weights during connectome pruning. Our regularized, accelerated LiFE algorithm (“ReAl-LiFE”) estimates sparser connectomes that also fit the underlying diffusion signal more accurately. We demonstrate the utility of our approach by classifying pathological signatures of structural connectivity in patients with Alzheimer's disease (AD). We estimated million-fiber whole-brain connectomes, followed by pruning with ReAl-LiFE, for 90 individuals (45 AD patients and 45 healthy controls). Linear classifiers based on support vector machines achieved over 80% accuracy in distinguishing AD patients from healthy controls based on their ReAl-LiFE-pruned structural connectomes alone. Moreover, classification based on the ReAl-LiFE-pruned connectome outperformed both the unpruned and the LiFE-pruned connectome in terms of accuracy. We propose our GPU-accelerated approach as a widely relevant tool for non-negative least-squares optimization across many domains.
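At its core, the pruning step solves a regularized non-negative least-squares problem; below is a minimal CPU sketch by projected gradient descent. The L1 penalty and step-size rule are illustrative assumptions, and the actual implementation runs compressed-tensor operations in GPU kernels.

```python
import numpy as np

def reg_nnls(A, b, lam=0.1, iters=500):
    """min_{w >= 0}  0.5 * ||A @ w - b||^2 + lam * sum(w)   (L1 on w >= 0)."""
    w = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz bound of gradient
    for _ in range(iters):
        grad = A.T @ (A @ w - b) + lam          # gradient of the penalized loss
        w = np.maximum(0.0, w - step * grad)    # project onto the feasible set
    return w

A = np.random.rand(200, 50)                     # stands in for the LiFE tensor
b = np.random.rand(200)                         # measured diffusion signal
w = reg_nnls(A, b)
print("surviving fiber weights:", np.count_nonzero(w))
```

The penalty drives many fiber weights exactly to zero at the non-negativity boundary, which is what yields the sparser pruned connectomes.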


2002 ◽  
Vol 14 (5) ◽  
pp. 1105-1114 ◽  
Author(s):  
Ronan Collobert ◽  
Samy Bengio ◽  
Yoshua Bengio

Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.
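A hedged sketch of the parallel mixture idea: each expert SVM is trained on a small random subset, and the gater network of the article is replaced here by simple probability averaging, a deliberate simplification; subset sizes and counts are assumptions.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

def fit_expert(X, y, idx):
    return SVC(kernel="rbf", probability=True).fit(X[idx], y[idx])

def mixture_of_svms(X, y, n_experts=10, subset=500, n_jobs=-1):
    rng = np.random.default_rng(0)
    subsets = [rng.choice(len(X), size=subset, replace=False)
               for _ in range(n_experts)]
    # each expert sees only a small subset, so every training job stays cheap
    return Parallel(n_jobs=n_jobs)(
        delayed(fit_expert)(X, y, idx) for idx in subsets)

def mixture_predict(experts, X):
    # combine experts by averaging class probabilities (no gater here)
    return np.mean([e.predict_proba(X) for e in experts], axis=0).argmax(axis=1)

X = np.random.randn(5000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
experts = mixture_of_svms(X, y)
print("accuracy:", (mixture_predict(experts, X) == y).mean())
```

Since SVM training is at least quadratic in the number of examples, ten SVMs on 500 examples each cost far less than one SVM on all 5,000.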


Energies ◽  
2018 ◽  
Vol 12 (1) ◽  
pp. 109 ◽  
Author(s):  
Jingjing Tu ◽  
Yonghai Xu ◽  
Zhongdong Yin

For the integration of distributed generation such as large-scale wind and photovoltaic power, the characteristics of the distribution network are fundamentally changed. The intermittence, variability, and uncertainty of wind and photovoltaic generation make the adjustment of the network peak load and the smooth control of power the key issues for a distribution network accepting various types of distributed power. This paper uses data-driven thinking to describe the uncertainty of wind and photovoltaic output and introduces it into the power flow calculation of a distribution network with multiple classes of distributed generation (DG), improving the handling of the data so as to better predict DG output. To address the network stability and operational control complexity caused by DG access, the kernel extreme learning machine (KELM) algorithm is used to simplify the model and improve speed and accuracy. By training and testing the KELM model, various DG configuration schemes that satisfy the minimum network loss and the constraints are derived, and a voltage stability evaluation index is introduced to assess the results. A general recommendation for DG configuration is obtained: DG is best connected at points of lower network voltage or at the end of the network. With appropriately configured capacity, it can reduce network losses and improve network voltage stability and power supply quality. Finally, the IEEE 33- and 69-bus radial distribution systems are used for simulation, and the results are compared with existing particle swarm optimization (PSO), genetic algorithm (GA), and support vector machine (SVM) approaches. The feasibility and effectiveness of the proposed model and method are verified.
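The KELM model used above has a closed-form solution for its output weights, β = (K + I/C)^(-1) T, so training reduces to a single linear solve; the RBF kernel and hyperparameter values in this sketch are illustrative assumptions.

```python
import numpy as np

class KELM:
    """Kernel Extreme Learning Machine with an RBF kernel (sketch)."""
    def __init__(self, C=100.0, gamma=0.5):
        self.C, self.gamma = C, gamma

    def _kernel(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-self.gamma * d2)

    def fit(self, X, T):
        self.X = X
        K = self._kernel(X, X)
        # closed form: beta = (K + I/C)^(-1) T  -- one linear solve, no iteration
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, T)
        return self

    def predict(self, Xnew):
        return self._kernel(Xnew, self.X) @ self.beta

X = np.random.rand(200, 3)
T = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(200)
print(KELM().fit(X, T).predict(X[:5]))
```

The absence of iterative weight tuning is what gives KELM its speed advantage over PSO- or GA-based alternatives in this setting.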


2021 ◽  
Author(s):  
Sapna Yadav ◽  
Satish Chand

The rapid growth of deep learning has made convolutional neural networks deeper and more complex in pursuit of higher accuracy. But many day-to-day recognition tasks must be performed on platforms with limited computation. One such application is food image recognition, which is very helpful for individual health monitoring, dietary assessment, nutrition analysis, etc. This task needs a small convolutional-neural-network-based engine to compute fast and accurately. MobileNetV2, being simple and small, can be incorporated easily into small end devices. In this paper, MobileNetV2 and a support vector machine are used to classify food images. Simulation results show that features extracted from the Conv_1, out_relu, and Conv_1_bn layers of MobileNetV2 and classified using a support vector machine achieve classification accuracies of 84.0%, 87.27%, and 83.60%, respectively. Because of its fewer parameters, smaller size, and shorter training time, MobileNetV2 is an excellent choice for real-life recognition tasks.
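A minimal sketch of this pipeline: features taken from one of the named layers of a pre-trained Keras MobileNetV2 and fed to an SVM. The layer names Conv_1, Conv_1_bn, and out_relu do exist in the Keras implementation; the pooling step and the omitted data loading are assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False)
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=tf.keras.layers.GlobalAveragePooling2D()(
        base.get_layer("out_relu").output))   # try "Conv_1" or "Conv_1_bn" too

def features(images):
    """images: float array of shape (n, 224, 224, 3) with values in [0, 255]."""
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images)
    return extractor.predict(x, verbose=0)

# With food images X_img / X_test and labels y / y_test (loading omitted):
# clf = SVC(kernel="linear").fit(features(X_img), y)
# print(clf.score(features(X_test), y_test))
```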


2020 ◽  
Vol 2 ◽  
Author(s):  
Carlos André Muñoz López ◽  
Satyajeet Bhonsale ◽  
Kristin Peeters ◽  
Jan F. M. Van Impe

Processing data that originates from uneven, multi-phase batches is a challenge in data-driven modeling. Training predictive and monitoring models requires the data to be in the right shape to be informative; only then can a model learn meaningful features that describe the deterministic variability of the process. The presence of multiple phases in the data, which display different correlation patterns and have an uneven duration from batch to batch, reduces the performance of data-driven modeling methods significantly. Phase identification and alignment is therefore a critical step and can lead to an unsuccessful modeling exercise if not applied correctly. In this paper, a novel approach is proposed to perform unsupervised phase identification and alignment based on the correlation patterns found in the data. Phase identification is performed via manifold learning using t-Distributed Stochastic Neighbor Embedding (t-SNE), a state-of-the-art machine learning algorithm for non-linear dimensionality reduction. Applying t-SNE to a reduced cross-correlation matrix of every batch with respect to a reference batch results in data clustering in the embedded space. Models based on support vector machines (SVMs) are trained to 1) reproduce the manifold learned via t-SNE, and 2) determine the membership of data points to a process phase. Compared to previously proposed clustering approaches for phase identification, this is an unsupervised, non-linear method. The perplexity parameter of the t-SNE algorithm can be interpreted as the estimated duration of the shortest phase in the process. The advantages of the proposed method are demonstrated through its application to an in-silico benchmark case study and to real industrial data from two unit operations in the large-scale production of an active pharmaceutical ingredient (API). The efficacy and robustness of the method are evidenced by the successful phase identification and alignment obtained for these three distinct processes, displaying smooth, sudden, and repetitive phase changes. Additionally, the low complexity of the method makes its online implementation feasible.
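A hedged, much-simplified sketch of the two-step scheme: windowed cross-correlation features per time point, a t-SNE embedding in which phases separate, cluster labels taken as phases, and an SVM fitted to reproduce the membership for new batches. The feature construction, window length, and all parameter values are assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy batch record: 300 time points x 6 sensors, correlation shifts mid-batch
batch = np.concatenate([rng.normal(size=(150, 6)),
                        rng.normal(size=(150, 6)) @ np.diag([3, 1, 1, 1, 1, 1])])

win = 30   # window length; plays a role akin to the perplexity guideline above
feats = np.array([np.corrcoef(batch[t:t + win].T)[np.triu_indices(6, 1)]
                  for t in range(len(batch) - win)])   # correlations per window

emb = TSNE(n_components=2, perplexity=win).fit_transform(feats)
phase = KMeans(n_clusters=2, n_init=10).fit_predict(emb)  # clusters = phases
svm = SVC().fit(feats, phase)   # reusable phase-membership model for new data
print(np.bincount(phase))
```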

