Ensemble of surrogates and cross-validation for rapid and accurate predictions using small data sets

AbstractIn engineering design, surrogate models are often used instead of costly computer simulations. Typically, a single surrogate model is selected based on the previous experience. We observe, based on an analysis of the published literature, that fitting an ensemble of surrogates (EoS) based on cross-validation errors is more accurate but requires more computational time. In this paper, we propose a method to build an EoS that is both accurate and less computationally expensive. In the proposed method, the EoS is a weighted average surrogate of response surface models, kriging, and radial basis functions based on overall cross-validation error. We demonstrate that created EoS is accurate than individual surrogates even when fewer data points are used, so computationally efficient with relatively insensitive predictions. We demonstrate the use of an EoS using hot rod rolling as an example. Finally, we include a rule-based template which can be used for other problems with similar requirements, for example, the computational time, required accuracy, and the size of the data.

Download Full-text

Using Multiple Surrogates for Minimization of the RMS Error in Meta-Modeling

Volume 1: 34th Design Automation Conference, Parts A and B ◽

10.1115/detc2008-49240 ◽

2008 ◽

Cited By ~ 4

Author(s):

Felipe A. C. Viana ◽

Raphael T. Haftka

Keyword(s):

Cross Validation ◽

Weighted Average ◽

Large Set ◽

High Dimensions ◽

Engineering Problems ◽

Cross Validation Error ◽

Data Points ◽

Meta Modeling ◽

Rms Error ◽

Rms Errors

Surrogate models are commonly used to replace expensive simulations of engineering problems. Frequently, a single surrogate is chosen based on past experience. Previous work has shown that fitting multiple surrogates and picking one based on cross-validation errors (PRESS in particular) is a good strategy, and that cross validation errors may also be used to create a weighted surrogate. In this paper, we discuss whether to use the best PRESS solution or a weighted surrogate when a single surrogate is needed. We propose the minimization of the integrated square error as a way to compute the weights of the weighted average surrogate. We find that it pays to generate a large set of different surrogates and then use PRESS as a criterion for selection. We find that the cross validation error vectors provide an excellent estimate of the RMS errors when the number of data points is high. Hence the use of cross validation errors for choosing a surrogate and for calculating the weights of weighted surrogates becomes more attractive in high dimensions. However, it appears that the potential gains from using weighted surrogates diminish substantially in high dimensions.

Download Full-text

Estimating RNA dynamics using one time point for one sample in a single-pulse metabolic experiment

10.1101/2020.05.01.071779 ◽

2020 ◽

Author(s):

Micha Hersch ◽

Adriano Biasini ◽

Ana Claudia Marques ◽

Sven Bergmann

Keyword(s):

Single Pulse ◽

Single Sample ◽

Computational Time ◽

Published Data ◽

Data Sets ◽

Metabolic Labeling ◽

Turnover Rates ◽

Computationally Efficient ◽

Degradation Rates ◽

Wide Scale

AbstractOver the past decade, experimental procedures such as metabolic labeling for determining RNA turnover rates at the transcriptome-wide scale have been widely adopted. Several computational methods to estimate RNA processing and degradation rates from such experiments have been suggested, but they all require several RNA sequencing samples. Here we present a method that can estimate RNA synthesis, processing and degradation rates from a single sample. To this end, we use the Zeisel model and take advantage of its analytical solution, reducing the problem to solving a univariate non-linear equation on a bounded domain. This makes our method computationally efficient, while enabling inference of rates that correlate well with previously published data sets. Using our approach on a single sample, we were able to reproduce and extend the observation that dynamic biological processes such as transcription or chromatin modifications tend to involve genes with higher metabolic rates, while stable processes such as basic metabolism involve genes with lower rates.In addition to saving experimental work and computational time, having a sample-based rate estimation has several advantages. It does not require an error-prone normalization across samples and enables the use of replicates to estimate uncertainty and perform quality control. Finally the method and theoretical results described here are general enough to be useful in other settings such as nucleotide conversion methods.

Download Full-text

Parameter Estimation in Models of Cell Survival Using Scaled Time

Volume 1B: Extremity; Fluid Mechanics; Gait; Growth, Remodeling, and Repair; Heart Valves; Injury Biomechanics; Mechanotransduction and Sub-Cellular Biophysics; MultiScale Biotransport; Muscle, Tendon and Ligament; Musculoskeletal Devices; Multiscale Mechanics; Thermal Medicine; Ocular Biomechanics; Pediatric Hemodynamics; Pericellular Phenomena; Tissue Mechanics; Biotransport Design and Devices; Spine; Stent Device Hemodynamics; Vascular Solid Mechanics; Student Paper and Design Competitions ◽

10.1115/sbc2013-14810 ◽

2013 ◽

Author(s):

Neil T. Wright

Keyword(s):

Cell Survival ◽

Treatment Time ◽

Treatment Temperature ◽

Separate Experiment ◽

Small Data ◽

Data Sets ◽

Data Points ◽

Isothermal Analysis ◽

Time Required ◽

Few Data

Many individual samples are needed to measure cell survival following heating at multiple temperatures and multiple heating durations. For example, if eight time points are considered for each of seven treatment temperatures with three replicates at each condition, then 168 separate samples are needed. In addition, physical considerations may limit the number of points that can be measured, especially as treatment temperature increases and the heating duration decreases. For a reasonable sample size, there may be a limit to the treatment temperature as the time required to heat the culture to the target temperature becomes comparable to the treatment time. Then, using an isothermal analysis of the data introduces error and the temperature must be considered time varying, requiring estimates of the very parameters being sought. Conversely, for long treatment times, it may be difficult to insure that the temperature remains constant and that the temperature is the only modified experimental condition in the culture medium. These challenges typically lead to relatively small data sets. Furthermore, treating each temperature as a separate experiment leads to challenging statistical analysis of the data, as the few data points lead to difficulty in finding the confidence intervals of the parameters in a given model.

Download Full-text

An Efficient and Accurate Algorithm for Computing Grid-Averaged Solar Fluxes for Horizontally Inhomogeneous Clouds

Journal of the Atmospheric Sciences ◽

10.1175/jas-d-20-0167.1 ◽

2020 ◽

Author(s):

Zhonghai Jin ◽

Andrew Lacis

Keyword(s):

Optical Depth ◽

Computational Time ◽

Asymmetry Factor ◽

Computationally Efficient ◽

Transfer Scheme ◽

Single Plane ◽

Cloud Optical Depth ◽

Plane Parallel ◽

Special Cases ◽

Time Required

AbstractA computationally efficient method is presented to account for the horizontal cloud inhomogeneity by using a radiatively equivalent plane parallel homogeneous (PPH) cloud. The algorithm can accurately match the calculations of the reference (rPPH) independent column approximation (ICA) results, but use only the same computational time required for a single plane parallel computation. The effective optical depth of this synthetic sPPH cloud is derived by exactly matching the direct transmission to that of the inhomogeneous ICA cloud. The effective scattering asymmetry factor is found from a pre-calculated albedo inverse look-up-table that is allowed to vary over the range from -1.0 to 1.0. In the special cases of conservative scattering and total absorption, the synthetic method is exactly equivalent to the ICA, with only a small bias (about 0.2% in flux) relative to ICA due to imperfect interpolation in using the look-up tables. In principlel, the ICA albedo can be approximated accurately regardless of cloud inhomogeneity. For a more complete comparison, the broadband shortwave albedo and transmission calculated from the synthetic sPPH cloud and averaged over all incident directions, have the RMS biases of 0.26% and 0.76%, respectively, for inhomogeneous clouds over a wide variation of particle size. The advantages of the synthetic PPH method are that (1) it is not required that all the cloud subcolumns have uniform microphysical characteristic, (2) it is applicable to any 1D radiative transfer scheme, and (3) it can handle arbitrary cloud optical depth distributions and an arbitrary number of cloud subcolumns with uniform computational efficiency.

Download Full-text

New Workflow for QSAR Model Development from Small Data Sets: Small Dataset Curator and Small Dataset Modeler. Integration of Data Curation, Exhaustive Double Cross-Validation, and a Set of Optimal Model Selection Techniques

Journal of Chemical Information and Modeling ◽

10.1021/acs.jcim.9b00476 ◽

2019 ◽

Vol 59 (10) ◽

pp. 4070-4076 ◽

Cited By ~ 8

Author(s):

Pravin Ambure ◽

Agnieszka Gajewicz-Skretna ◽

M. Natalia D. S. Cordeiro ◽

Kunal Roy

Keyword(s):

Cross Validation ◽

Model Development ◽

Qsar Model ◽

Data Curation ◽

Small Data ◽

Data Sets ◽

Optimal Model ◽

Small Data Sets ◽

Small Dataset ◽

Double Cross

Download Full-text

Random Node Pair Sampling-Based Estimation of Average Path Lengths in Networks

International Journal of Operations Research and Information Systems ◽

10.4018/ijoris.2018070102 ◽

2018 ◽

Vol 9 (3) ◽

pp. 27-51

Author(s):

Luis E. Castro ◽

Nazrul I. Shaikh

Keyword(s):

Path Length ◽

Network Size ◽

Computational Time ◽

Computationally Efficient ◽

Node Pair ◽

Random Node ◽

Speed Up ◽

Sampling Algorithms ◽

Path Lengths ◽

Time Required

This article describes how the average path length (APL) of a network is an important metric that provides insights on the interconnectivity in a network and how much time and effort would be required for search and navigation on that network. However, the estimation of APL is time-consuming as its computational complexity scales nonlinearly with the network size. In this article, the authors develop a computationally efficient random node pair sampling algorithm that enables the estimation of APL with a specified precision and confidence. The proposed sampling algorithms provide a speed-up factor ranging from 240-750 for networks with more than 100,000 nodes. The authors also find that the computational time required for estimation APL does not necessarily increase with the network size; it shows an inverted U shape instead.

Download Full-text

Computational performance and cross-validation error precision of five PLS algorithms using designed and real data sets

Journal of Chemometrics ◽

10.1002/cem.1309 ◽

2010 ◽

pp. n/a-n/a ◽

Cited By ~ 1

Author(s):

João Paulo A. Martins ◽

Reinaldo F. Teófilo ◽

Márcia M. C. Ferreira

Keyword(s):

Cross Validation ◽

Real Data ◽

Data Sets ◽

Computational Performance ◽

Cross Validation Error

Download Full-text

Overtraining in neural networks that interpret clinical data

Clinical Chemistry ◽

10.1093/clinchem/39.9.1998 ◽

1993 ◽

Vol 39 (9) ◽

pp. 1998-2004 ◽

Cited By ~ 27

Author(s):

M L Astion ◽

M H Wener ◽

R G Thomas ◽

G G Hunder ◽

D A Bloch

Keyword(s):

Neural Networks ◽

Pattern Recognition ◽

Clinical Data ◽

Cross Validation ◽

Data Sets ◽

Cross Validation Error ◽

Computer Based ◽

The Cross ◽

Training Error ◽

Validation Set

Abstract Backpropagation neural networks are a computer-based pattern-recognition method that has been applied to the interpretation of clinical data. Unlike rule-based pattern recognition, backpropagation networks learn by being repetitively trained with examples of the patterns to be differentiated. We describe and analyze the phenomenon of overtraining in backpropagation networks. Overtraining refers to the reduction in generalization ability that can occur as networks are trained. The clinical application we used was the differentiation of giant cell arteritis (GCA) from other forms of vasculitis (OTH) based on results for 807 patients (593 OTH, 214 GCA) and eight clinical predictor variables. The 807 cases were randomly assigned to either a training set with 404 cases or to a cross-validation set with the remaining 403 cases. The cross-validation set was used to monitor generalization during training. Results were obtained for eight networks, each derived from a different random assignment of the 807 cases. Training error monotonically decreased during training. In contrast, the cross-validation error usually reached a minimum early in training while the training error was still decreasing. Training beyond the minimum cross-validation error was associated with an increased cross-validation error. The shape of the cross-validation error curve and the point during training corresponding to the minimum cross-validation error varied with the composition of the data sets and the training conditions. The study indicates that training error is not a reliable indicator of a network's ability to generalize. To find the point during training when a network generalizes best, one must monitor cross-validation error separately.

Download Full-text

A Line Heat Input Model for Additive Manufacturing

Volume 1: Processing ◽

10.1115/msec2015-9240 ◽

2015 ◽

Cited By ~ 1

Author(s):

Jeff Irwin ◽

P. Michaleris

Keyword(s):

Additive Manufacturing ◽

Heat Source ◽

Heat Input ◽

Source Model ◽

Computational Time ◽

Computationally Efficient ◽

Powder Bed ◽

Time Increment ◽

Input Model ◽

Time Required

A line input model has been developed which makes the accurate modeling of powder bed processes more computationally efficient. Goldak’s ellipsoidal model has been used extensively to model heat sources in additive manufacturing, including lasers and electron beams. To accurately model the motion of the heat source, the simulation time increments must be small enough such that the source moves a distance smaller than its radius over the course of each increment. When the source radius is small and its velocity is large, a strict condition is imposed on the size of time increments regardless of any stability criteria. In powder bed systems, where radii of 0.1 mm and velocities of 500 mm/s are typical, a significant computational burden can result. The line heat input model relieves this burden by averaging the heat source over its path. This model allows the simulation of an entire heat source scan in just one time increment. However, such large time increments can lead to inaccurate results. Instead, the scan is broken up into several linear segments, each of which is applied in one increment. In this work, time increments are found that yield accurate results (less than 10 % displacement error) and require less than 1/10 of the CPU time required by Goldak’s moving source model. A dimensionless correlation is given that can be used to determine the necessary time increment size that will greatly decrease the computational time required for any powder bed simulation while maintaining accuracy.

Download Full-text

Time consumption and quality of an automated fusion tool for SPECT and MRI images of the brain

Nuklearmedizin ◽

10.1055/s-0038-1625192 ◽

2003 ◽

Vol 42 (05) ◽

pp. 215-219

Author(s):

G. Platsch ◽

A. Schwarz ◽

K. Schmiedehausen ◽

B. Tomandl ◽

W. Huk ◽

...

Keyword(s):

Data Sets ◽

3D Mri ◽

Clinical Routine ◽

Fusion Procedure ◽

3D Data ◽

Mri Scans ◽

2D Data ◽

Time Required ◽

The Brain ◽

Manual Correction

Summary: Aim: Although the fusion of images from different modalities may improve diagnostic accuracy, it is rarely used in clinical routine work due to logistic problems. Therefore we evaluated performance and time needed for fusing MRI and SPECT images using a semiautomated dedicated software. Patients, material and Method: In 32 patients regional cerebral blood flow was measured using 99mTc ethylcystein dimer (ECD) and the three-headed SPECT camera MultiSPECT 3. MRI scans of the brain were performed using either a 0,2 T Open or a 1,5 T Sonata. Twelve of the MRI data sets were acquired using a 3D-T1w MPRAGE sequence, 20 with a 2D acquisition technique and different echo sequences. Image fusion was performed on a Syngo workstation using an entropy minimizing algorithm by an experienced user of the software. The fusion results were classified. We measured the time needed for the automated fusion procedure and in case of need that for manual realignment after automated, but insufficient fusion. Results: The mean time of the automated fusion procedure was 123 s. It was for the 2D significantly shorter than for the 3D MRI datasets. For four of the 2D data sets and two of the 3D data sets an optimal fit was reached using the automated approach. The remaining 26 data sets required manual correction. The sum of the time required for automated fusion and that needed for manual correction averaged 320 s (50-886 s). Conclusion: The fusion of 3D MRI data sets lasted significantly longer than that of the 2D MRI data. The automated fusion tool delivered in 20% an optimal fit, in 80% manual correction was necessary. Nevertheless, each of the 32 SPECT data sets could be merged in less than 15 min with the corresponding MRI data, which seems acceptable for clinical routine use.

Download Full-text