The algorithm for generating the training set for the problem of elastoplastic deformation of the metal rod

2021
Vol 2070 (1)
pp. 012042
Author(s):  
Mykhailo Seleznov

Abstract The paper proposes an algorithm for forming a small training set that provides reasonable quality for a surrogate ML model of the problem of elastoplastic deformation of a metal rod under a longitudinal load pulse. This dynamic physical problem is computationally simple and convenient for testing various approaches, yet physically rich, since it exhibits a significant range of effects; methods validated on it can therefore be transferred to other areas. The work demonstrates that a surrogate ML model can provide reasonable prediction quality for a dynamic physical problem even with a small training set.
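A minimal sketch of the general idea, not the paper's specific algorithm: sample the load-pulse parameters with a space-filling design, run the expensive solver on the few sampled points, and fit a regressor as the surrogate. The solver `solve_rod`, the parameter bounds, and the training-set size are all hypothetical placeholders.

```python
# Sketch: small space-filling training set for a surrogate model.
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import RandomForestRegressor

def solve_rod(amplitude, duration):
    """Placeholder for the expensive elastoplastic simulation;
    returns a scalar response (e.g., residual plastic strain)."""
    return amplitude * np.tanh(duration)  # toy response, not physics

# Latin hypercube design over two pulse parameters (amplitude, duration).
sampler = qmc.LatinHypercube(d=2, seed=0)
unit = sampler.random(n=50)                        # deliberately small set
X = qmc.scale(unit, l_bounds=[0.1, 0.01], u_bounds=[5.0, 1.0])
y = np.array([solve_rod(a, t) for a, t in X])      # 50 expensive solves

surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(surrogate.predict([[2.5, 0.5]]))             # cheap approximate solve
```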

2011
Author(s):
Jeffrey S. Katz
John F. Magnotti
Anthony A. Wright

2021
Vol 13 (3)
pp. 368
Author(s):
Christopher A. Ramezan
Timothy A. Warner
Aaron E. Maxwell
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing the use of multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when the training sample size decreased from 10,000 to 315 samples. GBM provided overall accuracy similar to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes; NEU also required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, its overall accuracies were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
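A minimal sketch of this experimental design, assuming synthetic features in place of the GEOBIA object attributes: the same scikit-learn classifiers are refit at shrinking training-set sizes and scored on a fixed test set (LVQ is omitted, as scikit-learn has no implementation).

```python
# Sketch: classifier accuracy as a function of training-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000,
                                                  stratify=y, random_state=0)

models = {"RF": RandomForestClassifier(random_state=0),
          "GBM": GradientBoostingClassifier(random_state=0),
          "k-NN": KNeighborsClassifier(),
          "NEU": MLPClassifier(max_iter=1000, random_state=0),
          "SVM": SVC()}

for n in (10000, 1000, 315, 40):          # sizes echoing the study's range
    idx = np.random.default_rng(0).choice(len(X_pool), n, replace=False)
    for name, model in models.items():
        acc = model.fit(X_pool[idx], y_pool[idx]).score(X_test, y_test)
        print(f"n={n:>5}  {name:<5} overall accuracy={acc:.3f}")
```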


2020
Author(s):  
Jenke Scheen
Wilson Wu
Antonia S. J. S. Mey
Paolo Tosco
Mark Mackey
...  

A methodology that combines alchemical free energy calculations (FEP) with machine learning (ML) has been developed to compute accurate absolute hydration free energies. The hybrid FEP/ML methodology was trained on a subset of the FreeSolv database and retrospectively shown to outperform most submissions from the SAMPL4 competition. Compared to pure machine-learning approaches, FEP/ML yields more precise estimates of free energies of hydration and requires only a fraction of the training set size to outperform standalone FEP calculations. The ML-derived correction terms are further shown to be transferable to a range of related FEP simulation protocols. The approach may be used to inexpensively improve the accuracy of FEP calculations and to flag molecules that will benefit most from bespoke forcefield parameterisation efforts.
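The paper's descriptors and protocol are not reproduced here; the following is a hedged sketch of the general hybrid pattern, assuming synthetic stand-ins for the molecular features and free energies: a regressor is trained on the FEP error against reference values, and its predicted offset is added to new FEP estimates.

```python
# Sketch: ML-derived correction term applied on top of FEP estimates.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))        # stand-in molecular descriptors
dg_exp = rng.normal(size=200)               # "experimental" hydration dG
dg_fep = dg_exp + 0.3 * features[:, 0] + rng.normal(scale=0.2, size=200)

# Learn the systematic FEP error, then correct the raw FEP values with it.
corrector = GradientBoostingRegressor(random_state=0)
corrector.fit(features, dg_exp - dg_fep)
dg_hybrid = dg_fep + corrector.predict(features)

print("MAE raw FEP:   ", np.abs(dg_fep - dg_exp).mean())
print("MAE hybrid:    ", np.abs(dg_hybrid - dg_exp).mean())
```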


Author(s):  
André Maletzke
Waqar Hassan
Denis dos Reis
Gustavo Batista

Quantification is a task similar to classification in the sense that it learns from a labeled training set. However, quantification is not interested in predicting the class of each observation, but rather in estimating the class distribution of the test set. The community has developed performance measures and experimental setups tailored to quantification tasks. Nonetheless, we argue that a critical variable, the size of the test set, remains ignored. Such disregard has three main detrimental effects. First, it implicitly assumes that quantifiers will perform equally well for different test set sizes. Second, it increases the risk of cherry-picking by selecting a test set size for which a particular proposal performs best. Finally, it disregards the importance of designing methods that are suitable for different test set sizes. We discuss these issues with the support of one of the broadest experimental evaluations ever performed, with three main outcomes. (i) We empirically demonstrate the importance of the test set size when assessing quantifiers. (ii) We show that current quantifiers generally have mediocre performance on the smallest test sets. (iii) We propose a metalearning scheme that selects the best quantifier based on the test set size, and show that it can outperform the best single quantification method.
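A minimal sketch of the test-size effect, assuming a simple binary setup and two textbook quantifiers, classify-and-count (CC) and adjusted classify-and-count (ACC); this is illustrative background, not the authors' metalearning scheme.

```python
# Sketch: quantifier error depends on the test-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# tpr/fpr estimated on the training data (ideally cross-validated).
pred_tr = clf.predict(X_tr)
tpr = pred_tr[y_tr == 1].mean()
fpr = pred_tr[y_tr == 0].mean()

rng = np.random.default_rng(0)
for n in (10, 100, 1000):                      # varying test-set sizes
    idx = rng.choice(len(X_te), n, replace=False)
    cc = clf.predict(X_te[idx]).mean()         # classify and count
    acc_q = np.clip((cc - fpr) / (tpr - fpr), 0, 1)   # adjusted count
    print(f"n={n:>4}  true={y_te[idx].mean():.2f}  CC={cc:.2f}  ACC={acc_q:.2f}")
```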


2021
Vol 893 (1)
pp. 012028
Author(s):  
Robi Muharsyah
Dian Nur Ratri
Damiana Fitria Kussatiti

Abstract Prediction of Sea Surface Temperature (SST) in the Niño3.4 region (170°W–120°W; 5°S–5°N) is important as a valuable indicator for identifying El Niño Southern Oscillation (ENSO) conditions, i.e., El Niño, La Niña, and Neutral, in the coming months. More accurate prediction of Niño3.4 SST can be used to determine the response of rainfall over the Indonesian region to the ENSO phenomenon. SST predictions are routinely released by meteorological institutions such as the European Centre for Medium-Range Weather Forecasts (ECMWF). However, SST predictions taken directly from the raw output (RAW) of global models such as the ECMWF seasonal forecast suffer from bias, which degrades the quality of the SST predictions and consequently increases the potential for errors in predicting ENSO events. This study uses SST from the Ensemble Prediction System (EPS) output of the ECMWF seasonal forecast, namely SEAS5. SEAS5 SST is downloaded from the Copernicus Climate Change Service (C3S) for the period 1993-2020. One value representing SST over the Niño3.4 region is calculated for each lead time (LT), LT0-LT6. Bayesian Model Averaging (BMA) is selected as the post-processing method to improve the prediction quality of SEAS5-RAW. The advantage of BMA over other post-processing methods is its ability to quantify the uncertainty in the EPS, expressed as a predictive probability density function (PDF). It was found that the BMA calibration process reaches optimal performance with a 160-month training window. The results show that the prediction quality of the BMA output for Niño3.4 SST is superior to SEAS5-RAW, especially for LT0, LT1, and LT2. In terms of deterministic prediction, BMA shows a lower Root Mean Square Error (RMSE) and a higher Proportion of Correct (PC). In terms of probabilistic prediction, the error rate of BMA, as measured by the Brier Score, is lower than that of RAW. Moreover, BMA shows a good ability to discriminate ENSO events, as indicated by an ROC AUC close to a perfect score.
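A minimal sketch of Gaussian BMA calibration in the spirit of Raftery et al. (2005), assuming synthetic stand-ins for the SEAS5 members and the observed Niño3.4 SST: member weights and a shared variance are fitted by EM on a training window, yielding the predictive PDF as a weighted sum of Gaussians. The paper's exact configuration is not reproduced.

```python
# Sketch: simplified Gaussian BMA fitted by EM on a training window.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 160                                  # members, training window
truth = rng.normal(27.0, 1.0, size=T)          # "observed" Nino3.4 SST
ens = truth[:, None] + rng.normal(0.5, 0.8, size=(T, K))  # biased members

bias = ens.mean(axis=0) - truth.mean()         # simple per-member debiasing
f = ens - bias

w = np.full(K, 1.0 / K)                        # BMA weights
var = 1.0                                      # shared kernel variance
for _ in range(200):                           # EM iterations
    # E-step: responsibility of member k for each observation.
    dens = w * np.exp(-0.5 * (truth[:, None] - f) ** 2 / var) / np.sqrt(var)
    z = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update weights and the shared variance.
    w = z.mean(axis=0)
    var = (z * (truth[:, None] - f) ** 2).sum() / T

print("weights:", np.round(w, 3), " sigma:", round(float(np.sqrt(var)), 3))
```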


2009
Vol 21 (7)
pp. 2082-2103
Author(s):
Shirish Shevade
S. Sundararajan

Gaussian processes (GPs) are promising Bayesian methods for classification and regression problems. Designing a GP classifier and making predictions with it are, however, computationally demanding, especially when the training set size is large. Sparse GP classifiers are known to overcome this limitation. In this letter, we propose and study a validation-based method for sparse GP classifier design. The proposed method uses a negative log predictive (NLP) loss measure, which is easy to compute for GP models. We use this measure for both basis vector selection and hyperparameter adaptation. Experimental results on several real-world benchmark data sets show better or comparable generalization performance compared with existing methods.
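A minimal sketch of the validation measure itself: the NLP loss of a GP classifier on held-out data, the quantity the letter uses for basis vector selection and hyperparameter adaptation. scikit-learn's dense GP classifier stands in here; the sparse design procedure is not reproduced.

```python
# Sketch: negative log predictive (NLP) loss of a GP classifier.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

gpc = GaussianProcessClassifier(random_state=0).fit(X_tr, y_tr)
p = np.clip(gpc.predict_proba(X_val)[:, 1], 1e-9, 1 - 1e-9)

# NLP loss: mean negative log probability assigned to the true class.
nlp = -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))
print(f"validation NLP loss: {nlp:.3f}")
```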


Author(s):  
Alexandre Moreira Nascimento
Vinícius Veloso de Melo
Luiz Alberto Vieira Dias
Adilson Marques da Cunha

1991
Vol 3 (2)
pp. 258-267
Author(s):
Gale L. Martin
James A. Pittman

We report results of training backpropagation nets on samples of hand-printed digits scanned from bank checks and hand-printed letters entered interactively into a computer through a stylus digitizer. Generalization results are reported as a function of training set size and network capacity. Given a large training set, and a net with sufficient capacity to achieve high performance on the training set, nets typically achieved error rates of 4-5% at a 0% reject rate and 1-2% at a 10% reject rate. The topology and capacity of the system, as measured by the number of connections in the net, have surprisingly little effect on generalization. For those developing hand-printed character recognition systems, these results suggest that a large and representative training sample may be the single most important factor in achieving high recognition accuracy. Benefits of reducing the number of net connections, other than improving generalization, are discussed.
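A minimal sketch of this evaluation style, assuming scikit-learn's small digits set in place of the bank-check data: a backpropagation net is trained at several training-set sizes and scored at a 0% and a 10% reject rate, rejecting the least confident predictions.

```python
# Sketch: error vs. training-set size at 0% and 10% reject rates.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

for n in (100, 500, len(X_tr)):
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                        random_state=0).fit(X_tr[:n], y_tr[:n])
    proba = net.predict_proba(X_te)
    pred = net.classes_[proba.argmax(axis=1)]
    conf = proba.max(axis=1)
    err0 = (pred != y_te).mean()                   # 0% reject rate
    keep = conf >= np.quantile(conf, 0.10)         # reject lowest-confidence 10%
    err10 = (pred[keep] != y_te[keep]).mean()
    print(f"n={n:>4}  error={err0:.3f}  error@10%reject={err10:.3f}")
```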


Molecules
2020
Vol 25 (6)
pp. 1452
Author(s):  
Igor Sieradzki
Damian Leśniak
Sabina Podlewska

A great variety of computational approaches support drug design processes, helping in the selection of new potentially active compounds and the optimization of their physicochemical and ADMET properties. Machine learning is a group of methods that can evaluate enormous amounts of data in a relatively short time. However, the quality of machine-learning-based prediction depends on the data supplied for model training. In this study, we used deep neural networks for the task of compound activity prediction and developed dropout-based approaches for estimating prediction uncertainty. Several types of analyses were performed: we examined the relationships between prediction error, similarity to the training set, prediction uncertainty, and the number and standard deviation of activity values. We tested whether incorporating information about prediction uncertainty influences compound ranking based on predicted activity, and we used prediction uncertainty to search for potential errors in the ChEMBL database. The outcome indicates that incorporating information about the uncertainty of compound activity predictions can be of great help during virtual screening experiments.
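A minimal sketch of one common dropout-based uncertainty estimate (Monte Carlo dropout), assuming random vectors in place of compound fingerprints and an illustrative architecture: dropout stays active at inference, and the spread of repeated stochastic forward passes is read as the prediction uncertainty.

```python
# Sketch: Monte Carlo dropout for activity-prediction uncertainty.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Dropout(0.25),
                      nn.Linear(256, 1))             # toy activity regressor

x = torch.randn(8, 1024)                             # 8 "fingerprints"
model.train()                                        # keep dropout stochastic
with torch.no_grad():
    samples = torch.stack([model(x).squeeze(-1) for _ in range(100)])

mean, std = samples.mean(dim=0), samples.std(dim=0)  # prediction + uncertainty
for m, s in zip(mean.tolist(), std.tolist()):
    print(f"predicted activity {m:+.3f} ± {s:.3f}")
```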

