Metadynamics sampling in atomic environment space for collecting training data for machine learning potentials

2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Dongsun Yoo ◽  
Jisu Jung ◽  
Wonseok Jeong ◽  
Seungwu Han

The universal mathematical form of machine-learning potentials (MLPs) shifts the core of interatomic-potential development to collecting proper training data. Ideally, the training set should encompass diverse local atomic environments, but conventional approaches are prone to sampling similar configurations repeatedly, mainly due to Boltzmann statistics. As a result, practitioners manually handpick a large pool of distinct configurations, stretching the development period significantly. To overcome this hurdle, methods that automatically generate training data are being proposed. Herein, we suggest a sampling method optimized for gathering diverse yet relevant configurations semi-automatically. This is achieved by applying metadynamics with the descriptor of the local atomic environment as a collective variable. As a result, the simulation is automatically steered toward unvisited regions of local environment space, so that each atom experiences diverse chemical environments without redundancy. We apply the proposed metadynamics sampling to H:Pt(111), GeTe, and Si systems. Throughout these examples, a small number of metadynamics trajectories provide the reference structures necessary for training high-fidelity MLPs. By proposing a semi-automatic sampling method tuned for MLPs, the present work paves the way for applying MLPs to a wide range of challenging problems.
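A minimal sketch of the core idea, assuming a generic per-atom descriptor array and user-chosen Gaussian hill parameters (both are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

# Illustrative sketch: a bias potential built from Gaussian hills deposited in
# descriptor (local-environment) space, in the spirit of metadynamics.
# `descriptors` is an (n_atoms, n_features) array from any local descriptor
# (e.g., a symmetry-function or SOAP-like vector); the exact descriptor and
# the hill parameters are assumptions for illustration only.

class DescriptorMetadynamics:
    def __init__(self, height=0.05, width=0.3):
        self.height = height   # hill height (energy units)
        self.width = width     # hill width in descriptor space
        self.hills = []        # previously deposited descriptor centers

    def deposit(self, descriptors):
        """Deposit one Gaussian hill per atom at its current descriptor."""
        self.hills.extend(np.asarray(descriptors))

    def bias_energy(self, descriptors):
        """Bias energy that grows in already-visited regions of descriptor space."""
        if not self.hills:
            return 0.0
        d = np.asarray(descriptors)[:, None, :] - np.asarray(self.hills)[None, :, :]
        dist2 = np.sum(d * d, axis=-1)   # (n_atoms, n_hills)
        return self.height * np.exp(-dist2 / (2 * self.width ** 2)).sum()
```

Because the bias accumulates wherever atoms have already been, the dynamics is pushed toward unvisited regions of descriptor space, which is the sampling behavior described above.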

2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generate numerous candidate train/test splits using the R-value-based sampling method. We then evaluate how similar the distribution of each candidate is to that of the whole dataset, and the candidate with the smallest distribution difference is selected as the final train/test split. Histograms and feature importance are used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
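A hedged sketch of the selection step, with plain random candidate splits standing in for R-value-based sampling and a per-feature histogram distance as the similarity measure (both are simplifications for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch: generate many candidate train/test splits and keep the one whose
# training-set feature distributions are closest to the whole dataset.

def histogram_distance(a, b, bins=20):
    """Summed L1 distance between normalized per-feature histograms."""
    dist = 0.0
    for j in range(a.shape[1]):
        lo, hi = min(a[:, j].min(), b[:, j].min()), max(a[:, j].max(), b[:, j].max())
        ha, _ = np.histogram(a[:, j], bins=bins, range=(lo, hi), density=True)
        hb, _ = np.histogram(b[:, j], bins=bins, range=(lo, hi), density=True)
        dist += np.abs(ha - hb).sum()
    return dist

def select_split(X, y, n_candidates=100, test_size=0.3, seed=0):
    best, best_dist = None, np.inf
    for i in range(n_candidates):
        split = train_test_split(X, y, test_size=test_size, random_state=seed + i)
        X_tr = split[0]
        d = histogram_distance(X_tr, X)   # similarity of the train set to the whole data
        if d < best_dist:
            best, best_dist = split, d
    return best  # (X_train, X_test, y_train, y_test)
```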


2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when the training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes; NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
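As a hedged illustration of the experimental design (not the study's actual data or code), the following sketch trains the same classifier on nested subsets of decreasing size and records overall accuracy on a fixed test set, with synthetic data standing in for the GEOBIA image-object features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for image-object features; 2,000 samples held out for testing.
X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000,
                                                  stratify=y, random_state=0)

# Decreasing training sizes, evaluated against the same fixed test set.
for n in (10000, 2000, 315, 40):
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_pool[:n], y_pool[:n])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"n = {n:>5}: overall accuracy = {acc:.3f}")
```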


2020 ◽  
pp. 105971231989648 ◽  
Author(s):  
David Windridge ◽  
Henrik Svensson ◽  
Serge Thill

We consider the benefits of dream mechanisms – that is, the ability to simulate new experiences based on past ones – in a machine learning context. Specifically, we are interested in learning for artificial agents that act in the world, and operationalize “dreaming” as a mechanism by which such an agent can use its own model of the learning environment to generate new hypotheses and training data. We first show that it is not necessarily a given that such a data-hallucination process is useful, since it can easily lead to a training set dominated by spurious imagined data until an ill-defined convergence point is reached. We then analyse a notably successful implementation of a machine learning-based dreaming mechanism by Ha and Schmidhuber (Ha, D., & Schmidhuber, J. (2018). World models. arXiv e-prints, arXiv:1803.10122). On that basis, we then develop a general framework by which an agent can generate simulated data to learn from in a manner that is beneficial to the agent. This, we argue, then forms a general method for an operationalized dream-like mechanism. We finish by demonstrating the general conditions under which such mechanisms can be useful in machine learning, wherein the implicit simulator inference and extrapolation involved in dreaming act without reinforcing inference error even when inference is incomplete.
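As a loose, hedged sketch of the dream-style loop discussed above (all class and method names are illustrative placeholders, not the World Models API):

```python
# Schematic of a dream-style training loop: a learned world model generates
# imagined rollouts that are mixed into the agent's training data.
# `agent`, `world_model`, and `env` are placeholder objects for illustration.

def dream_training_loop(agent, world_model, env, n_iterations=10,
                        n_real_episodes=5, n_dream_episodes=20):
    for _ in range(n_iterations):
        # 1. Collect real experience and refine the world model on it.
        real_data = [agent.rollout(env) for _ in range(n_real_episodes)]
        world_model.fit(real_data)

        # 2. "Dream": generate imagined rollouts from the learned model.
        dreamed_data = [agent.rollout(world_model) for _ in range(n_dream_episodes)]

        # 3. Train on both; anchoring on real data limits the risk of the
        #    training set being dominated by spurious imagined trajectories.
        agent.train(real_data + dreamed_data)
```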


2019 ◽  
Vol 123 (12) ◽  
pp. 6941-6957 ◽  
Author(s):  
Henry Chan ◽  
Badri Narayanan ◽  
Mathew J. Cherukara ◽  
Fatih G. Sen ◽  
Kiran Sasikumar ◽  
...  

Author(s):  
Tomasz Kajdanowicz ◽  
Slawomir Plamowski ◽  
Przemyslaw Kazienko

Choosing a proper training set for machine learning tasks is of great importance in complex domain problems. In this paper, a new distance measure for training set selection is presented and thoroughly discussed. The distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real domain datasets from a debt portfolio valuation process. Finally, prediction performance is examined.
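A sketch of one plausible reading of such a measure, in which each dataset is clustered, the entropy of the class distribution within each cluster is computed, and datasets are compared through the variance of those entropies (the paper's exact formulation may differ):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import KMeans

def entropy_variance(X, y, n_clusters=5, seed=0):
    """Variance of per-cluster label entropies for one dataset."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    entropies = []
    for c in range(n_clusters):
        _, counts = np.unique(y[labels == c], return_counts=True)
        entropies.append(entropy(counts / counts.sum()))
    return np.var(entropies)

def dataset_distance(X1, y1, X2, y2, n_clusters=5):
    """Distance between two datasets as the gap between their entropy variances."""
    return abs(entropy_variance(X1, y1, n_clusters) - entropy_variance(X2, y2, n_clusters))
```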


2021 ◽  
Author(s):  
Roman Zubatyuk ◽  
Justin Smith ◽  
Benjamin T. Nebgen ◽  
Sergei Tretiak ◽  
Olexandr Isayev

Physics-inspired Artificial Intelligence (AI) is at the forefront of methods development in molecular modeling and computational chemistry. In particular, interatomic potentials derived with machine learning algorithms such as Deep Neural Networks (DNNs) achieve the accuracy of high-fidelity quantum mechanical (QM) methods in areas traditionally dominated by empirical force fields and allow massive simulations. The applicability domain of DNN potentials is usually limited by the type of training data. As such, transferable models aim to be extensible in describing the chemical and conformational diversity of organic molecules. However, most DNN potentials, such as the AIMNet model we proposed previously, were parametrized for neutral molecules or closed-shell ions due to architectural limitations. In this work, we extend the machine learning framework toward open-shell anions and cations. We introduce the AIMNet-NSE (Neural Spin Equilibration) architecture, which, when properly trained, can predict atomic and molecular properties for an arbitrary combination of molecular charge and spin multiplicity. This model explores a new dimension of transferability by adding the charge-spin space. The AIMNet-NSE model reproduces reference QM energies for cations, neutrals, and anions with errors of about 2-3 kcal/mol. The spin charges have errors of about 0.01 electrons for small organic molecules containing nine chemical elements {H, C, N, O, F, Si, P, S and Cl}. The AIMNet-NSE model makes it possible to fully bypass QM calculations and derive the ionization potential, electron affinity, and conceptual Density Functional Theory quantities like electronegativity, hardness, and condensed Fukui functions at a rate of up to 10^4 molecules per second on a single modern GPU. We show that these descriptors, along with the learned atomic representations, can be used to model chemical reactivity through an example of regioselectivity in electrophilic aromatic substitution reactions.
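The conceptual-DFT quantities mentioned above follow from standard finite-difference definitions once energies and atomic charges of the cation, neutral, and anion are available. A minimal sketch (the inputs would come from a model's predictions; nothing here reflects AIMNet-NSE's actual API):

```python
# Conceptual-DFT quantities from energies and per-atom charges of the cation,
# neutral, and anion (standard finite-difference definitions).

def conceptual_dft(E_cation, E_neutral, E_anion, q_cation, q_neutral, q_anion):
    ip = E_cation - E_neutral          # ionization potential
    ea = E_neutral - E_anion           # electron affinity
    chi = 0.5 * (ip + ea)              # Mulliken electronegativity
    eta = 0.5 * (ip - ea)              # chemical hardness (Parr-Pearson)
    # Condensed Fukui functions per atom from atomic-charge differences.
    f_plus = [qn - qa for qn, qa in zip(q_neutral, q_anion)]    # nucleophilic attack
    f_minus = [qc - qn for qc, qn in zip(q_cation, q_neutral)]  # electrophilic attack
    return {"IP": ip, "EA": ea, "chi": chi, "eta": eta,
            "f_plus": f_plus, "f_minus": f_minus}
```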


2018 ◽  
Vol 86 (1) ◽  
Author(s):  
Xin Lei ◽  
Chang Liu ◽  
Zongliang Du ◽  
Weisheng Zhang ◽  
Xu Guo

In the present work, it is intended to discuss how to achieve real-time structural topology optimization (i.e., obtaining the optimized distribution of a certain amount of material in a prescribed design domain almost instantaneously once the objective/constraint functions and external stimuli/boundary conditions are specified), an ultimate dream pursued by engineers in various disciplines, using machine learning (ML) techniques. To this end, the so-called moving morphable component (MMC)-based explicit framework for topology optimization is adopted for generating the training set, and support vector regression (SVR) as well as K-nearest-neighbors (KNN) ML models are employed to establish the mapping between the design parameters characterizing the layout/topology of an optimized structure and the external load. Compared with existing approaches, the proposed approach not only reduces the training data and the dimension of the parameter space substantially, but also has the potential of establishing engineering intuition about optimized structures corresponding to various external loads through the learning process. Numerical examples provided demonstrate the effectiveness and advantages of the proposed approach.
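A hedged sketch of the surrogate step: learning a mapping from external-load parameters to the MMC design parameters of precomputed optimized structures. Random arrays stand in for the training pairs, which in practice would be produced by running the MMC-based optimizer offline:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor

# Placeholder training pairs: load parameters -> optimized MMC design parameters.
loads = np.random.rand(200, 2)       # e.g., load position and magnitude
designs = np.random.rand(200, 48)    # MMC component parameters of the optima

svr = MultiOutputRegressor(SVR(kernel="rbf", C=10.0)).fit(loads, designs)
knn = KNeighborsRegressor(n_neighbors=5).fit(loads, designs)

# Near real-time prediction of an optimized layout for a new load case.
new_load = np.array([[0.4, 0.7]])
design_svr = svr.predict(new_load)
design_knn = knn.predict(new_load)
```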


2019 ◽  
Vol 11 (22) ◽  
pp. 2596 ◽  
Author(s):  
Luca Zappa ◽  
Matthias Forkel ◽  
Angelika Xaver ◽  
Wouter Dorigo

Agricultural and hydrological applications could greatly benefit from soil moisture (SM) information at sub-field resolution and (sub-)daily revisit time. However, current operational satellite missions provide soil moisture information at either lower spatial or lower temporal resolution. Here, we downscale coarse-resolution (25–36 km) satellite SM products with quasi-daily resolution to the field scale (30 m) using the random forest (RF) machine learning algorithm. RF models are trained with remotely sensed SM and ancillary variables on soil texture, topography, and vegetation cover against SM measured in the field. The approach is developed and tested in an agricultural catchment equipped with a high-density network of low-cost SM sensors. Our results show a strong consistency between the downscaled and observed SM spatio-temporal patterns. We found that topography has higher predictive power for downscaling than soil texture, due to the hilly landscape of the study area. Furthermore, including a proxy of vegetation cover results in considerable improvements in performance. Increasing the training set size leads to a significant gain in model skill, and expanding the training set is likely to further enhance the accuracy. When only limited in-situ measurements are available as training data, increasing the number of sensor locations should be favored over extending the duration of the measurements for improved downscaling performance. In this regard, we show the potential of low-cost sensors as a practical and cost-effective solution for gathering the necessary observations. Overall, our findings highlight the suitability of using ground measurements in conjunction with machine learning to derive high spatially resolved SM maps from coarse-scale satellite products.
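A hedged sketch of the downscaling step: a random forest trained to predict in-situ soil moisture from the coarse satellite product plus static ancillary predictors available at 30 m (column names and random values are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative training table: one row per sensor location and time step.
df = pd.DataFrame({
    "sm_coarse": np.random.rand(1000),      # 25-36 km satellite soil moisture
    "elevation": np.random.rand(1000),      # topography
    "slope": np.random.rand(1000),
    "clay_fraction": np.random.rand(1000),  # soil texture
    "ndvi": np.random.rand(1000),           # proxy of vegetation cover
    "sm_insitu": np.random.rand(1000),      # target: field-scale sensor reading
})

features = ["sm_coarse", "elevation", "slope", "clay_fraction", "ndvi"]
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(df[features], df["sm_insitu"])

# Applying the model to every 30 m pixel yields the downscaled soil moisture map.
sm_30m = rf.predict(df[features])
```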


Author(s):  
Andrey Parasich ◽  
Victor Parasich ◽  
Irina Parasich

Introduction: Proper training set formation is a key factor in machine learning. In real training sets, problems and errors commonly occur and have a critical impact on the training result. A training set needs to be formed in every machine learning problem; therefore, knowledge of possible difficulties is helpful. Purpose: To give an overview of possible problems in the formation of a training set, in order to facilitate their detection and elimination when working with real training sets, and to analyze the impact of these problems on training results. Results: The article gives an overview of possible errors in training set formation, such as lack of data, imbalance, false patterns, sampling from a limited set of sources, change in the general population over time, and others. We discuss the influence of these errors on the result of the training, test set formation, and the measurement of training algorithm quality. Pseudo-labeling, data augmentation, and hard-sample mining are considered the most effective ways to expand a training set. We offer practical recommendations for the formation of a training or test set. Examples from the practice of Kaggle competitions are given. For the problem of cross-dataset generalization in neural network training, we propose an algorithm called Cross-Dataset Machine, which is simple to implement and yields a gain in cross-dataset generalization. Practical relevance: The materials of the article can be used as a practical guide in solving machine learning problems.
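A hedged sketch of pseudo-labeling, one of the training-set expansion techniques mentioned above: confident predictions on unlabeled data are added to the labeled pool and the model is retrained (the classifier and the 0.9 confidence threshold are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    # 1. Fit an initial model on the labeled data only.
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_labeled, y_labeled)

    # 2. Keep only the unlabeled samples the model is confident about.
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = proba.argmax(axis=1)[confident]

    # 3. Expand the training set with the pseudo-labeled samples and retrain.
    X_new = np.vstack([X_labeled, X_unlabeled[confident]])
    y_new = np.concatenate([y_labeled, model.classes_[pseudo_y]])
    model.fit(X_new, y_new)
    return model
```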


The problem statement with respect to company organizations, institutions, and students is that companies take a long time to recruit, which is a major challenge for them, and there is no dedicated platform for recruiting candidates on preferred qualifications. Institutions are unable to achieve 100% placement among eligible students and do not provide proper training on minimum and preferred qualifications. Candidates are unable to get targeted training from the college organization; the college should train candidates in the areas where they lag behind and strengthen them in preferred qualifications and all other aspects. “52% of Talent Acquisition leaders say the hardest part of recruitment is screening candidates from a large applicant pool”. Screening students from a large pool often takes up the largest portion of recruitment time. Moreover, college organizations are not training students effectively based on company requirements. An analysis of each student must be carried out to identify where the student is failing to secure a placement, and companies do not know the personality of a student while recruiting. To solve this bottleneck in recruiting, we created this automation tool. Its main process determines whether a candidate is qualified based on minimum qualifications such as CGPA, certifications, completed projects, and internships. The project has two main goals: 1. To decide whether to move a student forward to an interview or to reject them. 2. To let the college organization give additional training to students who were rejected over small issues such as communication, programming, or aptitude… This process is based on minimum qualifications and preferred qualifications, both of which are useful to recruiters. These qualifications can include project experience, education, skills and knowledge, personality traits, and competencies. Minimum qualifications are the mandatory qualifications that the company requires, while preferred qualifications are not mandatory but make a student stronger than other students. Personality and technical knowledge can be assessed accurately by the faculty, mentor, and H.O.D.
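A minimal sketch of the screening rule described above: a candidate moves to the interview stage only if all minimum qualifications are met, while preferred qualifications contribute to a ranking score (all thresholds and field names are illustrative, not the tool's actual criteria):

```python
# Illustrative screening rule: minimum qualifications gate the interview,
# preferred qualifications produce a ranking score.
MIN_CGPA = 7.0

def screen(candidate):
    """candidate: dict with keys cgpa, projects, internships, certifications, etc."""
    meets_minimum = (
        candidate["cgpa"] >= MIN_CGPA
        and candidate["projects"] >= 1
        and candidate["internships"] >= 1
    )
    preferred_score = (
        candidate.get("certifications", 0)
        + candidate.get("communication", 0)
        + candidate.get("aptitude", 0)
    )
    return ("interview" if meets_minimum else "training"), preferred_score

decision, score = screen({"cgpa": 8.1, "projects": 2, "internships": 1,
                          "certifications": 3, "communication": 4, "aptitude": 5})
```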

