Metadynamics sampling in atomic environment space for collecting training data for machine learning potentials

2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Dongsun Yoo ◽  
Jisu Jung ◽  
Wonseok Jeong ◽  
Seungwu Han

The universal mathematical form of machine-learning potentials (MLPs) shifts the core of interatomic-potential development to collecting proper training data. Ideally, the training set should encompass diverse local atomic environments, but conventional approaches are prone to sampling similar configurations repeatedly, mainly due to Boltzmann statistics. As a result, practitioners manually handpick a large pool of distinct configurations, stretching the development period significantly. To overcome this hurdle, methods that automatically generate training data are being proposed. Herein, we suggest a sampling method optimized for gathering diverse yet relevant configurations semi-automatically. This is achieved by applying metadynamics with the descriptor of the local atomic environment as a collective variable. As a result, the simulation is automatically steered toward unvisited regions of local environment space, so that each atom experiences diverse chemical environments without redundancy. We apply the proposed metadynamics sampling to H:Pt(111), GeTe, and Si systems. Throughout these examples, a small number of metadynamics trajectories provide the reference structures necessary for training high-fidelity MLPs. By proposing a semi-automatic sampling method tuned for MLPs, the present work paves the way for applying MLPs to a wide range of challenging problems.
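A minimal sketch of the core idea, assuming a generic per-atom descriptor array and user-chosen Gaussian hill parameters (both are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

# Illustrative sketch: a bias potential built from Gaussian hills deposited in
# descriptor (local-environment) space, in the spirit of metadynamics.
# `descriptors` is an (n_atoms, n_features) array from any local descriptor
# (e.g., a symmetry-function or SOAP-like vector); the exact descriptor and
# the hill parameters are assumptions for illustration only.

class DescriptorMetadynamics:
    def __init__(self, height=0.05, width=0.3):
        self.height = height   # hill height (energy units)
        self.width = width     # hill width in descriptor space
        self.hills = []        # previously deposited descriptor centers

    def deposit(self, descriptors):
        """Deposit one Gaussian hill per atom at its current descriptor."""
        self.hills.extend(np.asarray(descriptors))

    def bias_energy(self, descriptors):
        """Bias energy that grows in already-visited regions of descriptor space."""
        if not self.hills:
            return 0.0
        d = np.asarray(descriptors)[:, None, :] - np.asarray(self.hills)[None, :, :]
        dist2 = np.sum(d * d, axis=-1)   # (n_atoms, n_hills)
        return self.height * np.exp(-dist2 / (2 * self.width ** 2)).sum()
```

Because the bias accumulates wherever atoms have already been, the dynamics is pushed toward unvisited regions of descriptor space, which is the sampling behavior described above.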

2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generate numerous candidate train/test splits using the R-value-based sampling method. We then evaluate how similar the distribution of each candidate is to that of the whole dataset, and the candidate with the smallest distribution difference is selected as the final train/test split. Histograms and feature importance are used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
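A hedged sketch of the selection step, with plain random candidate splits standing in for R-value-based sampling and a per-feature histogram distance as the similarity measure (both are simplifications for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch: generate many candidate train/test splits and keep the one whose
# training-set feature distributions are closest to the whole dataset.

def histogram_distance(a, b, bins=20):
    """Summed L1 distance between normalized per-feature histograms."""
    dist = 0.0
    for j in range(a.shape[1]):
        lo, hi = min(a[:, j].min(), b[:, j].min()), max(a[:, j].max(), b[:, j].max())
        ha, _ = np.histogram(a[:, j], bins=bins, range=(lo, hi), density=True)
        hb, _ = np.histogram(b[:, j], bins=bins, range=(lo, hi), density=True)
        dist += np.abs(ha - hb).sum()
    return dist

def select_split(X, y, n_candidates=100, test_size=0.3, seed=0):
    best, best_dist = None, np.inf
    for i in range(n_candidates):
        split = train_test_split(X, y, test_size=test_size, random_state=seed + i)
        X_tr = split[0]
        d = histogram_distance(X_tr, X)   # similarity of the train set to the whole data
        if d < best_dist:
            best, best_dist = split, d
    return best  # (X_train, X_test, y_train, y_test)
```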


2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when the training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes; NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
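As a hedged illustration of the experimental design (not the study's actual data or code), the following sketch trains the same classifier on nested subsets of decreasing size and records overall accuracy on a fixed test set, with synthetic data standing in for the GEOBIA image-object features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for image-object features; 2,000 samples held out for testing.
X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000,
                                                  stratify=y, random_state=0)

# Decreasing training sizes, evaluated against the same fixed test set.
for n in (10000, 2000, 315, 40):
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_pool[:n], y_pool[:n])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"n = {n:>5}: overall accuracy = {acc:.3f}")
```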


2020 ◽  
pp. 105971231989648 ◽  
Author(s):  
David Windridge ◽  
Henrik Svensson ◽  
Serge Thill

We consider the benefits of dream mechanisms – that is, the ability to simulate new experiences based on past ones – in a machine learning context. Specifically, we are interested in learning for artificial agents that act in the world, and operationalize “dreaming” as a mechanism by which such an agent can use its own model of the learning environment to generate new hypotheses and training data. We first show that it is not necessarily a given that such a data-hallucination process is useful, since it can easily lead to a training set dominated by spurious imagined data until an ill-defined convergence point is reached. We then analyse a notably successful implementation of a machine learning-based dreaming mechanism by Ha and Schmidhuber (Ha, D., & Schmidhuber, J. (2018). World models. arXiv e-prints, arXiv:1803.10122). On that basis, we then develop a general framework by which an agent can generate simulated data to learn from in a manner that is beneficial to the agent. This, we argue, then forms a general method for an operationalized dream-like mechanism. We finish by demonstrating the general conditions under which such mechanisms can be useful in machine learning, wherein the implicit simulator inference and extrapolation involved in dreaming act without reinforcing inference error even when inference is incomplete.
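As a loose, hedged sketch of the dream-style loop discussed above (all class and method names are illustrative placeholders, not the World Models API):

```python
# Schematic of a dream-style training loop: a learned world model generates
# imagined rollouts that are mixed into the agent's training data.
# `agent`, `world_model`, and `env` are placeholder objects for illustration.

def dream_training_loop(agent, world_model, env, n_iterations=10,
                        n_real_episodes=5, n_dream_episodes=20):
    for _ in range(n_iterations):
        # 1. Collect real experience and refine the world model on it.
        real_data = [agent.rollout(env) for _ in range(n_real_episodes)]
        world_model.fit(real_data)

        # 2. "Dream": generate imagined rollouts from the learned model.
        dreamed_data = [agent.rollout(world_model) for _ in range(n_dream_episodes)]

        # 3. Train on both; anchoring on real data limits the risk of the
        #    training set being dominated by spurious imagined trajectories.
        agent.train(real_data + dreamed_data)
```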


2019 ◽  
Vol 123 (12) ◽  
pp. 6941-6957 ◽  
Author(s):  
Henry Chan ◽  
Badri Narayanan ◽  
Mathew J. Cherukara ◽  
Fatih G. Sen ◽  
Kiran Sasikumar ◽  
...  

Author(s):  
Tomasz Kajdanowicz ◽  
Slawomir Plamowski ◽  
Przemyslaw Kazienko

Choosing a proper training set for machine learning tasks is of great importance in complex domain problems. In this paper, a new distance measure for training set selection is presented and thoroughly discussed. The distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real domain datasets from a debt portfolio valuation process. Finally, prediction performance is examined.
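A sketch of one plausible reading of such a measure, in which each dataset is clustered, the entropy of the class distribution within each cluster is computed, and datasets are compared through the variance of those entropies (the paper's exact formulation may differ):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import KMeans

def entropy_variance(X, y, n_clusters=5, seed=0):
    """Variance of per-cluster label entropies for one dataset."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    entropies = []
    for c in range(n_clusters):
        _, counts = np.unique(y[labels == c], return_counts=True)
        entropies.append(entropy(counts / counts.sum()))
    return np.var(entropies)

def dataset_distance(X1, y1, X2, y2, n_clusters=5):
    """Distance between two datasets as the gap between their entropy variances."""
    return abs(entropy_variance(X1, y1, n_clusters) - entropy_variance(X2, y2, n_clusters))
```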


2021 ◽  
Author(s):  
Roman Zubatyuk ◽  
Justin Smith ◽  
Benjamin T. Nebgen ◽  
Sergei Tretiak ◽  
Olexandr Isayev

Physics-inspired Artificial Intelligence (AI) is at the forefront of methods development in molecular modeling and computational chemistry. In particular, interatomic potentials derived with machine learning algorithms such as Deep Neural Networks (DNNs) achieve the accuracy of high-fidelity quantum mechanical (QM) methods in areas traditionally dominated by empirical force fields and allow massive simulations. The applicability domain of DNN potentials is usually limited by the type of training data. As such, transferable models aim to be extensible in describing the chemical and conformational diversity of organic molecules. However, most DNN potentials, such as the AIMNet model we proposed previously, were parametrized for neutral molecules or closed-shell ions due to architectural limitations. In this work, we extend the machine learning framework toward open-shell anions and cations. We introduce the AIMNet-NSE (Neural Spin Equilibration) architecture, which, when properly trained, can predict atomic and molecular properties for an arbitrary combination of molecular charge and spin multiplicity. This model explores a new dimension of transferability by adding the charge-spin space. The AIMNet-NSE model reproduces reference QM energies for cations, neutrals, and anions with errors of about 2-3 kcal/mol. The spin charges have errors of about 0.01 electrons for small organic molecules containing nine chemical elements {H, C, N, O, F, Si, P, S and Cl}. The AIMNet-NSE model makes it possible to fully bypass QM calculations and derive the ionization potential, electron affinity, and conceptual Density Functional Theory quantities like electronegativity, hardness, and condensed Fukui functions at a rate of up to 10^4 molecules per second on a single modern GPU. We show that these descriptors, along with the learned atomic representations, can be used to model chemical reactivity through an example of regioselectivity in electrophilic aromatic substitution reactions.
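The conceptual-DFT quantities mentioned above follow from standard finite-difference definitions once energies and atomic charges of the cation, neutral, and anion are available. A minimal sketch (the inputs would come from a model's predictions; nothing here reflects AIMNet-NSE's actual API):

```python
# Conceptual-DFT quantities from energies and per-atom charges of the cation,
# neutral, and anion (standard finite-difference definitions).

def conceptual_dft(E_cation, E_neutral, E_anion, q_cation, q_neutral, q_anion):
    ip = E_cation - E_neutral          # ionization potential
    ea = E_neutral - E_anion           # electron affinity
    chi = 0.5 * (ip + ea)              # Mulliken electronegativity
    eta = 0.5 * (ip - ea)              # chemical hardness (Parr-Pearson)
    # Condensed Fukui functions per atom from atomic-charge differences.
    f_plus = [qn - qa for qn, qa in zip(q_neutral, q_anion)]    # nucleophilic attack
    f_minus = [qc - qn for qc, qn in zip(q_cation, q_neutral)]  # electrophilic attack
    return {"IP": ip, "EA": ea, "chi": chi, "eta": eta,
            "f_plus": f_plus, "f_minus": f_minus}
```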


2018 ◽  
Vol 86 (1) ◽  
Author(s):  
Xin Lei ◽  
Chang Liu ◽  
Zongliang Du ◽  
Weisheng Zhang ◽  
Xu Guo

In the present work, it is intended to discuss how to achieve real-time structural topology optimization (i.e., obtaining the optimized distribution of a certain amount of material in a prescribed design domain almost instantaneously once the objective/constraint functions and external stimuli/boundary conditions are specified), an ultimate dream pursued by engineers in various disciplines, using machine learning (ML) techniques. To this end, the so-called moving morphable component (MMC)-based explicit framework for topology optimization is adopted for generating the training set, and support vector regression (SVR) as well as K-nearest-neighbors (KNN) ML models are employed to establish the mapping between the design parameters characterizing the layout/topology of an optimized structure and the external load. Compared with existing approaches, the proposed approach not only reduces the training data and the dimension of the parameter space substantially, but also has the potential of establishing engineering intuition about optimized structures corresponding to various external loads through the learning process. Numerical examples provided demonstrate the effectiveness and advantages of the proposed approach.
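A hedged sketch of the surrogate step: learning a mapping from external-load parameters to the MMC design parameters of precomputed optimized structures. Random arrays stand in for the training pairs, which in practice would be produced by running the MMC-based optimizer offline:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor

# Placeholder training pairs: load parameters -> optimized MMC design parameters.
loads = np.random.rand(200, 2)       # e.g., load position and magnitude
designs = np.random.rand(200, 48)    # MMC component parameters of the optima

svr = MultiOutputRegressor(SVR(kernel="rbf", C=10.0)).fit(loads, designs)
knn = KNeighborsRegressor(n_neighbors=5).fit(loads, designs)

# Near real-time prediction of an optimized layout for a new load case.
new_load = np.array([[0.4, 0.7]])
design_svr = svr.predict(new_load)
design_knn = knn.predict(new_load)
```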


2019 ◽  
Vol 11 (22) ◽  
pp. 2596 ◽  
Author(s):  
Luca Zappa ◽  
Matthias Forkel ◽  
Angelika Xaver ◽  
Wouter Dorigo

Agricultural and hydrological applications could greatly benefit from soil moisture (SM) information at sub-field resolution and (sub-)daily revisit time. However, current operational satellite missions provide soil moisture information at either lower spatial or lower temporal resolution. Here, we downscale coarse-resolution (25–36 km) satellite SM products with quasi-daily resolution to the field scale (30 m) using the random forest (RF) machine learning algorithm. RF models are trained with remotely sensed SM and ancillary variables on soil texture, topography, and vegetation cover against SM measured in the field. The approach is developed and tested in an agricultural catchment equipped with a high-density network of low-cost SM sensors. Our results show a strong consistency between the downscaled and observed SM spatio-temporal patterns. We found that topography has higher predictive power for downscaling than soil texture, due to the hilly landscape of the study area. Furthermore, including a proxy of vegetation cover results in considerable improvements in performance. Increasing the training set size leads to a significant gain in model skill, and expanding the training set is likely to further enhance the accuracy. When only limited in-situ measurements are available as training data, increasing the number of sensor locations should be favored over extending the duration of the measurements for improved downscaling performance. In this regard, we show the potential of low-cost sensors as a practical and cost-effective solution for gathering the necessary observations. Overall, our findings highlight the suitability of using ground measurements in conjunction with machine learning to derive high spatially resolved SM maps from coarse-scale satellite products.
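A hedged sketch of the downscaling step: a random forest trained to predict in-situ soil moisture from the coarse satellite product plus static ancillary predictors available at 30 m (column names and random values are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative training table: one row per sensor location and time step.
df = pd.DataFrame({
    "sm_coarse": np.random.rand(1000),      # 25-36 km satellite soil moisture
    "elevation": np.random.rand(1000),      # topography
    "slope": np.random.rand(1000),
    "clay_fraction": np.random.rand(1000),  # soil texture
    "ndvi": np.random.rand(1000),           # proxy of vegetation cover
    "sm_insitu": np.random.rand(1000),      # target: field-scale sensor reading
})

features = ["sm_coarse", "elevation", "slope", "clay_fraction", "ndvi"]
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(df[features], df["sm_insitu"])

# Applying the model to every 30 m pixel yields the downscaled soil moisture map.
sm_30m = rf.predict(df[features])
```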


Author(s):  
Andrey Parasich ◽  
Victor Parasich ◽  
Irina Parasich

Introduction: Proper training set formation is a key factor in machine learning. In real training sets, problems and errors commonly occur and have a critical impact on the training result. A training set needs to be formed in every machine learning problem; therefore, knowledge of possible difficulties is helpful. Purpose: To give an overview of possible problems in the formation of a training set, in order to facilitate their detection and elimination when working with real training sets, and to analyze the impact of these problems on training results. Results: The article gives an overview of possible errors in training set formation, such as lack of data, imbalance, false patterns, sampling from a limited set of sources, change in the general population over time, and others. We discuss the influence of these errors on the result of the training, test set formation, and the measurement of training algorithm quality. Pseudo-labeling, data augmentation, and hard-sample mining are considered the most effective ways to expand a training set. We offer practical recommendations for the formation of a training or test set. Examples from the practice of Kaggle competitions are given. For the problem of cross-dataset generalization in neural network training, we propose an algorithm called Cross-Dataset Machine, which is simple to implement and yields a gain in cross-dataset generalization. Practical relevance: The materials of the article can be used as a practical guide in solving machine learning problems.
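A hedged sketch of pseudo-labeling, one of the training-set expansion techniques mentioned above: confident predictions on unlabeled data are added to the labeled pool and the model is retrained (the classifier and the 0.9 confidence threshold are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    # 1. Fit an initial model on the labeled data only.
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_labeled, y_labeled)

    # 2. Keep only the unlabeled samples the model is confident about.
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = proba.argmax(axis=1)[confident]

    # 3. Expand the training set with the pseudo-labeled samples and retrain.
    X_new = np.vstack([X_labeled, X_unlabeled[confident]])
    y_new = np.concatenate([y_labeled, model.classes_[pseudo_y]])
    model.fit(X_new, y_new)
    return model
```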


The problem statement with respect to company organizations, institutions, and students is that companies take a long time to recruit, which is a major challenge for them, and there is no dedicated platform for recruiting candidates on preferred qualifications. Institutions are unable to achieve 100% placement among eligible students and do not provide proper training on minimum and preferred qualifications. Candidates are unable to get targeted training from the college organization; the college should train candidates in the areas where they lag behind and strengthen them in preferred qualifications and all other aspects. “52% of Talent Acquisition leaders say the hardest part of recruitment is screening candidates from a large applicant pool”. Screening students from a large pool often takes up the largest portion of recruitment time. Moreover, college organizations are not training students effectively based on company requirements. An analysis of each student must be carried out to identify where the student is failing to secure a placement, and companies do not know the personality of a student while recruiting. To solve this bottleneck in recruiting, we created this automation tool. Its main process determines whether a candidate is qualified based on minimum qualifications such as CGPA, certifications, completed projects, and internships. The project has two main goals: 1. To decide whether to move a student forward to an interview or to reject them. 2. To let the college organization give additional training to students who were rejected over small issues such as communication, programming, or aptitude… This process is based on minimum qualifications and preferred qualifications, both of which are useful to recruiters. These qualifications can include project experience, education, skills and knowledge, personality traits, and competencies. Minimum qualifications are the mandatory qualifications that the company requires, while preferred qualifications are not mandatory but make a student stronger than other students. Personality and technical knowledge can be assessed accurately by the faculty, mentor, and H.O.D.
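A minimal sketch of the screening rule described above: a candidate moves to the interview stage only if all minimum qualifications are met, while preferred qualifications contribute to a ranking score (all thresholds and field names are illustrative, not the tool's actual criteria):

```python
# Illustrative screening rule: minimum qualifications gate the interview,
# preferred qualifications produce a ranking score.
MIN_CGPA = 7.0

def screen(candidate):
    """candidate: dict with keys cgpa, projects, internships, certifications, etc."""
    meets_minimum = (
        candidate["cgpa"] >= MIN_CGPA
        and candidate["projects"] >= 1
        and candidate["internships"] >= 1
    )
    preferred_score = (
        candidate.get("certifications", 0)
        + candidate.get("communication", 0)
        + candidate.get("aptitude", 0)
    )
    return ("interview" if meets_minimum else "training"), preferred_score

decision, score = screen({"cgpa": 8.1, "projects": 2, "internships": 1,
                          "certifications": 3, "communication": 4, "aptitude": 5})
```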

