A GPU-Accelerated Machine Learning Framework for Molecular Simulation: HOOMD-blue with TensorFlow

2019
Author(s):  
Rainier Barrett ◽  
Maghesree Chakraborty ◽  
Dilnoza Amirkulova ◽  
Heta Gandhi ◽  
Andrew White

As interest grows in applying machine learning force fields and methods to molecular simulation, there is a need for state-of-the-art inference methods to use trained models within efficient molecular simulation engines. We have designed and implemented software that enables integration of a scalable GPU-accelerated molecular mechanics engine, HOOMD-blue, with the machine learning (ML) package TensorFlow. TensorFlow is a GPU-accelerated, scalable, graph-based tensor computation model-building package that has been used to implement many recent innovations in deep learning and other ML tasks. TensorFlow models are constructed in Python and can be visualized or debugged using the rich set of tools implemented in the TensorFlow package. In this article, we present four major examples of tasks this software can accomplish, each of which would normally require multiple different tools: (1) we train a neural network to reproduce a force field of a Lennard-Jones simulation; (2) we perform online force matching of methanol; (3) we compute the maximum entropy bias of a Lennard-Jones collective variable; (4) we calculate the scattering profile of an ongoing TIP4P water molecular dynamics simulation. This work should accelerate both the design of new neural-network-based models in computational chemistry research and reproducible model specification by leveraging a widely used ML package.
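
To make example (1) concrete, here is a minimal sketch of fitting a small neural network to Lennard-Jones pair energies and recovering forces by automatic differentiation. This is plain TensorFlow/Keras in reduced LJ units, not the plugin's actual interface; the network size and training settings are arbitrary.

```python
import numpy as np
import tensorflow as tf

# Synthetic training data: pair distances r and LJ energies U(r) = 4(r^-12 - r^-6)
r = np.random.uniform(0.9, 3.0, size=(4096, 1)).astype(np.float32)
u = 4.0 * (r ** -12 - r ** -6)

# Small fully connected network mapping distance -> pair energy
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(1,)),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(r, u, epochs=50, batch_size=256, verbose=0)

# The pair force follows by automatic differentiation: F(r) = -dU/dr
r_test = tf.constant([[1.5]])
with tf.GradientTape() as tape:
    tape.watch(r_test)
    u_pred = model(r_test)
force = -tape.gradient(u_pred, r_test)
```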


2019
Author(s):  
Rainier Barrett ◽  
Maghesree Chakraborty ◽  
Dilnoza Amirkulova ◽  
Heta Gandhi ◽  
Andrew White

We have designed and implemented software that enables integration of a scalable GPU-accelerated molecular mechanics engine, HOOMD-blue, with the machine learning (ML) package TensorFlow. TensorFlow is a GPU-accelerated, scalable, graph-based tensor computation model-building package that has been used to implement many recent innovations in deep learning and other ML tasks. Tensor computation graphs allow for the specification of robust, flexible, and easily replicated computational models for a variety of tasks. Our plugin leverages the generality and speed of computational tensor graphs in TensorFlow to enable four previously challenging tasks in molecular dynamics: (1) the calculation of arbitrary force fields, including neural-network-based, stochastic, and/or automatically generated force fields obtained by differentiating potential functions; (2) the efficient computation of arbitrary collective variables; (3) the biasing of simulations via automatic differentiation of collective variables, and consequently the implementation of many free energy biasing methods; (4) ML on any of the above tasks, including coarse-grained force fields, on-the-fly learned biases, and collective variable calculations. TensorFlow models are constructed in Python and can be visualized or debugged using the rich set of tools implemented in the TensorFlow package. In this article, we present examples of the four major tasks this method can accomplish, provide benchmark data, and describe the architecture of our implementation. This method should lead to both the design of new models in computational chemistry research and reproducible model specification without requiring recompilation or writing low-level code.
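
As a hedged illustration of task (3), the following sketch shows how a harmonic bias along a collective variable yields per-particle bias forces through automatic differentiation. It uses plain TensorFlow with randomly placed particles; the plugin's actual interface and the collective variables used in the paper may differ.

```python
import tensorflow as tf

positions = tf.Variable(tf.random.uniform((64, 3)))  # hypothetical particle positions
k, s0 = 10.0, 0.5  # harmonic bias strength and target CV value (arbitrary)

with tf.GradientTape() as tape:
    # Collective variable: radius of gyration of the particle cloud
    center = tf.reduce_mean(positions, axis=0)
    rg = tf.sqrt(tf.reduce_mean(tf.reduce_sum((positions - center) ** 2, axis=1)))
    bias_energy = 0.5 * k * (rg - s0) ** 2

# Per-particle bias forces are minus the gradient of the bias energy
bias_forces = -tape.gradient(bias_energy, positions)
```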


2019
Author(s):  
Ryther Anderson ◽  
Achay Biong ◽  
Diego Gómez-Gualdrón

Tailoring the structure and chemistry of metal-organic frameworks (MOFs) enables the manipulation of their adsorption properties to suit specific energy and environmental applications. As there are millions of possible MOFs (with tens of thousands already synthesized), molecular simulation, such as grand canonical Monte Carlo (GCMC), has frequently been used to rapidly evaluate the adsorption performance of a large set of MOFs. This allows subsequent experiments to focus only on a small subset of the most promising MOFs. In many instances, however, even molecular simulation becomes prohibitively time consuming, underscoring the need for alternative screening methods, such as machine learning, to precede molecular simulation efforts. In this study, as a proof of concept, we trained a neural network as the first example of a machine learning model capable of predicting full adsorption isotherms of different molecules not included in the training of the model. To achieve this, we trained our neural network only on alchemical species, represented only by their geometry and force field parameters, and used this neural network to predict the loadings of real adsorbates. We focused on predicting room-temperature adsorption of small (one- and two-atom) molecules relevant to chemical separations, namely argon, krypton, xenon, methane, ethane, and nitrogen. However, we also observed surprisingly promising predictions for more complex molecules, whose properties are outside the range spanned by the alchemical adsorbates. Prediction accuracies suitable for large-scale screening were achieved using simple MOF descriptors (e.g., geometric properties and chemical moieties) and adsorbate descriptors (e.g., force field parameters and geometry). Our results illustrate a new philosophy of training that opens the path towards the development of machine learning models that can predict the adsorption loading of any new adsorbate at any new operating conditions in any new MOF.
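
A minimal sketch of the kind of model the study describes: a feed-forward network mapping MOF descriptors, adsorbate force field parameters, and pressure to a loading. The feature counts, layer sizes, and names here are hypothetical, not the paper's actual descriptor set.

```python
import tensorflow as tf

n_mof_feats, n_ads_feats = 8, 4  # e.g., pore geometry / LJ epsilon-sigma (placeholders)
inputs = tf.keras.Input(shape=(n_mof_feats + n_ads_feats + 1,))  # +1 for log-pressure
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
loading = tf.keras.layers.Dense(1, activation="softplus")(x)  # loadings are non-negative

model = tf.keras.Model(inputs, loading)
model.compile(optimizer="adam", loss="mse")
# Train on alchemical species only, then predict full isotherms for real
# adsorbates by sweeping the pressure input:
# model.fit(X_alchemical, y_gcmc_loadings, epochs=100)
```

The softplus output is one simple way to keep predicted loadings non-negative; evaluating the model at several pressures per MOF-adsorbate pair traces out a full isotherm.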


In a large distributed virtualized environment, predicting the alerting source from its text seems to be a daunting task. This paper explores the option of using machine learning algorithms to solve this problem. Unfortunately, our training dataset is highly imbalanced: 96% of alerting data is reported by 24% of alerting sources. This is the expected dataset in any live distributed virtualized environment, where new device versions will have relatively few alerts compared to older devices. Any classification effort with such an imbalanced dataset presents a different set of challenges compared to binary classification. This type of skewed data distribution makes conventional machine learning less effective, especially when predicting the minority device-type alerts. Our challenge is to build a robust model which can cope with this imbalanced dataset and achieve a relatively high level of prediction accuracy. This research work started with traditional regression and classification algorithms using a bag-of-words model. Then word2vec and doc2vec models were used to represent the words in vector form, which preserves the semantic meaning of the sentence; with this, alerting texts with similar messages have similar vector representations. This vectorized alerting text was used with logistic regression for model building. This yielded better accuracy, but the model was relatively complex and demanded more computational resources. Finally, a simple neural network was used for this multi-class text classification problem using the Keras and TensorFlow libraries. A simple two-layered neural network yielded 99% accuracy, even though our training dataset was not balanced. This paper goes through a qualitative evaluation of the different machine learning algorithms and their respective results. Finally, the two-layered deep learning algorithm is selected as the final solution, since it takes relatively fewer resources and less time while achieving better accuracy.
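
A minimal sketch of the final approach described above: a small two-layer network over vectorized alert text. The vector dimension and class count are placeholders, and the word2vec/doc2vec step is assumed to have already produced fixed-length document vectors.

```python
import tensorflow as tf

vec_dim, n_classes = 300, 50  # hypothetical doc2vec dimension and device-type count

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(vec_dim,)),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Class weights are one common way to counter the imbalance noted above:
# model.fit(doc_vectors, device_labels, epochs=10, class_weight=class_weights)
```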


2021
Author(s):  
Humera Rafique ◽  
Tariq Javid

The greatest challenge of machine learning problems is to select suitable techniques and resources, such as tools and datasets. Despite millions of speakers around the globe and a rich literary history of more than a thousand years, computational linguistics work on the Punjabi Shahmukhi script, a low-resource member of the Perso-Arabic family of context-specific scripts, remains hard to find. This paper presents a deep insight into the related work with summary statistics, documenting the popularity and success of artificial neural networks and related techniques. The paper draws on recent trends from authoritative sources, including feedback from leading researchers and coverage of machine learning frameworks. A comprehensive comparison of the two most popular deep learning techniques, the convolutional neural network and the recursive neural network, is presented for cursive context-specific scripts of a Perso-Arabic nature. An overview of the available benchmark datasets for machine learning problems, especially for the Perso-Arabic group, is included. This paper provides essential knowledge for researchers in the machine learning and natural language processing disciplines on the selection of algorithms, architectures, and resources.
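
For orientation only, a sketch of the convolutional side of the comparison: a small CNN classifying grayscale character images from a cursive script. The image size and class count are hypothetical; the architectures in the surveyed papers differ.

```python
import tensorflow as tf

n_classes = 40  # hypothetical number of character classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),  # grayscale character image
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```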


2019
Author(s):  
Emmanuel L.C. de los Santos

ABSTRACT Significant progress has been made in the past few years on the computational identification of biosynthetic gene clusters (BGCs) that encode ribosomally synthesized and post-translationally modified peptides (RiPPs). This is done by identifying both RiPP tailoring enzymes (RTEs) and RiPP precursor peptides (PPs). However, identification of PPs, particularly for novel RiPP classes, remains challenging. To address this, machine learning has been used to accurately identify PP sequences. However, current machine learning tools have limitations, since they are specific to the RiPP class they are trained for and are context-dependent, requiring information about the surrounding genetic environment of the putative PP sequences. NeuRiPP overcomes these limitations. It does this by leveraging the rich data set of high-confidence putative PP sequences from existing programs, along with experimentally verified PPs from RiPP databases. NeuRiPP uses neural network models that are suitable for peptide classification, with weights trained on PP datasets. It is able to identify known PP sequences and sequences that are likely PPs. When tested on existing RiPP BGC datasets, NeuRiPP is able to identify PP sequences in significantly more putative RiPP clusters than current tools, while maintaining the same HMM hit accuracy. Finally, NeuRiPP was able to successfully identify PP sequences from novel RiPP classes that were recently characterized experimentally, highlighting its utility in complementing existing bioinformatics tools.
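
A sketch of the kind of input encoding a precursor peptide classifier needs: amino acid sequences mapped to integer indices and padded to a fixed length. The alphabet handling and length cap are illustrative, not NeuRiPP's exact choices.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

def encode_peptide(seq, max_len=120):
    """Map a peptide string to a fixed-length integer vector.

    Unknown residues fall back to the padding index 0 in this sketch."""
    idx = [AA_INDEX.get(aa, 0) for aa in seq.upper()[:max_len]]
    return np.array(idx + [0] * (max_len - len(idx)), dtype=np.int32)

x = encode_peptide("MSKITLPLVV")  # shape (120,), ready for an embedding layer
```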


2019
Vol 9 (1)
Author(s):  
Emmanuel L. C. de los Santos

Abstract Significant progress has been made in the past few years on the computational identification of biosynthetic gene clusters (BGCs) that encode ribosomally synthesized and post-translationally modified peptides (RiPPs). This is done by identifying both RiPP tailoring enzymes (RTEs) and RiPP precursor peptides (PPs). However, identification of PPs, particularly for novel RiPP classes, remains challenging. To address this, machine learning has been used to accurately identify PP sequences. Current machine learning tools have limitations, since they are specific to the RiPP class they are trained for and are context-dependent, requiring information about the surrounding genetic environment of the putative PP sequences. NeuRiPP overcomes these limitations. It does this by leveraging the rich data set of high-confidence putative PP sequences from existing programs, along with experimentally verified PPs from RiPP databases. NeuRiPP uses neural network architectures that are suitable for peptide classification, with weights trained on PP datasets. It is able to identify known PP sequences and sequences that are likely PPs. When tested on existing RiPP BGC datasets, NeuRiPP was able to identify PP sequences in significantly more putative RiPP clusters than current tools while maintaining the same HMM hit accuracy. Finally, NeuRiPP was able to successfully identify PP sequences from novel RiPP classes that were recently characterized experimentally, highlighting its utility in complementing existing bioinformatics tools.
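
Building on the encoding sketch above, one plausible classifier over such encoded peptide sequences is an embedding followed by a bidirectional LSTM; the architectures actually evaluated in the paper, and their hyperparameters, may differ.

```python
import tensorflow as tf

max_len, vocab = 120, 21  # padded peptide length; 20 amino acids + padding token

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab, 16, input_length=max_len, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # PP vs. non-PP probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```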


2021
Author(s):  
Eliska Chalupova ◽  
Ondrej Vaculik ◽  
Filip Jozefov ◽  
Jakub Polacek ◽  
Tomas Majtner ◽  
...  

Background: The recent big data revolution in Genomics, coupled with the emergence of Deep Learning as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though Deep Learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is outside the ability of most researchers in the field.

Results: Here we present ENNGene - Easy Neural Network model building tool for Genomics. This tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, allowing simple input such as genomic coordinates. The network architecture is selected and fully customized by the user, from the number and types of the layers to each layer's precise set-up. ENNGene then deals with all steps of training and evaluation of the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots or TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of an attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art while improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein.

Conclusions: As the role of DL in big data analysis in the near future is indisputable, it is important to make it available for a broader range of researchers. We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL to gain better insights into and extract important information from the large amounts of data available in the field.
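
A minimal sketch of the Integrated Gradients attribution mentioned above: average the model's gradients along a straight path from a baseline to the input and scale by the input difference. It assumes a differentiable tf.keras model over a single float input of shape (length, channels), e.g., a one-hot encoded sequence; ENNGene's actual deployment details may differ.

```python
import tensorflow as tf

def integrated_gradients(model, x, baseline=None, steps=50):
    """Attribution scores for a single input x of shape (length, channels)."""
    if baseline is None:
        baseline = tf.zeros_like(x)
    # Interpolated inputs along the straight path from baseline to x
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps), (steps, 1, 1))
    path = baseline[None, ...] + alphas * (x - baseline)[None, ...]
    with tf.GradientTape() as tape:
        tape.watch(path)
        preds = model(path)  # the interpolation steps are treated as one batch
    grads = tape.gradient(preds, path)  # gradient at each interpolation step
    avg_grads = tf.reduce_mean(grads, axis=0)
    return (x - baseline) * avg_grads   # per-position attribution
```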

