Data Augmentation and Pretraining for Template-Based Retrosynthetic Prediction in Computer-Aided Synthesis Planning

Author(s):  
Michael Fortunato ◽  
Connor W. Coley ◽  
Brian Barnes ◽  
Klavs F. Jensen

This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to prioritize reaction templates or molecular transformations are focused on reporting high accuracy metrics for the one-to-one mapping of product molecules in reaction databases to the template extracted from the recorded reaction. The templates selected for inclusion in these machine learning models have previously been limited to those that appear frequently in the reaction databases, which excludes potentially useful transformations. By augmenting open-access datasets of organic reactions with artificially calculated template applicability and pretraining a template relevance neural network on this augmented applicability dataset, we report an increase in the template applicability recall and an increase in the diversity of predicted precursors. The augmentation and pretraining effectively teach the neural network an increased set of templates that could theoretically lead to successful reactions for a given target. Even on a small dataset of well-curated reactions, the data augmentation and pretraining methods resulted in an increase in top-1 accuracy, especially for rare templates, indicating these strategies can be very useful for small datasets.

2020 ◽  

2020 ◽  
Vol 46 (2) ◽  
pp. 55-66
Author(s):  
Bernard Kumi-Boateng ◽  
Yao Yevenyo Ziggah

Machine learning algorithms have emerged as a new paradigm in geoscience computations and applications. The present study aims to assess the suitability of the Group Method of Data Handling (GMDH) for coordinate transformation. The data used for the coordinate transformation constitute the Ghana national triangulation network, which is based on the two horizontal geodetic datums (Accra 1929 and Leigon 1977) utilised for geospatial applications in Ghana. The GMDH result was compared with other standard methods such as Backpropagation Neural Network (BPNN), Radial Basis Function Neural Network (RBFNN), 2D conformal, and 2D affine. It was observed that the proposed GMDH approach is very efficient in transforming coordinates from the Leigon 1977 datum to the official mapping datum of Ghana, i.e. the Accra 1929 datum. It was also found that GMDH could produce comparable and satisfactory results, just like the widely used BPNN and RBFNN. However, the classical transformation methods (2D affine and 2D conformal) performed poorly when compared with the machine learning models (GMDH, BPNN and RBFNN). The computational strength of the machine learning models is attributed to their self-adaptive capability to detect patterns in data sets without assuming functional relationships between the input and output variables. To this end, the proposed GMDH model could be used as a supplementary computational tool alongside the existing transformation procedures used in the Ghana geodetic reference network.
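
For readers unfamiliar with the classical baselines named in this abstract, the following is a minimal sketch of a least-squares 2D conformal (four-parameter similarity) transformation in plain Python. It is an illustration only, not the authors' implementation: the function names, the closed-form centroid solution, and the control-point coordinates are all assumptions introduced here, and the points are synthetic, not Ghana network data.

```python
import math


def fit_conformal_2d(src, dst):
    """Least-squares fit of the 4-parameter model
    X = a*x - b*y + tx,  Y = b*x + a*y + ty
    from matched (x, y) -> (X, Y) control points."""
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched control points")
    n = len(src)
    # Centring both point sets decouples the translation from a and b.
    xc = sum(x for x, _ in src) / n
    yc = sum(y for _, y in src) / n
    Xc = sum(X for X, _ in dst) / n
    Yc = sum(Y for _, Y in dst) / n
    num_a = num_b = denom = 0.0
    for (x, y), (X, Y) in zip(src, dst):
        dx, dy, dX, dY = x - xc, y - yc, X - Xc, Y - Yc
        num_a += dx * dX + dy * dY
        num_b += dx * dY - dy * dX
        denom += dx * dx + dy * dy
    if denom == 0.0:
        raise ValueError("control points must not all coincide")
    a, b = num_a / denom, num_b / denom
    # Recover the translation from the centroids.
    tx = Xc - (a * xc - b * yc)
    ty = Yc - (b * xc + a * yc)
    return a, b, tx, ty


def apply_conformal_2d(params, point):
    """Transform one (x, y) point with fitted parameters (a, b, tx, ty)."""
    a, b, tx, ty = params
    x, y = point
    return (a * x - b * y + tx, b * x + a * y + ty)


def horizontal_rmse(params, src, dst):
    """Root-mean-square horizontal misfit over the control points."""
    total = 0.0
    for p, (X, Y) in zip(src, dst):
        Xp, Yp = apply_conformal_2d(params, p)
        total += (Xp - X) ** 2 + (Yp - Y) ** 2
    return math.sqrt(total / len(src))


# Synthetic (hypothetical) control points, not real survey data.
source_pts = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (750.0, 400.0)]
true_params = (1.00002 * math.cos(0.0003), 1.00002 * math.sin(0.0003), 120.0, -45.0)
target_pts = [apply_conformal_2d(true_params, p) for p in source_pts]

fitted = fit_conformal_2d(source_pts, target_pts)
print("scale:", math.hypot(fitted[0], fitted[1]))
print("rotation (rad):", math.atan2(fitted[1], fitted[0]))
print("RMSE:", horizontal_rmse(fitted, source_pts, target_pts))
```

The conformal model allows only a uniform scale, a rotation, and a shift, so it cannot absorb locally varying distortions between two datums. That limitation is one plausible reason a data-driven model could outperform it, though the abstract itself does not state the cause.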


2021 ◽  
Author(s):  
Shuangxia Ren ◽  
Jill A. Zupetic ◽  
Mohammadreza Tabary ◽  
Rebecca DeSensi ◽  
Mehdi Nouraie ◽  
...  

We created an online calculator using machine learning algorithms to impute the partial pressure of oxygen (PaO2)/fraction of delivered oxygen (FiO2) ratio using the non-invasive peripheral saturation of oxygen (SpO2) and compared the accuracy of the machine learning models we developed to previously published equations. We generated three machine learning algorithms (neural network, regression, and kernel-based methods) using 7 clinical variable features (N=9,900 ICU events) and subsequently 3 features (N=20,198 ICU events) as input into the models. Data from mechanically ventilated ICU patients were obtained from the publicly available Medical Information Mart for Intensive Care (MIMIC III) database and used for analysis. Compared to seven features, three features (SpO2, FiO2 and PEEP) were sufficient to impute PaO2 from the SpO2. Any of the tested machine learning models enabled imputation of PaO2 from the SpO2 with lower error and showed greater accuracy in predicting PaO2/FiO2 < 150 compared to the previously published log-linear and non-linear equations. Imputation using data from an independent validation cohort of ICU patients (N = 133) from 2 hospitals within the University of Pittsburgh Medical Center (UPMC) showed greater accuracy with the neural network and kernel-based machine learning models compared to the previously published non-linear equation.


Author(s):  
Diwakar Naidu ◽  
Babita Majhi ◽  
Surendra Kumar Chandniha

This study focuses on modelling the changes in rainfall patterns in different agro-climatic zones (ACZs) due to climate change through statistical downscaling of large-scale climate variables using machine learning approaches. The potential of three machine learning algorithms, multilayer artificial neural network (MLANN), radial basis function neural network (RBFNN), and least square support vector machine (LS-SVM), has been investigated. The large-scale climate variables are obtained from the National Centre for Environmental Prediction (NCEP) reanalysis product and used as predictors for model development. The proposed machine learning models are applied to generate projected time series of rainfall for the period 2021-2050 using the Hadley Centre coupled model (HadCM3) B2 emission scenario data as predictors. An increasing trend in anticipated rainfall is observed during 2021-2050 in all the ACZs of Chhattisgarh State. Among the machine learning models, RBFNN was found to be the more feasible technique for modelling monthly rainfall in this region.


Author(s):  
Fahem Abu Bakar ◽  
Nazri Mohd Nawi ◽  
Abdulkareem A. Hezam ◽  
...  

The use of Social Network Sites (SNS) is on the rise these days, particularly among the younger generations. Social media sites let users communicate their interests, feelings, and everyday routines. Many studies show that properly utilizing user-generated content (UGC) can aid in determining people's mental health status. UGC could aid in the prediction of mental health conditions, particularly depression, a significant medical condition that impairs one's ability to work, learn, eat, sleep, and enjoy life. Information about a person's mood and negativism can be gathered from their SNS user profile. Therefore, this study uses SNS as a data source and applies machine learning models to screen users and categorize them based on their mental health. The performance of three machine learning models is evaluated to classify the UGC: Decision Forest, Neural Network, and Support Vector Machine (SVM). The results show that the accuracy and recall of the Neural Network model are the same as those of the SVM model (78.27% and 0.042, respectively), but the Neural Network performs better in average precision. This proves that the Neural Network model is the best model for making predictions to determine the level of depression by using social media posts.


2021 ◽  
Author(s):  
Manav Agarwal ◽  
Shreya Venugopal ◽  
Rishab Kashyap ◽  
R Bharathi

The film industry is one of the most popular entertainment industries and one of the biggest markets for business. Among the contributing factors is the success of a movie in terms of its popularity as well as its box office performance. Hence, we present a comprehensive comparison of various machine learning models for predicting the rate of success of a movie. The effectiveness of these models, along with their statistical significance, is studied to conclude which of them is the best predictor. Some insights regarding factors that affect the success of movies are also found. The models studied include some Regression models, Machine Learning models, a Time Series model and a Neural Network, with the Neural Network being the best-performing model at an accuracy of about 86%. Additionally, as part of the testing, data for movies released in 2020 are analysed.


2021 ◽  
pp. 1-15
Author(s):  
O. Basturk ◽  
C. Cetek

In this study, prediction of aircraft Estimated Time of Arrival (ETA) is proposed using machine learning algorithms. Accurate prediction of ETA is important for management of delay and air traffic flow, runway assignment, gate assignment, collaborative decision making (CDM), coordination of ground personnel and equipment, and optimisation of the arrival sequence. Machine learning is able to learn from experience and make predictions with weak assumptions or no assumptions at all. In the proposed approach, general flight information, trajectory data and weather data were obtained from different sources in various formats. Raw data were converted to tidy data and inserted into a relational database. To obtain the features for training the machine learning models, the data were explored, cleaned and transformed into convenient features. New features were also derived from the available data. Random forests and deep neural networks were used to train the machine learning models. Both models can predict the ETA with a mean absolute error (MAE) of less than 6 min after departure, and less than 3 min after terminal manoeuvring area (TMA) entrance. Additionally, a web application was developed to dynamically predict the ETA using the proposed models.
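
The error metric quoted in this abstract can be illustrated with a short sketch. The function name and the arrival times below are hypothetical values chosen for illustration; they are not data or code from the study.

```python
def mean_absolute_error(actual_min, predicted_min):
    """Mean absolute error between actual and predicted times, in minutes."""
    if len(actual_min) != len(predicted_min) or not actual_min:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual_min, predicted_min)) / len(actual_min)


# Hypothetical remaining flight times (minutes) for three flights.
actual = [42.0, 55.5, 38.0]
predicted = [40.0, 58.0, 39.5]

# Absolute errors are 2.0, 2.5 and 1.5 minutes, so the MAE is 2.0 minutes.
print(mean_absolute_error(actual, predicted))
```

Because MAE averages absolute deviations, a few large misses raise it less than they would raise a squared-error metric, which is worth keeping in mind when comparing reported ETA errors.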


Viruses ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 252
Author(s):  
Laura M. Bergner ◽  
Nardus Mollentze ◽  
Richard J. Orton ◽  
Carlos Tello ◽  
Alice Broos ◽  
...  

The contemporary surge in metagenomic sequencing has transformed knowledge of viral diversity in wildlife. However, evaluating which newly discovered viruses pose sufficient risk of infecting humans to merit detailed laboratory characterization and surveillance remains largely speculative. Machine learning algorithms have been developed to address this imbalance by ranking the relative likelihood of human infection based on viral genome sequences, but are not yet routinely applied to viruses at the time of their discovery. Here, we characterized viral genomes detected through metagenomic sequencing of feces and saliva from common vampire bats (Desmodus rotundus) and used these data as a case study in evaluating zoonotic potential using molecular sequencing data. Of 58 detected viral families, including 17 which infect mammals, the only known zoonosis detected was rabies virus; however, additional genomes were detected from the families Hepeviridae, Coronaviridae, Reoviridae, Astroviridae and Picornaviridae, all of which contain human-infecting species. In phylogenetic analyses, novel vampire bat viruses most frequently grouped with other bat viruses that are not currently known to infect humans. In agreement, machine learning models built from only phylogenetic information ranked all novel viruses similarly, yielding little insight into zoonotic potential. In contrast, genome composition-based machine learning models estimated different levels of zoonotic potential, even for closely related viruses, categorizing one out of four detected hepeviruses and two out of three picornaviruses as having high priority for further research. We highlight the value of evaluating zoonotic potential beyond ad hoc consideration of phylogeny and provide surveillance recommendations for novel viruses in a wildlife host which has frequent contact with humans and domestic animals.


2021 ◽  
Author(s):  
Alejandro Celemín ◽  
Diego A. Estupiñan ◽  
Ricardo Nieto

Electrical Submersible Pump (ESP) reliability and run-life analysis has been studied extensively since the technology's development. Current machine learning algorithms make it possible to correlate operational conditions with ESP run-life in order to generate predictions for active and new wells. Four machine learning models are compared to a linear proportional hazards model, used as a baseline for comparison purposes. Proper accuracy metrics for survival analysis problems are calculated on run-life predictions vs. actual values over training and validation data subsets. Results demonstrate that, for small datasets, the baseline model is able to produce more consistent predictions than current machine learning models, with a slight reduction in accuracy. This study demonstrates that the quality of the data and its pre-processing support the current shift from a model-centric to a data-centric approach to machine and deep learning problems.


Author(s):  
Pratyush Kaware

In this paper, a cost-effective sensor is attached to a finger to read finger bend signals and classify them by the degree of bend and the joint about which the finger is bent. Various machine learning algorithms were tested to find the most accurate and consistent classifier. We found that the Support Vector Machine was the algorithm best suited to classifying our data; using it, we were able to predict the live state of a finger, i.e., the degree of bend and the joints involved. The live voltage values from the sensor were transmitted using a NodeMCU micro-controller, converted to digital values, and uploaded to a database for analysis.

