Machine learned calibrations to high-throughput molecular excited state calculations

Author(s):  
Shomik Verma ◽  
Miguel Rivera ◽  
David O. Scanlon ◽  
Aron Walsh

Understanding the excited state properties of molecules provides insights into how they interact with light. These interactions can be exploited to design compounds for photochemical applications, including enhanced spectral conversion of light to increase the efficiency of photovoltaic cells. While chemical discovery is time- and resource-intensive experimentally, computational chemistry can be used to screen large-scale databases for molecules of interest in a procedure known as high-throughput virtual screening. The first step usually involves a high-speed but low-accuracy method to screen large numbers of molecules (potentially millions) so that only the best candidates are evaluated with expensive methods. However, a coarse first-pass screening method can result in high false positive or false negative rates. Therefore, this study uses machine learning to calibrate a high-throughput technique (xTB-sTDA) against a higher-accuracy one (TD-DFT). Testing the calibration model shows a ~5-fold decrease in error in-domain and a ~3-fold decrease out-of-domain. The resulting mean absolute error of ~0.14 eV is in line with previous work in machine learning calibrations and outperforms previous work in linear calibration of xTB-sTDA. We then apply the calibration model to screen a 250k-molecule database and map inaccuracies of xTB-sTDA in chemical space. We also show generalizability of the workflow by calibrating against a higher-level technique (CC2), yielding a similarly low error. Overall, this work demonstrates that machine learning can be used to develop a method that is both cheap and accurate for large-scale excited state screening, enabling accelerated molecular discovery across a variety of disciplines.
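The idea of calibrating a cheap method against a more accurate one can be illustrated with the simplest possible case, a linear least-squares fit. This is only a sketch: the study itself uses machine learning models, the data below are synthetic placeholders rather than xTB-sTDA or TD-DFT values, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_calibration(cheap, reference):
    """Least-squares fit of reference ~ slope * cheap + intercept."""
    slope, intercept = np.polyfit(cheap, reference, deg=1)
    return slope, intercept

def mean_absolute_error(pred, reference):
    """Mean absolute difference between predictions and reference values."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(reference))))

# Synthetic placeholder data (NOT from the study): a "reference" energy in eV
# and a "cheap" estimate with a systematic bias plus random noise.
rng = np.random.default_rng(0)
reference = rng.uniform(2.0, 6.0, size=500)
cheap = 0.8 * reference + 0.3 + rng.normal(0.0, 0.1, size=500)

# Fit on one part of the data, evaluate on a held-out part.
train, test = slice(0, 400), slice(400, 500)
slope, intercept = fit_linear_calibration(cheap[train], reference[train])
calibrated = slope * cheap[test] + intercept

mae_raw = mean_absolute_error(cheap[test], reference[test])
mae_cal = mean_absolute_error(calibrated, reference[test])
print(f"MAE before calibration: {mae_raw:.3f} eV, after: {mae_cal:.3f} eV")
```

Evaluating on held-out data mirrors the in-domain test described in the abstract; an out-of-domain test would instead draw the held-out molecules from a different chemical family.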

2020 ◽  
Author(s):  
Phani Ghanakota ◽  
Pieter Bos ◽  
Kyle Konze ◽  
Joshua Staker ◽  
Gabriel Marques ◽  
...  

The hit identification process usually involves the profiling of millions to more recently billions of compounds either via traditional experimental high throughput screens (HTS) or computational virtual high throughput screens (vHTS). We have previously demonstrated that by coupling reaction-based enumeration, active learning and free energy calculations, a similarly large-scale exploration of chemical space can be extended to the hit-to-lead process. In this work, we augment that approach by coupling large scale enumeration and cloud-based FEP profiling with goal-directed generative machine learning, which results in a higher enrichment of potent ideas compared to large scale enumeration alone, while simultaneously staying within the bounds of a predefined drug-like property space. We are able to achieve this by building the molecular distribution for generative machine learning from the PathFinder rules-based enumeration and optimizing for a weighted sum QSAR based multi-parameter optimization function. We examine the utility of this combined approach by designing potent inhibitors of cyclin-dependent kinase 2 (CDK2) and demonstrate a coupled workflow that can: (1) provide a 6.4-fold enrichment improvement in identifying <10 nM compounds over random selection, and a 1.5-fold enrichment in identifying <10 nM compounds over our previous method, (2) rapidly explore relevant chemical space outside the bounds of commercial reagents, (3) use generative ML approaches to “learn” the SAR from large scale in silico enumerations and generate novel idea molecules for a flexible receptor site that are both potent and within relevant physicochemical space, and (4) produce over 3,000,000 idea molecules and run 2153 FEP simulations, identifying 69 ideas with a predicted IC50 <10 nM and 358 ideas with a predicted IC50 <100 nM.
The reported data suggest that combining reaction-based and generative machine learning for ideation results in a higher enrichment of potent compounds over previously described approaches, and can rapidly accelerate the discovery of novel chemical matter within a predefined potency and property space.
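The fold-enrichment figures quoted above compare hit rates between two selections. A minimal sketch of that arithmetic, assuming enrichment is defined as the ratio of hit rates; the function name and the counts in the usage example are hypothetical and are not taken from the study.

```python
def enrichment_factor(hits_selected, n_selected, hits_baseline, n_baseline):
    """Ratio of the hit rate in a selected set to the hit rate in a baseline set.

    Computed by cross-multiplication so that integer counts give an exact
    numerator and denominator before the single division.
    """
    if n_selected <= 0 or n_baseline <= 0:
        raise ValueError("set sizes must be positive")
    if hits_baseline <= 0:
        raise ValueError("baseline hit count must be positive to form a ratio")
    return (hits_selected * n_baseline) / (n_selected * hits_baseline)

# Hypothetical counts for illustration only: 30 hits among 100 selected
# compounds versus 5 hits among 100 randomly chosen compounds.
example = enrichment_factor(hits_selected=30, n_selected=100,
                            hits_baseline=5, n_baseline=100)
print(f"{example:.1f}-fold enrichment")
```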


Sensors ◽  
2018 ◽  
Vol 18 (12) ◽  
pp. 4348 ◽  
Author(s):  
Wei Liu ◽  
Xin Ma ◽  
Xiao Li ◽  
Yi Pan ◽  
Fuji Wang ◽  
...  

Owing to its non-contact and high-speed nature, vision-based pose measurement is widely used for aircraft performance testing in wind tunnels. However, glass ports are usually used to protect the cameras from the high-speed airflow, which leads to a large measurement error. In this paper, to further improve vision-based pose measurement accuracy, an imaging model that accounts for light refraction at the observation window is proposed. First, a nonlinear camera calibration model considering the refraction introduced by the wind tunnel observation window is established. In addition, a new method for the linear calibration of the normal vector of the glass observation window is presented. Then, combined with the proposed matching method based on a coplanarity constraint, the six pose parameters of the falling target can be calculated. Finally, an experimental setup was established to conduct the pose measurement study in the laboratory, and the results satisfied the application requirements. Experiments verifying the vision measurement accuracy were also performed, and the results indicated that the displacement and angle measurement accuracy increased by approximately 57% and 33.6%, respectively, which showed the high accuracy of the proposed method.
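The role of the window's normal vector in such a model can be illustrated with the vector form of Snell's law. This is a generic sketch and not the imaging model proposed in the paper; the refractive indices, the window orientation, and the function name `refract` are assumptions made for illustration.

```python
import numpy as np

def refract(direction, normal, n1, n2):
    """Refract a ray at a planar interface (vector form of Snell's law).

    direction: ray direction (normalized internally)
    normal:    interface normal pointing back toward the incident medium
    n1, n2:    refractive indices of the incident and transmitting media
    Returns the unit refracted direction, or None on total internal reflection.
    """
    d = np.asarray(direction, dtype=float)
    n = np.asarray(normal, dtype=float)
    d = d / np.linalg.norm(d)
    n = n / np.linalg.norm(n)
    eta = n1 / n2
    cos_i = -np.dot(n, d)                  # cosine of the incidence angle
    sin2_t = eta**2 * (1.0 - cos_i**2)     # squared sine of the refraction angle
    if sin2_t > 1.0:
        return None                        # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

# Assumed values: air on both sides of a glass window whose normal is +z,
# and a ray travelling toward -z at 30 degrees from the normal.
N_AIR, N_GLASS = 1.0, 1.5
window_normal = np.array([0.0, 0.0, 1.0])
theta = np.radians(30.0)
ray_in = np.array([np.sin(theta), 0.0, -np.cos(theta)])

inside_glass = refract(ray_in, window_normal, N_AIR, N_GLASS)
ray_out = refract(inside_glass, window_normal, N_GLASS, N_AIR)
```

For a window with parallel faces, the exit direction equals the entry direction; the ray is only displaced sideways inside the glass, and that displacement depends on the angle between the ray and the window normal.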


2016 ◽  
Author(s):  
Shuya Li ◽  
Fanghong Dong ◽  
Yuexin Wu ◽  
Sai Zhang ◽  
Chen Zhang ◽  
...  

Characterizing the binding behaviors of RNA-binding proteins (RBPs) is important for understanding their functional roles in gene expression regulation. However, current high-throughput experimental methods for identifying RBP targets, such as CLIP-seq and RNAcompete, usually suffer from false positive and false negative issues. Here, we develop a deep boosting based machine learning approach, called DeBooster, to accurately model the binding sequence preferences and identify the corresponding binding targets of RBPs from CLIP-seq data. Comprehensive validation tests have shown that DeBooster can outperform other state-of-the-art approaches in predicting RBP targets and recover false negatives that are common in current CLIP-seq data. In addition, we have demonstrated several new potential applications of DeBooster in understanding the regulatory functions of RBPs, including the binding effects of the RNA helicase MOV10 on mRNA degradation, the influence of different binding behaviors of the ADAR proteins on RNA editing, as well as the antagonizing effect of RBP binding on miRNA repression. Moreover, DeBooster may provide an effective index to investigate the effect of pathogenic mutations in RBP binding sites, especially those related to splicing events. We expect that DeBooster will be widely applied to analyze large-scale CLIP-seq experimental data and can provide a practically useful tool for novel biological discoveries in understanding the regulatory mechanisms of RBPs.


2019 ◽  
Vol 109 (2) ◽  
pp. 318-325 ◽  
Author(s):  
Francesca Nicolì ◽  
Carmine Negro ◽  
Eliana Nutricati ◽  
Marzia Vergine ◽  
Alessio Aprile ◽  
...  

Monitoring Xylella fastidiosa is critical for eradicating or at least containing this harmful pathogen. New low-cost and rapid methods for early detection capability are very much needed. Metabolomics may play a key role in diagnosis; in fact, mobile metabolites could avoid errors in sampling due to erratically distributed pathogens. Of the various different mobile signals, we studied dicarboxylic azelaic acid (AzA) which is a key molecule for biotic stress plant response but has not yet been associated with pathogens in olive trees. We found that infected Olea europaea L. plants of cultivars Cellina di Nardò (susceptible to X. fastidiosa) and Leccino (resistant to the pathogen) showed an increase in AzA accumulation in leaf petioles and in sprigs by approximately seven- and sixfold, respectively, compared with plants negative to X. fastidiosa or affected by other pathogens. No statistically significant variation was found between the X. fastidiosa population level and the amount of AzA in either of the plant tissues, suggesting that AzA accumulation was almost independent of the amount of pathogen in the sample. Furthermore, the association of AzA with X. fastidiosa seemed to be reliable for samples judged as potentially false-negative by quantitative polymerase chain reaction (cycle threshold [Ct] > 33), considering both the absolute value of AzA concentration and the values normalized on negative samples, which diverged significantly from control plants. The accumulation of AzA in infected plants was partially supported by the differential expression of two genes (named OeLTP1 and OeLTP2) encoding lipid transport proteins (LTPs), which shared a specific domain with the LTPs involved in AzA activity in systemic acquired resistance in other plant species. The expression level of OeLTP1 and OeLTP2 in petiole samples showed significant upregulation in samples positive to X. fastidiosa of both cultivars, with higher expression levels in positive samples of Cellina di Nardò compared with Leccino, whereas the two transcripts had a low expression level (Ct > 40) in negative samples of the susceptible cultivar. Although the results derived from the quantification of AzA cannot confirm the presence of the erratically distributed X. fastidiosa, which can be definitively assessed by traditional methods, we believe they represent a fast and cheap screening method for large-scale monitoring.


Inventions ◽  
2019 ◽  
Vol 4 (4) ◽  
pp. 72
Author(s):  
Ryota Sawaki ◽  
Daisuke Sato ◽  
Hiroko Nakayama ◽  
Yuki Nakagawa ◽  
Yasuhito Shimada

Background: Zebrafish are efficient animal models for conducting whole organism drug testing and toxicological evaluation of chemicals. They are frequently used for high-throughput screening owing to their high fecundity. Peripheral experimental equipment and analytical software are required for zebrafish screening, which need to be further developed. Machine learning has emerged as a powerful tool for large-scale image analysis and has been applied in zebrafish research as well. However, its use by individual researchers is restricted due to the cost and the procedure of machine learning for specific research purposes. Methods: We developed a simple and easy method for zebrafish image analysis, particularly fluorescent labelled ones, using the free machine learning program Google AutoML. We performed machine learning using vascular- and macrophage-Enhanced Green Fluorescent Protein (EGFP) fishes under normal and abnormal conditions (treated with anti-angiogenesis drugs or by wounding the caudal fin). Then, we tested the system using a new set of zebrafish images. Results: While machine learning can detect abnormalities in the fish in both strains with more than 95% accuracy, the learning procedure needs image pre-processing for the images of the macrophage-EGFP fishes. In addition, we developed a batch uploading software, ZF-ImageR, for Windows (.exe) and MacOS (.app) to enable high-throughput analysis using AutoML. Conclusions: We established a protocol to utilize conventional machine learning platforms for analyzing zebrafish phenotypes, which enables fluorescence-based, phenotype-driven zebrafish screening.


2021 ◽  
Author(s):  
Ryan Kingsbury ◽  
Ayush Gupta ◽  
Christopher Bartel ◽  
Jason Munro ◽  
Shyam Dwaraknath ◽  
...  

Computational materials discovery efforts utilize hundreds or thousands of density functional theory (DFT) calculations to predict material properties. Historically, such efforts have performed calculations at the generalized gradient approximation (GGA) level of theory due to its efficient compromise between accuracy and computational reliability. However, high-throughput calculations at the higher metaGGA level of theory are becoming feasible. The Strongly Constrained and Appropriately Normed (SCAN) metaGGA functional offers superior accuracy to GGA across much of chemical space, making it appealing as a general-purpose metaGGA functional, but it suffers from numerical instabilities that impede its use in high-throughput workflows. The recently developed r2SCAN metaGGA functional promises accuracy similar to SCAN in addition to more robust numerical performance. However, its performance compared to SCAN has yet to be evaluated over a large group of solid materials. In this work, we compared r2SCAN and SCAN predictions for key properties of approximately 6,000 solid materials using a newly developed high-throughput computational workflow. We find that r2SCAN predicts formation energies more accurately than SCAN and PBEsol for both strongly and weakly bound materials and that r2SCAN predicts systematically larger lattice constants than SCAN. We also find that r2SCAN requires modestly fewer computational resources than SCAN and offers significantly more reliable convergence. Thus, our large-scale benchmark confirms that r2SCAN has delivered on its promises of numerical efficiency and accuracy, making it a preferred choice for high-throughput metaGGA calculations.


2022 ◽  
Author(s):  
Yizhe Zhang ◽  
Jeremy J Agresti ◽  
Yu Zheng ◽  
David A Weitz

A restriction endonuclease (RE) is an enzyme that can recognize a specific DNA sequence and cleave that DNA into fragments with double-stranded breaks. This sequence-specific cleaving ability and its ease of use have made REs commonly used tools in molecular biology since their first isolation and characterization in the 1970s. While artificial REs still face many challenges in large-scale synthesis and precise activity control for practical use, searching for new REs in natural samples remains a viable route for expanding the RE pool for fundamental research and industrial applications. In this paper, we propose a new strategy to search for REs efficiently. Briefly, we construct a host bacterial cell to link the RE genotype to the phenotype of β-galactosidase expression based on the bacterial SOS response, and use a high-throughput microfluidic platform to isolate, detect and sort the REs. We employ this strategy to screen for the XbaI gene from constructed libraries of varied sizes. In a single round of sorting, a 30-fold target enrichment was obtained within 1 h. The direct screening approach we propose shows potential for efficient search of desirable REs in natural samples compared to the conventional RE-screening method, and is amenable to being adapted to high-throughput screening of other genotoxic targets.

