Machine Learning the Redox Potentials of Phenazine Derivatives: A Comparative Study on Molecular Features

Author(s):  
Siddharth Ghule ◽  
Sayan Bagchi ◽  
Kumar Vanka

Electricity generation is a major contributor to greenhouse gas emissions. Energy storage systems available today have a combined capacity to store less than 1% of the electricity consumed worldwide. Redox Flow Batteries (RFBs) are promising candidates for green and efficient energy storage systems. RFBs are already used in renewable energy systems, but their widespread adoption is limited by the high production costs and toxicity associated with transition-metal-based redox-active species. Therefore, cheaper and greener organic redox-active species are being investigated. Recent reports have shown that organic molecules based on phenazine are promising candidates for redox-active species in RFBs. However, the large number of available organic compounds makes conventional experimental and DFT methods impractical for screening thousands of molecules in a reasonable amount of time. In contrast, machine-learning models have low development time, short prediction time, and high accuracy, and are therefore being heavily investigated for virtual screening applications. In this work, we developed machine-learning models to predict the redox potential of phenazine derivatives in DME solvent using a small dataset of 185 molecules. 2D, 3D, and Molecular Fingerprint features were computed using readily available and easy-to-use Python libraries, making our approach easily adaptable to similar work. Twenty linear and non-linear machine-learning models were investigated. These models achieved excellent performance on the unseen data (i.e., R² > 0.98, MSE < 0.008 V², and MAE < 0.07 V). Model performance was assessed in a consistent manner using the training and evaluation pipeline developed in this work. We showed that 2D molecular features are the most informative and achieve the best prediction accuracy among the four feature sets.
We also showed that linear models, which are often less preferred but relatively fast, can perform better than non-linear models when the feature set contains different types of features (i.e., 2D, 3D, and Molecular Fingerprints). Further investigations revealed that training and inference time can be reduced without sacrificing prediction accuracy by using a small subset of features. Moreover, the models predicted the previously reported promising redox-active compounds with high accuracy. Low prediction errors were also observed across the functional groups: although some functional groups had only one compound in the training set, the best-performing models achieved errors (MAPE) of less than 10%. The major source of error was a lack of data near zero and in the positive region. This work therefore shows that it is possible to develop accurate machine-learning models that could potentially screen millions of compounds in a short amount of time with a small training set and a limited number of easy-to-compute features. The results obtained in this report would thus help in the adoption of green energy by accelerating materials discovery for energy storage applications.
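The abstract above reports four regression metrics (R², MSE, MAE, and MAPE). As a minimal sketch of how such metrics are commonly computed, the following uses only the Python standard library. The function name `regression_metrics` and the example potentials are hypothetical and are not taken from the paper; the authors' actual evaluation pipeline is not shown in this listing.

```python
from statistics import mean


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, MAE and MAPE (%) for paired true/predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = mean(e * e for e in errors)       # mean squared error (V^2 for potentials)
    mae = mean(abs(e) for e in errors)      # mean absolute error (V)
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    # MAPE divides by the true value, so it is undefined when a true value is zero.
    if any(t == 0 for t in y_true):
        mape = None
    else:
        mape = 100.0 * mean(abs(e / t) for e, t in zip(errors, y_true))
    return {"r2": r2, "mse": mse, "mae": mae, "mape": mape}


# Made-up redox potentials in volts, for illustration only.
metrics = regression_metrics([-1.0, -2.0, -1.5, -0.5], [-1.1, -1.9, -1.5, -0.6])
print(metrics)
```

Because MAPE divides each error by the true value, it grows quickly for targets close to zero, which is worth keeping in mind when comparing percentage errors across compounds.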

2021 ◽  
Author(s):  
Scott Kulm ◽  
Lior Kofman ◽  
Jason Mezey ◽  
Olivier Elemento

Abstract: A patient’s risk for cancer is usually estimated through simple linear models that sum the effect sizes of proven risk factors. In theory, more advanced machine learning models can be used for the same task. Using data from the UK Biobank, a large prospective health study, we have developed linear and machine learning models for the prediction of 12 different cancer diagnoses within a 10-year time span. We find that the top machine learning algorithm, XGBoost (XGB), trained on 707 features, generated an average area under the receiver operating characteristic curve of 0.736 (with a range of 0.65-0.85). Linear models trained with only 10 features were found to be statistically indistinguishable from the machine learning models in performance. The linear models were significantly more accurate than the prominent QCancer models (p = 0.0019), which are trained on 45 million patient records and available to over 4,000 United Kingdom general practices. The increase in accuracy may be caused by the consideration of often omitted feature types, including survey answers, census records, and genetic information. This approach led to the discovery of significant novel risk features, including self-reported happiness with own health (relevant to 12 cancers), measured testosterone (relevant to 8 cancers), and ICD codes for rehabilitation procedures (relevant to 3 cancers). These ten-feature models can be easily implemented within the clinic, allowing for personalized screening schedules that may increase cancer survival within a population.
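The headline metric in the abstract above is the area under the receiver operating characteristic curve (AUC). A minimal sketch of one standard way to compute it, as the fraction of positive/negative pairs that the model ranks correctly (ties counted as half), is shown below. The function name `roc_auc` and the example labels and scores are hypothetical; this is not the authors' code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count correctly ordered pairs; a tie contributes half a pair.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up risk scores: two cases (label 1) and three controls (label 0).
auc = roc_auc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.3, 0.1])
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; for cohort-scale data a rank-based implementation from an established library would be the usual choice.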


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249285
Author(s):  
Limin Yu ◽  
Alexandra Halalau ◽  
Bhavinkumar Dalal ◽  
Amr E. Abbas ◽  
Felicia Ivascu ◽  
...  

Background: The Coronavirus disease 2019 (COVID-19) pandemic has affected millions of people across the globe. It is associated with a high mortality rate and has created a global crisis by straining medical resources worldwide. Objectives: To develop and validate machine-learning models for prediction of mechanical ventilation (MV) for patients presenting to the emergency room and for prediction of in-hospital mortality once a patient is admitted. Methods: Two cohorts were used for the two different aims. For the first aim, prediction of MV, 1980 COVID-19 patients were enrolled. Data from 1036 patients, including demographics, past smoking and drinking history, past medical history, vital signs at the emergency room (ER), laboratory values, and treatments, were collected for training, and 674 patients were enrolled for validation, using the XGBoost algorithm. For the second aim, prediction of in-hospital mortality, 3491 patients hospitalized via the ER were enrolled. CatBoost, a new gradient-boosting algorithm, was applied for training and validation of the cohort. Results: Older age, higher temperature, increased respiratory rate (RR), and a lower oxygen saturation (SpO2) from the first set of vital signs were associated with an increased risk of MV amongst the 1980 patients in the ER. The model had a high accuracy of 86.2% and a negative predictive value (NPV) of 87.8%. Requirement for MV, a higher RR and body mass index (BMI), and a longer length of stay in the hospital were the major features associated with in-hospital mortality. The second model had a high accuracy of 80% with an NPV of 81.6%. Conclusion: Machine learning models using the XGBoost and CatBoost algorithms can predict the need for mechanical ventilation and mortality with very high accuracy in COVID-19 patients.


2018 ◽  
Vol 20 (47) ◽  
pp. 30006-30020 ◽  
Author(s):  
Wenwen Li ◽  
Yasunobu Ando

Recently, the machine learning (ML) force field has emerged as a powerful atomic simulation approach because of its high accuracy and low computational cost.


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 285-285
Author(s):  
Vanessa Rotondo ◽  
Dan Tulpan ◽  
Katharine M Wood ◽  
Marlene Paibomesai ◽  
Vern R Osborne

Abstract: The objective of this study is to investigate how linear body measurements relate to calf body weight (BW) and can be used to predict it using linear and machine learning models. To meet these objectives, a total of 103 Angus cross calves were enrolled in the study from wk 2 - 8. Calves were weighed weekly, and the following linear measurements were collected weekly: poll to nose, width across the eyes (WE), width across the right ear, neck length, wither height, heart girth (HG), midpiece height (MH), midpiece circumference, midpiece width (MW), midpiece depth (MD), hook height, hook width, pin height, top of pin bones width (PW), width across the ends of pin bones, nose to tail body length, the length between the withers and pins, forearm to hoof, and cannon bone to hoof. These measurements were taken using a commercial soft tape measure and calipers. To assess relationships between traits and to fit a model to predict BW, data were analyzed in the Weka software (The University of Waikato, New Zealand) using both linear regression (LR) and random forest (RF) machine learning models. The models were trained using a 10-fold cross-validation approach. The automatically derived LR model used 11 traits to fit the data to weekly BW (r² = 0.97), where the traits with the highest coefficients were HG, PW, and WE. The RF model further improved the BW predictions (r² = 0.98). Additionally, sex differences were examined. Although the BW model continued to fit well (r² = 0.97), some of the top linear traits differed. The results of this study suggest that linear models built on linear measurements can accurately estimate body weight in beef calves, and that machine learning can further improve the model fit.
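The 10-fold cross-validation mentioned in the abstract above can be illustrated with a small, self-contained sketch. The study itself used Weka with 11 traits; the version below is a Python stand-in that fits a one-predictor least-squares line (heart girth to body weight) and pools the out-of-fold predictions. All names (`fit_line`, `k_fold_r2`) and all data values are hypothetical and synthetic, so the resulting score is not comparable to the reported results.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def k_fold_r2(xs, ys, k=10, seed=0):
    """Pool out-of-fold predictions from k-fold CV and return their R^2."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds covering every index
    preds = [0.0] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = a + b * xs[i]        # predict only samples unseen in training
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic, made-up data: 103 "calves" with weight roughly linear in heart girth.
rng = random.Random(42)
girth_cm = [rng.uniform(80.0, 120.0) for _ in range(103)]
weight_kg = [1.5 * g - 60.0 + rng.gauss(0.0, 3.0) for g in girth_cm]
cv_r2 = k_fold_r2(girth_cm, weight_kg)
print(f"10-fold CV R^2 = {cv_r2:.3f}")
```

Each sample is predicted exactly once by a model that never saw it during fitting, which is what makes the pooled score an estimate of out-of-sample fit rather than training fit.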


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2016
Author(s):  
Claudia Gonzalez Viejo ◽  
Eden Tongson ◽  
Sigfredo Fuentes

Aroma is one of the main attributes that consumers consider when appreciating and selecting a coffee; hence it is considered an important quality trait. However, the most common methods to assess aroma rely on expensive equipment or on human senses through sensory evaluation, which is time-consuming and requires highly trained assessors to avoid subjectivity. Therefore, this study aimed to estimate coffee intensity and aromas using a low-cost and portable electronic nose (e-nose) and machine learning modeling. For this purpose, triplicates of six commercial coffee samples with different intensity levels were used. Two machine learning models were developed based on artificial neural networks using the data from the e-nose as inputs to (i) classify the samples into low, medium, and high intensity (Model 1) and (ii) predict the relative abundance of 45 different aromas (Model 2). Results showed that it is possible to estimate the intensity of coffees with high accuracy (98%; Model 1), as well as to predict the specific aromas with a high correlation coefficient (R = 0.99), and no under- or over-fitting of the models was detected. The proposed contactless, nondestructive, rapid, reliable, and low-cost method proved effective in evaluating volatile compounds in coffee, and it is a potential technique to be applied within all stages of the production process to detect any undesirable characteristics on time and ensure high-quality products.


Author(s):  
Linyan Chen ◽  
Hao Zeng ◽  
Yu Xiang ◽  
Yeqian Huang ◽  
Yuling Luo ◽  
...  

Histopathological images and omics profiles play important roles in the prognosis of cancer patients. Here, we extracted quantitative features from histopathological images to predict molecular characteristics and prognosis, and integrated image features with mutations, transcriptomics, and proteomics data for prognosis prediction in lung adenocarcinoma (LUAD). Patients from The Cancer Genome Atlas (TCGA) were divided into a training set (n = 235) and a test set (n = 235). We developed machine learning models in the training set and estimated their predictive performance in the test set. In the test set, the machine learning models could predict genetic aberrations: ALK (AUC = 0.879), BRAF (AUC = 0.847), EGFR (AUC = 0.855), ROS1 (AUC = 0.848), and transcriptional subtypes: proximal-inflammatory (AUC = 0.897), proximal-proliferative (AUC = 0.861), and terminal respiratory unit (AUC = 0.894) from histopathological images. Moreover, we obtained tissue microarrays from 316 LUAD patients, including four external validation sets. The prognostic model using image features was predictive of overall survival in the test set and the four validation sets, with 5-year AUCs from 0.717 to 0.825. High-risk and low-risk groups stratified by the model showed different survival in the test set (HR = 4.94, p < 0.0001) and three validation sets (HR = 1.64–2.20, p < 0.05). The combination of image features and single omics had greater prognostic power in the test set, such as the histopathology + transcriptomics model (5-year AUC = 0.840; HR = 7.34, p < 0.0001). Finally, the model integrating image features with multi-omics achieved the best performance (5-year AUC = 0.908; HR = 19.98, p < 0.0001). Our results indicated that machine learning models based on histopathological image features could predict genetic aberrations, transcriptional subtypes, and survival outcomes of LUAD patients. The integration of histopathological images and multi-omics may provide better survival prediction for LUAD.


2021 ◽  
Author(s):  
Amin Alibakhshi ◽  
Bernd Hartke

Unraveling challenging problems by machine learning has recently become a hot topic in many scientific disciplines. For developing rigorous machine-learning models to study problems of interest in molecular sciences, translating molecular structures to quantitative representations as suitable machine-learning inputs plays a central role. Many molecular representations, including the state-of-the-art ones, although efficient in studying numerous molecular features, are still sub-optimal in many challenging cases, as discussed in the context of the present research. The main aim of the present study is to introduce the Implicitly Perturbed Hamiltonian (ImPerHam) as a class of versatile representations for more efficient machine learning of challenging problems in molecular sciences. ImPerHam representations are defined as energy attributes of the molecular Hamiltonian, implicitly perturbed by a number of hypothetical or real arbitrary solvents based on continuum solvation models. We demonstrate outstanding performance of machine-learning models based on ImPerHam representations for three diverse and challenging cases: predicting inhibition of the CYP450 enzyme, high-precision and transferable evaluation of the conformational energy of molecular systems, and accurately reproducing solvation free energies for large benchmark sets.


Author(s):  
Christoph M. Kanzler ◽  
Ilse Lamers ◽  
Peter Feys ◽  
Roger Gassert ◽  
Olivier Lambercy

Abstract. Background: A personalized prediction of upper limb neurorehabilitation outcomes in persons with multiple sclerosis (pwMS) promises to optimize the allocation of therapy and to stratify individuals for resource-demanding clinical trials. Previous research identified predictors on a population level through linear models and clinical data, including conventional assessments describing sensorimotor impairments. The objective of this work was to explore the feasibility of providing an individualized and more accurate prediction of rehabilitation outcomes in pwMS by leveraging non-linear machine learning models, clinical data, and digital health metrics characterizing sensorimotor impairments. Methods: Clinical data and digital health metrics were recorded from eleven pwMS undergoing neurorehabilitation. Machine learning models were trained on data recorded pre-intervention. The dependent variables indicated whether or not a considerable improvement on the activity level was observed across the intervention (binary classification), as defined by the Action Research Arm Test (ARAT), Box and Block Test (BBT), or Nine Hole Peg Test (NHPT). Results: In a cross-validation, considerable improvements in ARAT or BBT could be accurately predicted (94% balanced accuracy) by relying only on patient master data. Considerable improvements in NHPT could be accurately predicted (89% balanced accuracy), but this required knowledge about sensorimotor impairments. Assessing these with digital health metrics instead of conventional scales increased the balanced accuracy by +17%. Non-linear machine-learning models improved the predictive accuracy for the NHPT by +25% compared to linear models. Conclusions: This work demonstrates the feasibility of a personalized prediction of upper limb neurorehabilitation outcomes in pwMS using multi-modal data collected before neurorehabilitation and machine learning.
Information from digital health metrics about sensorimotor impairment was necessary to predict changes in dexterous hand control, thereby underlining their potential to provide a more sensitive and fine-grained assessment than conventional scales. Non-linear models outperformed linear ones, suggesting that the commonly assumed linearity of neurorehabilitation is oversimplified. clinicaltrials.gov registration number: NCT02688231
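The results above are reported as balanced accuracy, the mean of the recall on each class, which is a common choice when the two outcome classes are unevenly represented. A minimal sketch follows; the function name `balanced_accuracy` and the eleven example labels are hypothetical and do not reproduce the study's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1/0 or True/False)."""
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # true positives
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # true negatives
    return 0.5 * (tp / pos + tn / neg)


# Made-up outcomes for eleven participants: 3 improved (1), 8 did not (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_no = [0] * 11  # a trivial model that never predicts improvement
plain_accuracy = sum(t == p for t, p in zip(y_true, always_no)) / len(y_true)
print(f"plain accuracy    = {plain_accuracy:.3f}")
print(f"balanced accuracy = {balanced_accuracy(y_true, always_no):.3f}")
```

In this example the trivial model scores well on plain accuracy simply because most labels are negative, while its balanced accuracy stays at chance level.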

