A Physics-Infused Deep Learning Model for the Prediction of Refractive Indices and Its Use for the Large-Scale Screening of Organic Compound Space

2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as high-RI organic molecules for applications in opto-electronics.
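The ranking metric "popular in web search engines" is not named in this excerpt; normalized discounted cumulative gain (NDCG) is the standard such metric, and a minimal sketch of scoring a top-1,000 ranking with it, under that assumption and on synthetic data, might look like this:

```python
import numpy as np

def ndcg_at_k(y_true, y_pred, k=1000):
    """NDCG for the top-k candidates: how close the predicted ranking's
    cumulative gain comes to that of the ideal (ground-truth) ranking."""
    order = np.argsort(y_pred)[::-1][:k]        # rank molecules by predicted RI
    ideal = np.sort(y_true)[::-1][:k]           # best achievable ordering
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(y_true[order] * discounts)     # gain of the model's ranking
    idcg = np.sum(ideal * discounts)            # gain of the ideal ranking
    return dcg / idcg

# Toy usage with hypothetical RI values; a perfect ranking scores 1.0.
rng = np.random.default_rng(0)
y = rng.uniform(1.3, 2.0, size=10_000)
print(ndcg_at_k(y, y + rng.normal(0, 0.01, y.size), k=1000))
```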


2020 ◽  
Author(s):  
Zakhriya Alhassan ◽  
Matthew Watson ◽  
David Budgen ◽  
Riyad Alshammari ◽  
Ali Alessan ◽  
...  

BACKGROUND: Predicting the risk of glycated hemoglobin (HbA1c) elevation can help identify patients at risk of developing serious chronic health problems such as diabetes and cardiovascular disease. Early preventive interventions based on advanced predictive models that use electronic health record (EHR) data can ultimately help such patients achieve better health outcomes.

OBJECTIVE: Our study investigates the performance of predictive models that forecast HbA1c elevation levels, using machine learning approaches on data from current and previous EHR visits of patients who had not previously been diagnosed with any type of diabetes.

METHODS: This study employed one statistical model, three commonly used conventional machine learning models, and a deep learning model to predict patients' current HbA1c levels. For the deep learning model, we also integrated current visit data with historical (longitudinal) data from previous visits. Explainable machine learning methods were used to interrogate the models and to understand the reasons behind their decisions. All models were trained and tested on a large, naturally balanced dataset from Saudi Arabia with 18,844 unique patient records.

RESULTS: The machine learning models achieved the best results for predicting current HbA1c elevation risk. The deep learning model outperformed the statistical and conventional machine learning models on all reported measures when employing time-series data. The best-performing model was the multilayer perceptron (MLP), which achieved an accuracy of 74.52% when used with historical data.

CONCLUSIONS: This study shows that machine learning models can provide promising results for predicting current HbA1c levels. For deep learning in particular, using the patient's longitudinal time-series data improved performance and changed the relative importance of the predictors. The models showed robust results that were consistent with comparable studies.
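As a concrete illustration of the conventional-model baseline described here, a minimal scikit-learn sketch of an MLP classifier on tabular visit features follows; the feature count, preprocessing, and hyperparameters are illustrative assumptions, and the data are synthetic stand-ins rather than the study's EHR records:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Stand-in EHR feature matrix: one row per patient, columns such as age,
# BMI, and lab values from current and previous visits (all synthetic here).
rng = np.random.default_rng(42)
X = rng.normal(size=(18_844, 20))
y = rng.integers(0, 2, size=18_844)      # 1 = elevated HbA1c, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    StandardScaler(),                    # scale features before the MLP
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=0),
)
clf.fit(X_train, y_train)
print(f"accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2%}")
```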


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Bader Alouffi ◽  
Abdullah Alharbi ◽  
Radhya Sahal ◽  
Hager Saleh

Fake news is challenging to detect because it mixes accurate and inaccurate information from reliable and unreliable sources. Social media is not a trustworthy data source at all times, especially during the COVID-19 outbreak, when fake news spread widely; the best way to deal with it is early detection. Accordingly, in this work we propose a hybrid deep learning model that uses a convolutional neural network (CNN) and long short-term memory (LSTM) to detect COVID-19 fake news. The proposed model consists of the following layers: an embedding layer, a convolutional layer, a pooling layer, an LSTM layer, a flatten layer, a dense layer, and an output layer. For the experiments, three COVID-19 fake news datasets are used to evaluate six machine learning models (DT, KNN, LR, RF, SVM, and NB), two deep learning models (CNN and LSTM), and our proposed model. Four metrics are used to validate the results: accuracy, precision, recall, and F1-measure. The conducted experiments show that the proposed model outperforms the six machine learning models and the two deep learning models, and is thus capable of effectively detecting COVID-19 fake news.
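A minimal Keras sketch of the layer stack listed above (embedding, convolution, pooling, LSTM, flatten, dense, output); the vocabulary size, sequence length, filter counts, and unit widths are illustrative assumptions, not the paper's reported hyperparameters:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 20_000, 200       # assumed tokenizer settings

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),          # embedding layer
    layers.Conv1D(64, 5, activation="relu"),    # convolutional layer
    layers.MaxPooling1D(pool_size=2),           # pooling layer
    layers.LSTM(64, return_sequences=True),     # LSTM layer
    layers.Flatten(),                           # flatten layer
    layers.Dense(32, activation="relu"),        # dense layer
    layers.Dense(1, activation="sigmoid"),      # output layer: fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```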


2020 ◽  
Author(s):  
Hirofumi Obinata ◽  
Peiying Ruan ◽  
Hitoshi Mori ◽  
Wentao Zhu ◽  
Hisashi Sasaki ◽  
...  

Abstract This study investigated the utility of artificial intelligence in predicting disease progression. We analysed 194 patients with COVID-19 confirmed by reverse transcription polymerase chain reaction. Among them, 31 patients received oxygen therapy after admission. To assess the utility of artificial intelligence in predicting disease progression, we used three machine learning models employing clinical features (patient background, laboratory data, and symptoms), one deep learning model employing computed tomography (CT) images, and one multimodal deep learning model employing a combination of clinical features and CT images. We also evaluated the predictive values of these models and analysed the features important for predicting worsening in cases of COVID-19. The multimodal deep learning model had the highest accuracy, and the CT image was an important feature of this model. The area under the curve of all machine learning models employing clinical features and of the deep learning model employing CT images exceeded 90%, and the sensitivity of these models exceeded 95%. C-reactive protein and lactate dehydrogenase were important features of the machine learning models. Our machine learning model, while slightly less accurate than the multimodal model, still provides a valuable medical triage tool for patients in the early stages of COVID-19.
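The abstract does not specify the fusion architecture; a common "late fusion" design concatenates a CNN representation of the CT image with an encoded clinical-feature vector before classification. A minimal Keras sketch under that assumption (image size, feature count, and layer widths are all illustrative):

```python
from tensorflow.keras import layers, Model

# Branch 1: CT image (resolution and channel count are assumptions)
img_in = layers.Input(shape=(224, 224, 1))
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Branch 2: clinical features (background, labs, symptoms); length assumed
clin_in = layers.Input(shape=(30,))
y = layers.Dense(32, activation="relu")(clin_in)

# Late fusion: concatenate the two representations, then classify
z = layers.concatenate([x, y])
z = layers.Dense(32, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid")(z)   # 1 = disease worsening

model = Model(inputs=[img_in, clin_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["AUC", "Recall"])
model.summary()
```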


2020 ◽  
Vol 34 (7) ◽  
pp. 717-730 ◽  
Author(s):  
Matthew C. Robinson ◽  
Robert C. Glen ◽  
Alpha A. Lee

Abstract Machine learning methods have the potential to significantly accelerate drug discovery. However, the increasing rate at which new methodological approaches are published raises the fundamental question of how models should be benchmarked and validated. We reanalyze the data generated by a recently published large-scale comparison of machine learning models for bioactivity prediction and arrive at a somewhat different conclusion. We show that the performance of support vector machines is competitive with that of deep learning methods. Additionally, using a series of numerical experiments, we question the relevance of the area under the receiver operating characteristic curve as a metric in virtual screening. We further suggest that the area under the precision–recall curve should be used in conjunction with the receiver operating characteristic curve. Our numerical experiments also highlight challenges in estimating the uncertainty in model performance via scaffold-split nested cross-validation.
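The point about ROC versus precision–recall AUC is easy to reproduce: in a heavily imbalanced screen, ROC AUC can look strong while average precision stays modest. A small synthetic sketch with scikit-learn (the 1% active rate and score distributions are arbitrary illustrations):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000
y_true = rng.random(n) < 0.01                     # ~1% actives, 99% decoys
y_score = rng.normal(0.0, 1.0, n) + 2.0 * y_true  # actives score higher on average

# ROC AUC looks excellent on this imbalanced screen...
print("ROC AUC:", round(roc_auc_score(y_true, y_score), 3))
# ...while the area under the precision-recall curve is far more sobering
print("PR AUC: ", round(average_precision_score(y_true, y_score), 3))
```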


Author(s):  
S. Sasikala ◽  
S. J. Subhashini ◽  
P. Alli ◽  
J. Jane Rubel Angelina

Machine learning is a technique of parsing data, learning from that data, and then applying what has been learned to make informed decisions. Deep learning is a subset of machine learning: it functions in the same way but has different capabilities. The main difference is that conventional machine learning models improve progressively but still need some guidance; if a machine learning model returns an inaccurate prediction, the programmer needs to fix that problem explicitly, whereas a deep learning model corrects itself. An automatic car driving system is a good example of deep learning. Artificial intelligence, on the other hand, is a broader concept: both machine learning and deep learning are subsets of AI.


2020 ◽  
Vol 245 ◽  
pp. 06019
Author(s):  
Kim Albertsson ◽  
Sitong An ◽  
Sergei Gleyzer ◽  
Lorenzo Moneta ◽  
Joana Niermann ◽  
...  

ROOT provides, through TMVA, machine learning tools for data analysis at HEP experiments and beyond. We present recently added features in TMVA and the strategy for future developments in the diversified machine learning landscape. The focus is on fast machine learning inference, which enables analysts to deploy their machine learning models rapidly on large-scale datasets. The new developments are paired with newly designed C++ and Python interfaces supporting modern C++ paradigms and full interoperability within the Python ecosystem. We also present a new deep learning implementation for convolutional neural networks using the cuDNN library for GPUs. We show benchmarking results in terms of training and inference time, compared with other machine learning libraries such as Keras/TensorFlow.
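We do not reproduce the TMVA interfaces here from memory; as a generic illustration of the kind of inference-throughput benchmark described, the following sketch times batched predictions with Keras (one of the comparison libraries named above) on synthetic data. The model, dataset size, and batch size are arbitrary choices:

```python
import time
import numpy as np
import tensorflow as tf

# A small dense classifier as a stand-in for a trained analysis model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

X = np.random.default_rng(0).normal(size=(1_000_000, 10)).astype("float32")

model.predict(X[:1], verbose=0)          # warm-up to exclude graph-build time
t0 = time.perf_counter()
model.predict(X, batch_size=8192, verbose=0)
dt = time.perf_counter() - t0
print(f"inference throughput: {X.shape[0] / dt:,.0f} events/s")
```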


2021 ◽  
Vol 3 ◽  
Author(s):  
Yue Guo ◽  
Changye Li ◽  
Carol Roan ◽  
Serguei Pakhomov ◽  
Trevor Cohen

Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. In the medical domain in particular, this is a stumbling block, as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task, is widely used to study the language characteristics of Alzheimer's disease (AD) patients and to train classification models that distinguish patients with AD from healthy controls. It is, however, relatively small, a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by adding normative data from other sources, and evaluate the utility of such extrinsic data for further improving the performance of state-of-the-art deep learning methods on the ADReSS dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1,366 additional Cookie Theft Task transcripts and increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, the cognitive status of WLS participants can be inferred from the results of several cognitive tests available in the WLS data, including semantic verbal fluency. In this work, we evaluate the utility of using the WLS ‘controls’ (participants without indications of abnormal cognitive status), and of these data in conjunction with inferred ‘cases’ (participants with such indications), for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. We find that incorporating WLS data when training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that WLS data add value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address class imbalance and confounding effects due to data provenance.
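A weighted cost function of the kind mentioned can be as simple as class-weighted cross-entropy on the classifier logits. A minimal PyTorch sketch (the class counts, batch size, and two-class setup are illustrative, not the paper's actual configuration):

```python
import torch
import torch.nn as nn

# Suppose the training set is imbalanced: many more controls than cases.
n_controls, n_cases = 900, 100                        # arbitrary illustrative counts
weights = torch.tensor([1.0, n_controls / n_cases])   # up-weight the rare class

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2, requires_grad=True)   # stand-in for BERT [CLS] logits
labels = torch.randint(0, 2, (8,))               # 0 = control, 1 = case
loss = criterion(logits, labels)                 # errors on cases cost 9x more
loss.backward()
print(loss.item())
```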


2021 ◽  
Author(s):  
Liyang Wang ◽  
Dantong Niu ◽  
Xinjie Zhao ◽  
Xiaoya Wang ◽  
Mengzhen Hao ◽  
...  

Abstract Traditional food allergen identification relies mainly on in vivo and in vitro experiments, which are often time-consuming and costly. Artificial intelligence (AI)-driven rapid food allergen identification methods overcome these two drawbacks and are becoming efficient auxiliary tools. Aiming to overcome the limited accuracy of traditional machine learning models in allergenicity prediction, this work introduces a transformer deep learning model with a self-attention mechanism, as well as ensemble learning models (represented by the Light Gradient Boosting Machine (LightGBM) and eXtreme Gradient Boosting (XGBoost)). To highlight the superiority of the proposed method, the study also selected various commonly used machine learning models as baseline classifiers. In 5-fold cross-validation, the AUC of the deep model was the highest (0.9400), better than the ensemble learning and baseline algorithms, but it needed to be pre-trained and had the highest training cost. Comparing the characteristics of the transformer model and the boosting models shows that each type has its own advantages, which provides novel clues and inspiration for the rapid prediction of food allergens in the future.
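A minimal sketch of a 5-fold cross-validated AUC comparison for the two named boosting models, using their scikit-learn wrappers; the features are synthetic stand-ins (real inputs would be sequence-derived protein descriptors), and the hyperparameters are library defaults:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Stand-in feature matrix, e.g. numerical descriptors of protein sequences
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))
y = rng.integers(0, 2, size=2_000)       # 1 = allergenic, 0 = non-allergenic

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [("LightGBM", LGBMClassifier()), ("XGBoost", XGBClassifier())]:
    aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {aucs.mean():.4f} (+/- {aucs.std():.4f})")
```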

