How to make more from exposure data? An integrated machine learning pipeline to predict pathogen exposure

2019 ◽  
Author(s):  
Nicholas M. Fountain-Jones ◽  
Gustavo Machado ◽  
Scott Carver ◽  
Craig Packer ◽  
Mariana Recamonde-Mendoza ◽  
...  

Abstract: Predicting infectious disease dynamics is a central challenge in disease ecology. Models that can assess which individuals are most at risk of being exposed to a pathogen not only provide valuable insights into disease transmission and dynamics but can also guide management interventions. Constructing such models for wild animal populations, however, is particularly challenging; often only serological data is available on a subset of individuals and non-linear relationships between variables are common. Here we take advantage of the latest advances in statistical machine learning to construct pathogen-risk models that automatically incorporate complex non-linear relationships with minimal statistical assumptions from ecological data with missing values. Our approach compares multiple machine learning algorithms in a unified environment to find the model with the best predictive performance and uses game theory to better interpret results. We apply this framework to two major pathogens that infect African lions: canine distemper virus (CDV) and feline parvovirus. Our modelling approach provided enhanced predictive performance compared to more traditional approaches, as well as new insights into disease risks in a wild population. We were able to efficiently capture and visualise strong non-linear patterns, as well as model complex interactions between variables in shaping exposure risk from CDV and feline parvovirus. For example, we found that lions were more likely to be exposed to CDV at a young age but only in low rainfall years. When combined with our data calibration approach, our framework helped us to answer questions about risk of pathogen exposure which are difficult to address with previous methods. Our framework not only has the potential to aid in predicting disease risk in animal populations, but also can be used to build robust predictive models suitable for other ecological applications such as modelling species distribution or diversity patterns.
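The abstract does not name the game-theoretic method it uses; one common choice for interpreting model predictions is the Shapley value, which is assumed here. The sketch below computes exact Shapley values for a hypothetical two-feature risk function (`toy_risk`, with invented coefficients) that mimics the kind of age-by-rainfall interaction described above. It illustrates the idea only and is not the authors' pipeline.

```python
from itertools import permutations


def toy_risk(features):
    """Hypothetical exposure-risk model with an age x rainfall interaction.

    The coefficients are invented for illustration only.
    """
    young = features["young"]
    low_rain = features["low_rain"]
    return 0.1 + 0.1 * young + 0.4 * young * low_rain


def shapley_values(model, instance, baseline):
    """Exact Shapley values: the average marginal contribution of each
    feature over every possible ordering of the features.

    A feature that is "absent" takes its value from `baseline`.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            # Switch this feature from its baseline value to the observed one.
            current[name] = instance[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


# A young lion observed in a low-rainfall year, compared with an
# old lion in a high-rainfall year as the baseline.
instance = {"young": 1, "low_rain": 1}
baseline = {"young": 0, "low_rain": 0}
phi = shapley_values(toy_risk, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is what makes them usable as an additive explanation. Exact enumeration grows factorially with the number of features, so practical tools rely on approximations.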

2021 ◽  
Author(s):  
Wei Qiu ◽  
Hugh Chen ◽  
Ayse Berceste Dincer ◽  
Su-In Lee

Abstract: Explainable artificial intelligence provides an opportunity to improve prediction accuracy over standard linear models using “black box” machine learning (ML) models while still revealing insights into a complex outcome such as all-cause mortality. We propose the IMPACT (Interpretable Machine learning Prediction of All-Cause morTality) framework that implements and explains complex, non-linear ML models in epidemiological research, by combining a tree ensemble mortality prediction model and an explainability method. We use 133 variables from NHANES 1999–2014 datasets (number of samples: n = 47,261) to predict all-cause mortality. To explain our model, we extract local (i.e., per-sample) explanations to verify well-studied mortality risk factors, and make new discoveries. We present major factors for predicting x-year mortality (x = 1, 3, 5) across different age groups and their individualized impact on mortality prediction. Moreover, we highlight interactions between risk factors associated with mortality prediction, which leads to findings that linear models do not reveal. We demonstrate that compared with traditional linear models, tree-based models have unique strengths such as: (1) improving prediction power, (2) making no distribution assumptions, (3) capturing non-linear relationships and important thresholds, (4) identifying feature interactions, and (5) detecting different non-linear relationships between models. Given the popularity of complex ML models in prognostic research, combining these models with explainability methods has implications for further applications of ML in medical fields. To our knowledge, this is the first study that combines complex ML models and state-of-the-art feature attributions to explain mortality prediction, which enables us to achieve higher prediction accuracy and gain new insights into the effect of risk factors on mortality.


2021 ◽  
Vol 11 (7) ◽  
pp. 3227
Author(s):  
Lkhagvadorj Munkhdalai ◽  
Keun Ho Ryu ◽  
Oyun-Erdene Namsrai ◽  
Nipon Theera-Umpon

Credit scoring is a process of determining whether a borrower is successful or unsuccessful in repaying a loan using borrowers’ qualitative and quantitative characteristics. In recent years, machine learning algorithms have become widely studied in the development of credit scoring models. Although efficiently classifying good and bad borrowers is a core objective of the credit scoring model, there is still a need for a model that can explain the relationship between input and output. In this work, we propose a novel partially interpretable adaptive softmax (PIA-Soft) regression model to achieve both state-of-the-art predictive performance and a marginal interpretation of the relationship between input and output. We augment softmax regression with neural networks to make it adaptive for each borrower. Our PIA-Soft model consists of two main components: linear (softmax regression) and non-linear (neural network). The linear part explains the fundamental relationship between input and output variables. The non-linear part serves to improve the prediction performance by identifying the non-linear relationship between features for each borrower. Experimental results on public benchmark datasets show that our proposed model not only outperformed the machine learning baselines but also produced explanations that relate logically to the real world.


2021 ◽  
Vol 13 (14) ◽  
pp. 2779
Author(s):  
Zhou Zang ◽  
Dan Li ◽  
Yushan Guo ◽  
Wenzhong Shi ◽  
Xing Yan

Artificial intelligence is widely applied to estimate ground-level fine particulate matter (PM2.5) from satellite data by constructing the relationship between the aerosol optical thickness (AOT) and the surface PM2.5 concentration. However, aerosol size properties, such as the fine mode fraction (FMF), are rarely considered in satellite-based PM2.5 modeling, especially in machine learning models. This study investigated the linear and non-linear relationships between fine mode AOT (fAOT) and PM2.5 over five AERONET stations in China (Beijing, Baotou, Taihu, Xianghe, and Xuzhou) using AERONET fAOT and 5-year (2015–2019) ground-level PM2.5 data. Results showed that the fAOT separated by the FMF (fAOT = AOT × FMF) had significant linear and non-linear relationships with surface PM2.5. Then, the Himawari-8 V3.0 and V2.1 FMF and AOT (FMF&AOT-PM2.5) data were tested as input to a deep learning model and four classical machine learning models. The results showed that FMF&AOT-PM2.5 performed better than AOT (AOT-PM2.5) in modelling PM2.5 estimations. The FMF was then applied in satellite-based PM2.5 retrieval over China during 2020, and FMF&AOT-PM2.5 was found to have a better agreement with ground-level PM2.5 than AOT-PM2.5 on dust and haze days. The better linear correlation between PM2.5 and fAOT on both haze and dust days (dust days: R = 0.82; haze days: R = 0.56) compared to AOT (dust days: R = 0.72; haze days: R = 0.52) partly contributed to the superior accuracy of FMF&AOT-PM2.5. This study demonstrates the importance of including the FMF to improve PM2.5 estimations and emphasizes the need for a more accurate FMF product that enables superior PM2.5 retrieval.
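The relation fAOT = AOT × FMF and the linear correlation R reported above can be sketched in a few lines. All numeric values below are hypothetical placeholders, not AERONET or Himawari-8 data, and the function names `fine_mode_aot` and `pearson_r` are introduced here for illustration only.

```python
import math


def fine_mode_aot(aot, fmf):
    """Element-wise fAOT = AOT x FMF for paired observations."""
    return [a * f for a, f in zip(aot, fmf)]


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical paired observations (invented for illustration).
aot = [0.3, 0.6, 0.9, 1.2, 0.5]
fmf = [0.8, 0.7, 0.4, 0.3, 0.9]
pm25 = [35.0, 60.0, 48.0, 50.0, 62.0]

faot = fine_mode_aot(aot, fmf)
print("R(AOT, PM2.5)  =", round(pearson_r(aot, pm25), 3))
print("R(fAOT, PM2.5) =", round(pearson_r(faot, pm25), 3))
```

Comparing the two correlations on real station data is the kind of check the abstract describes for dust and haze days; the toy numbers here carry no physical meaning.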


2020 ◽  
Vol 27 ◽  
Author(s):  
Gabriela Bitencourt-Ferreira ◽  
Camila Rizzotto ◽  
Walter Filgueira de Azevedo Junior

Background: Analysis of atomic coordinates of protein-ligand complexes can provide three-dimensional data to generate computational models to evaluate binding affinity and thermodynamic state functions. Application of machine learning techniques can create models to assess protein-ligand potential energy and binding affinity. These methods show superior predictive performance when compared with classical scoring functions available in docking programs. Objective: Our purpose here is to review the development and application of the program SAnDReS. We describe the creation of machine learning models to assess the binding affinity of protein-ligand complexes. Method: SAnDReS implements machine learning methods available in the scikit-learn library. This program is available for download at https://github.com/azevedolab/sandres. SAnDReS uses crystallographic structures, binding, and thermodynamic data to create targeted scoring functions. Results: Recent applications of the program SAnDReS to drug targets such as coagulation factor Xa, cyclin-dependent kinases, and HIV-1 protease were able to create targeted scoring functions to predict inhibition of these proteins. These targeted models outperform classical scoring functions. Conclusion: Here, we reviewed the development of machine learning scoring functions to predict binding affinity through the application of the program SAnDReS. Our studies show the superior predictive performance of the SAnDReS-developed models when compared with classical scoring functions available in programs such as AutoDock4, Molegro Virtual Docker, and AutoDock Vina.


2020 ◽  
Vol 28 (2) ◽  
pp. 253-265 ◽  
Author(s):  
Gabriela Bitencourt-Ferreira ◽  
Amauri Duarte da Silva ◽  
Walter Filgueira de Azevedo

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed to identify new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


Author(s):  
Kazutaka Uchida ◽  
Junichi Kouno ◽  
Shinichi Yoshimura ◽  
Norito Kinjo ◽  
Fumihiro Sakakibara ◽  
...  

Abstract: In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We tried to develop a prehospital stroke scale with ML. We conducted a multi-center retrospective and prospective cohort study. The training cohort had eight centers in Japan from June 2015 to March 2018, and the test cohort had 13 centers from April 2019 to March 2020. We used three different ML algorithms (logistic regression, random forests, XGBoost) to develop models. Main outcomes were large vessel occlusion (LVO), intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), and cerebral infarction (CI) other than LVO. The predictive abilities were validated in the test cohort with accuracy, positive predictive value, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and F score. The training cohort included 3178 patients with 337 LVO, 487 ICH, 131 SAH, and 676 CI cases, and the test cohort included 3127 patients with 183 LVO, 372 ICH, 90 SAH, and 577 CI cases. The overall accuracies were 0.65, and the positive predictive values, sensitivities, specificities, AUCs, and F scores were stable in the test cohort. The classification abilities were also fair for all ML models. The AUCs for LVO of logistic regression, random forests, and XGBoost were 0.89, 0.89, and 0.88, respectively, in the test cohort, and these values were higher than the previously reported prediction models for LVO. The ML models developed to predict the probability and types of stroke at the prehospital stage had superior predictive abilities.
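The validation metrics listed above can be computed from a one-versus-rest confusion matrix and from model scores. The sketch below assumes the "F score" is the F1 score (the abstract does not say which variant) and uses invented counts and scores rather than the study's data; `binary_metrics` and `auc` are illustrative names.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics for one outcome treated as positive vs. all others."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive case outscores a
    random negative case (ties count as half)."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and scores, for illustration only.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
example_auc = auc([0.9, 0.8], [0.1, 0.85])
print(metrics)
print(example_auc)
```

For a multi-class outcome such as LVO, ICH, SAH, and CI, the same functions would be applied once per class, treating that class as positive and all others as negative.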


Author(s):  
Yue Wu ◽  
Jieqiang Zhu ◽  
Peter Fu ◽  
Weida Tong ◽  
Huixiao Hong ◽  
...  

An effective approach for assessing a drug’s potential to induce autoimmune diseases (ADs) is needed in drug development. Here, we aim to develop a workflow to examine the association between structural alerts and drug-induced ADs to improve toxicological prescreening tools. Considering reactive metabolite (RM) formation as a well-documented mechanism for drug-induced ADs, we investigated whether the presence of certain RM-related structural alerts was predictive for the risk of drug-induced AD. We constructed a database containing 171 RM-related structural alerts, generated a dataset of 407 AD- and non-AD-associated drugs, and performed statistical analysis. The nitrogen-containing benzene substituent alerts were found to be significantly associated with the risk of drug-induced ADs (odds ratio = 2.95, p = 0.0036). Furthermore, we developed a machine-learning-based predictive model by using daily dose and nitrogen-containing benzene substituent alerts as the top inputs and achieved an area under the curve (AUC) of 70%. Additionally, we confirmed the reactivity of the nitrogen-containing benzene substituent aniline and related metabolites using quantum chemistry analysis and explored the underlying mechanisms. These identified structural alerts could be helpful in identifying drug candidates that carry a potential risk of drug-induced ADs to improve their safety profiles.
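An odds ratio like the one reported above is computed from a 2×2 table of alert presence against AD association. The sketch below uses invented counts (the abstract does not give the underlying table, so it does not reproduce the reported 2.95) and adds a Woolf-type 95% confidence interval as one common way to express uncertainty; the abstract does not state which test the authors used.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:

                     AD-associated   not AD-associated
    alert present          a                 b
    alert absent           c                 d
    """
    return (a * d) / (b * c)


def woolf_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval on the log-odds scale (Woolf method)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, for illustration only.
a, b, c, d = 20, 10, 30, 40
estimate = odds_ratio(a, b, c, d)
lower, upper = woolf_ci(a, b, c, d)
print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 indicates an association at roughly the 5% level; with small cell counts an exact test is usually preferred over this approximation.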

