Prospective and external validation of stroke discharge planning machine learning models

2022 ◽  
Vol 96 ◽  
pp. 80-84
Author(s):  
Stephen Bacchi ◽  
Luke Oakden-Rayner ◽  
David K Menon ◽  
Andrew Moey ◽  
Jim Jannes ◽  
...  
2020 ◽  
Author(s):  
Govinda KC ◽  
Giovanni Bocci ◽  
Srijan Verma ◽  
Mahmudulla Hassan ◽  
Jayme Holmes ◽  
...  

Strategies for drug discovery and repositioning are an urgent need with respect to COVID-19. We developed "REDIAL-2020", a suite of machine learning models for estimating small molecule activity from molecular structure, for a range of SARS-CoV-2 related assays. Each classifier is based on three distinct types of descriptors (fingerprint, physicochemical, and pharmacophore) for parallel model development. These models were trained using high throughput screening data from the NCATS COVID19 portal (https://opendata.ncats.nih.gov/covid19/index.html), with multiple categorical machine learning algorithms. The “best models” are combined in an ensemble consensus predictor that outperforms single models where external validation is available. This suite of machine learning models is available through the DrugCentral web portal (http://drugcentral.org/Redial). Acceptable input formats are: drug name, PubChem CID, or SMILES; the output is an estimate of anti-SARS-CoV-2 activities. The web application reports estimated activity across three areas (viral entry, viral replication, and live virus infectivity) spanning six independent models, followed by a similarity search that displays the most similar molecules to the query among experimentally determined data. The ML models have 60% to 74% external predictivity, based on three separate datasets. Complementing the NCATS COVID19 portal, REDIAL-2020 can serve as a rapid online tool for identifying active molecules for COVID-19 treatment. The source code and specific models are available through Github (https://github.com/sirimullalab/redial-2020), or via Docker Hub (https://hub.docker.com/r/sirimullalab/redial-2020) for users preferring a containerized version.
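
The abstract above describes combining per-descriptor classifiers into an ensemble consensus predictor but does not state the combination rule. The sketch below assumes a simple majority vote over binary active/inactive labels; the function name `consensus_predict` and the dictionary layout are hypothetical and are not taken from the REDIAL-2020 code base.

```python
def consensus_predict(votes: dict[str, int]) -> int:
    """Return 1 (active) if a strict majority of classifiers vote active, else 0.

    `votes` maps a descriptor type (e.g. "fingerprint") to that classifier's
    binary label. Majority voting is an assumption made for illustration.
    """
    if not votes:
        raise ValueError("at least one classifier vote is required")
    active = sum(1 for label in votes.values() if label == 1)
    # Strict majority: with three voters and binary labels, ties cannot occur.
    return 1 if active * 2 > len(votes) else 0


if __name__ == "__main__":
    example = {"fingerprint": 1, "physicochemical": 0, "pharmacophore": 1}
    print(consensus_predict(example))
```

With an odd number of voters a strict majority always exists, which is one reason a three-descriptor setup pairs naturally with this kind of rule.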


2020 ◽  
Author(s):  
Janmajay Singh ◽  
Masahiro Sato ◽  
Tomoko Ohkuma

BACKGROUND: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models, but their effectiveness has not been comprehensively evaluated. OBJECTIVE: This research studies the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explores the robustness of these features across patient subgroups and task settings. METHODS: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated for discriminative ability and calibration, both overall and across population subgroups. RESULTS: Including missingness features generally improved model performance in retrospective tasks. The extent of improvement depended on the outcome of interest (gains in area under the receiver operating characteristic curve [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. Although including these features led to earlier detection of disease (true positives), it also led to a concomitant rise in false positive detections. CONCLUSIONS: This study comprehensively evaluated the effectiveness of missingness features in machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction, where they present the greatest benefit. While missingness features, which reflect health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
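
The abstract above does not specify how the missingness features were encoded. As a minimal sketch of one common representation, the code below pairs each forward-filled measurement with a binary missing flag and a count of time steps since the last observation. The function name, the tuple layout, and the default fill value are assumptions made for illustration only.

```python
from typing import Optional


def add_missingness_features(
    series: list[Optional[float]], default: float = 0.0
) -> list[tuple[float, int, int]]:
    """Turn a series with gaps into (value, is_missing, steps_since_observed) rows.

    Missing entries (None) are forward-filled with the last observed value,
    or with `default` (an assumed placeholder) if nothing was observed yet.
    """
    rows = []
    last_value = default
    steps_since_observed = 0
    for x in series:
        if x is None:
            steps_since_observed += 1
            rows.append((last_value, 1, steps_since_observed))
        else:
            last_value = x
            steps_since_observed = 0
            rows.append((x, 0, 0))
    return rows


if __name__ == "__main__":
    # Hypothetical hourly measurements with two gaps.
    for row in add_missingness_features([7.4, None, None, 7.1]):
        print(row)
```

Rows like these could then be fed to a sequence model such as a gated recurrent unit, with the flag and counter columns acting as the missingness features.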


Author(s):  
M. VALKEMA ◽  
H. LINGSMA ◽  
P. LAMBIN ◽  
J. VAN LANSCHOT

Biostatistics versus machine learning: from traditional prediction models to automated medical analysis
Machine learning is increasingly applied to medical data to develop clinical prediction models. This paper discusses the application of machine learning in comparison with traditional biostatistical methods. Biostatistics is well suited to structured datasets, and the selection of variables for a biostatistical prediction model is primarily knowledge-driven. A similar approach is possible with machine learning. In addition, machine learning allows for the analysis of unstructured datasets, derived for example from medical imaging and written text in patient records. In contrast to biostatistics, the selection of variables with machine learning is mainly data-driven. Complex machine learning models are able to detect nonlinear patterns and interactions in data, but this requires large datasets to prevent overfitting. For both machine learning and biostatistics, external validation of a developed model in a comparable setting is required to evaluate the model’s reproducibility. Machine learning models are not easily implemented in clinical practice because they are perceived as black boxes (i.e., non-intuitive). To address this, research initiatives are ongoing within the field of explainable artificial intelligence. Finally, the application of machine learning to automated imaging analysis and the development of clinical decision support systems is discussed.


Author(s):  
Federico Cabitza ◽  
Andrea Campagner ◽  
Felipe Soares ◽  
Luis Garcia de Guadiana Romualdo ◽  
Feyissa Challa ◽  
...  

10.2196/25022 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e25022
Author(s):  
Janmajay Singh ◽  
Masahiro Sato ◽  
Tomoko Ohkuma

Background: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models, but their effectiveness has not been comprehensively evaluated. Objective: This research studies the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explores the robustness of these features across patient subgroups and task settings. Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated for discriminative ability and calibration, both overall and across population subgroups. Results: Including missingness features generally improved model performance in retrospective tasks. The extent of improvement depended on the outcome of interest (gains in area under the receiver operating characteristic curve [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. Although including these features led to earlier detection of disease (true positives), it also led to a concomitant rise in false positive detections. Conclusions: This study comprehensively evaluated the effectiveness of missingness features in machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction, where they present the greatest benefit. While missingness features, which reflect health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.


2020 ◽  
Author(s):  
Qin-Yu Zhao ◽  
Le-Ping Liu ◽  
Jing-Chao Luo ◽  
Yan-Wei Luo ◽  
Huan Wang ◽  
...  

Background: Sepsis-induced coagulopathy (SIC) is associated with increased mortality and a poorer prognosis in septic patients. Methods: Machine-learning models were developed using septic patients in the Medical Information Mart for Intensive Care (MIMIC)-IV database who were older than 18 years and stayed in intensive care units (ICUs) for more than 24 hours. Eighty-eight potential predictors were extracted, and 15 different machine-learning models were used to assess the daily risk of SIC. The best-performing model was selected based on its accuracy and area under the receiver operating characteristic curve (AUC), followed by fine-grained hyperparameter tuning using the Bayesian Optimization Algorithm. The effects of features on prediction scores were measured using SHapley Additive exPlanations (SHAP) values. A compact model was developed based on 15 features selected according to their importance and clinical availability. The two models were compared with Logistic Regression and SIC scores in terms of SIC prediction. Additionally, external validation was performed in the eICU Collaborative Research Database (eICU-CRD). Results: Of the 11362 patients in MIMIC-IV included in the final cohort, 6744 (59%) had SIC during sepsis, and 16183 samples were extracted. Categorical Boosting (CatBoost) had the greatest AUC in the study (0.869 [0.850, 0.886]). Coagulation profile and renal function indicators were the most important features for predicting SIC. The compact model achieved an AUC of 0.854 [0.832, 0.872], while the AUCs of Logistic Regression and SIC scores were 0.746 [0.735, 0.755] and 0.709 [0.687, 0.733], respectively. For external validation, a cohort of 35252 septic patients in eICU-CRD was analyzed. The AUCs of the full and compact models in external validation were 0.842 [0.837, 0.846] and 0.803 [0.798, 0.809], respectively, both higher than those of Logistic Regression (0.660 [0.653, 0.667]) and SIC scores (0.752 [0.747, 0.757]). Prediction results can be illustrated at the instance level using SHAP values, which makes the models clinically interpretable. Conclusions: We developed two models that dynamically predict the risk of SIC in septic patients better than conventional Logistic Regression and SIC scores. Predictions of both models can be interpreted using SHAP values.
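
The abstract above reports each AUC with a bracketed interval but does not say how the intervals were obtained. The sketch below assumes a percentile bootstrap, which is one common choice; the function names, the number of resamples, and the toy data are all hypothetical and carry no results from the study.

```python
import random


def auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, for illustration)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 1, 0, 1]  # toy labels, not study data
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
    print(auc(y, s), bootstrap_auc_ci(y, s))
```

The pairwise AUC here is quadratic in sample size, which is fine for a toy example; a rank-based implementation would be preferable for cohorts of the size reported above.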

