Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data

2018 ◽  
Vol 25 (8) ◽  
pp. 969-975 ◽  
Author(s):  
Jenna M Reps ◽  
Martijn J Schuemie ◽  
Marc A Suchard ◽  
Patrick B Ryan ◽  
Peter R Rijnbeek

Abstract Objective To develop a conceptual prediction model framework containing standardized steps and describe the corresponding open-source software developed to consistently implement the framework across computational environments and observational healthcare databases to enable model sharing and reproducibility. Methods Based on existing best practices, we propose a 5-step standardized framework for: (1) transparently defining the problem; (2) selecting suitable datasets; (3) constructing variables from the observational data; (4) learning the predictive model; and (5) validating the model performance. We implemented this framework as open-source software utilizing the Observational Medical Outcomes Partnership Common Data Model to enable convenient sharing of models and reproduction of model evaluation across multiple observational datasets. The software implementation contains default covariates and classifiers, but the framework enables customization and extension. Results As a proof-of-concept, demonstrating the transparency and ease of model dissemination using the software, we developed prediction models for 21 different outcomes within a target population of people suffering from depression across 4 observational databases. All 84 models are available in an accessible online repository to be implemented by anyone with access to an observational database in the Common Data Model format. Conclusions The proof-of-concept study illustrates the framework’s ability to develop reproducible models that can be readily shared, offers the potential to perform extensive external validation of models, and improves their likelihood of clinical uptake. In future work, the framework will be applied to perform an “all-by-all” prediction analysis to assess the observational data prediction domain across numerous target populations, outcomes, and time-at-risk settings.
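The five steps can be sketched end-to-end in Python. The actual OHDSI implementation is an R package, so everything below — the function names, the toy patient rows, and the averaging "model" standing in for a real learner — is a hypothetical illustration, not the published API:

```python
# Hedged sketch of the 5-step framework; all names are illustrative only.

def define_problem():                               # step 1: transparent problem definition
    return {"target": "depression", "outcome": "stroke", "time_at_risk_days": 365}

def select_dataset():                               # step 2: toy stand-in for an OMOP CDM database
    return [
        {"age": 70, "prior_mi": 1, "outcome": 1},
        {"age": 40, "prior_mi": 0, "outcome": 0},
        {"age": 65, "prior_mi": 1, "outcome": 1},
        {"age": 30, "prior_mi": 0, "outcome": 0},
    ]

def construct_covariates(rows):                     # step 3: variables from observational data
    return [([r["age"] / 100.0, r["prior_mi"]], r["outcome"]) for r in rows]

def learn_model(data):                              # step 4: placeholder for a real learner (LASSO etc.)
    return lambda x: sum(x) / len(x)

def validate(model, data):                          # step 5: AUC via pairwise concordance
    pos = [model(x) for x, y in data if y == 1]
    neg = [model(x) for x, y in data if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

problem = define_problem()
data = construct_covariates(select_dataset())
model = learn_model(data)
auc = validate(model, data)
```

On the toy data the higher-risk patients score higher, so the sketch's AUC is 1.0; the point is only the shape of the pipeline, in which each step is a swappable, standardized component.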

Author(s):  
Sooyoung Yoo ◽  
Jinwook Choi ◽  
Borim Ryu ◽  
Seok Kim

Abstract Background Unplanned hospital readmission after discharge reflects low satisfaction and reliability in care and the possibility of potential medical accidents, and is thus indicative of the quality of patient care and the appropriateness of discharge plans. Objectives The purpose of this study was to develop and validate prediction models for all-cause unplanned hospital readmissions within 30 days of discharge, based on a common data model (CDM), which can be applied to multiple institutions for efficient readmission management. Methods Retrospective patient-level prediction models were developed based on clinical data of two tertiary general university hospitals converted into the CDM developed by the Observational Medical Outcomes Partnership. Machine learning classification models based on LASSO logistic regression, decision tree, AdaBoost, random forest, and gradient boosting machine (GBM) were developed and tested by manipulating a set of CDM variables. An internal 10-fold cross-validation was performed on the target data of the model. To examine its transportability, the model was externally validated. Model performance was evaluated using the area under the curve (AUC). Results Based on the time interval for outcome prediction, it was confirmed that the prediction model targeting the variables obtained within 30 days of discharge was the most efficient (AUC of 82.75). The external validation showed that the model is transferable, with the combination of various clinical covariates. Above all, the prediction model based on the GBM showed the highest AUC performance of 84.14 ± 0.015 for the Seoul National University Hospital cohort, yielding 78.33 in external validation. Conclusions This study showed that readmission prediction models developed using machine-learning techniques and a CDM can be a useful tool to compare two hospitals in terms of patient-data features.
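The internal 10-fold cross-validation step can be sketched as below. The midpoint-threshold "learner" is a hypothetical stand-in for the GBM/LASSO models used in the study, and plain accuracy stands in for AUC:

```python
# Hedged sketch of k-fold cross-validation; the learner is a toy stand-in.

def k_fold_indices(n, k=10):
    """Assign sample indices to k folds round-robin."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def fit_midpoint(xs, ys):
    """Toy learner: threshold at the midpoint of the two class means."""
    mu0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mu1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mu0 + mu1) / 2.0

def accuracy(threshold, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if (x > threshold) == (y == 1)) / len(xs)

def cross_validate(xs, ys, k=10):
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        train = [i for i in range(len(xs)) if i not in set(held_out)]
        model = fit_midpoint([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(model, [xs[i] for i in held_out],
                               [ys[i] for i in held_out]))
    return sum(scores) / len(scores)          # mean held-out performance

xs = list(range(20))
ys = [1 if x >= 10 else 0 for x in xs]
mean_acc = cross_validate(xs, ys, k=10)
```

Each fold is held out exactly once, so the mean held-out score estimates how the model would perform on unseen patients from the same hospital; external validation (the second hospital) is a separate step.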


2018 ◽  
Vol 8 (4) ◽  
pp. 1-23 ◽  
Author(s):  
Deepa Godara ◽  
Amit Choudhary ◽  
Rakesh Kumar Singh

In today's world, the heart of modern technology is software. In order to keep pace with new technology, changes in software are inevitable. This article examines the association between changes and object-oriented metrics using different versions of open source software. Change prediction models can detect the probability of change in a class early in the software life cycle, which results in better effort allocation, more rigorous testing, and easier maintenance of any software. Earlier, researchers used various techniques, such as statistical methods, for the prediction of change-prone classes. In this article, some new metrics such as execution time, frequency, run time information, popularity, and class dependency are proposed which can help in the prediction of change-prone classes. For evaluating the performance of the prediction model, the authors used sensitivity, specificity, and the ROC curve. Higher AUC values indicate that the prediction model gives significantly more accurate results. The proposed metrics contribute to the accurate prediction of change-prone classes.
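The evaluation measures named above — sensitivity, specificity, and the AUC of the ROC curve — can be computed, for example, as:

```python
# Standard binary-classification evaluation measures.

def sensitivity(y_true, y_pred):
    """True positive rate: share of change-prone classes correctly flagged."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    """True negative rate: share of stable classes correctly left unflagged."""
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise concordance (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise formulation makes the interpretation in the abstract concrete: an AUC near 1 means almost every change-prone class is scored above almost every stable one.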


Author(s):  
Kaniz Fatema ◽  
M. M. Mahbubul Syeed ◽  
Imed Hammouda

Open source software (OSS) is currently a widely adopted approach to developing and distributing software. Many commercial companies are using OSS components as part of their product development. For instance, more than 58% of web servers are using an OSS web server, Apache. For effective adoption of OSS, fundamental knowledge of project development is needed. This often calls for reliable prediction models to simulate project evolution and to envision project future. These models provide help in supporting preventive maintenance and building quality software. This chapter reports on a systematic literature survey aimed at the identification and structuring of research that offers prediction models and techniques in analysing OSS projects. The study outcome provides insight into what constitutes the main contributions of the field, identifies gaps and opportunities, and distils several important future research directions. This chapter extends the authors' earlier journal article and offers the following improvements: broader study period, enhanced discussion, and synthesis of reported results.


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e18094-e18094 ◽  
Author(s):  
LaRon Hughes ◽  
Robert L. Grossman ◽  
Zachary Flamig ◽  
Andrew Prokhorenkov ◽  
Michael Lukowski ◽  
...  

e18094 Background: Gen3 is an open source software platform for developing and operating data commons. Gen3 systems are now used by a variety of institutions and agencies to share and analyze large biomedical datasets, including clinical and genomic data. One of the challenges of working with these datasets is the disparate clinical data standards used by researchers across different studies and fields. We have worked to address these hurdles in a variety of ways. Methods: Detailed specifications and features of the Gen3 platform can be found at https://gen3.org/ , with code located on GitHub ( https://github.com/UC-cdis ). Results: The Gen3 data model is a graphical representation of the different nodes, or classes, of data that have been collected. Examples include diagnosis, demographic, exposure, and family history. The properties and values on each node are controlled by the data dictionary specified by the data commons creator. While each commons may have a unique data model and dictionary, specifying external standards allows for easier submission of new data and assists data consumers with interpretation of results. A variety of external references can be supported, but here we demonstrate the use of the National Cancer Institute Thesaurus (NCIt). NCIt provides reference terminologies and biomedical standards that contain a rich set of terms, codes, definitions, and concepts. Using the same reference standards across commons allows for the export of clinical data between commons. The Portable Format for Biomedical Data (PFB) was created to facilitate data export and to allow the data dictionary schema, as well as the raw data, to be compressed and exported. This new file format, which utilizes Avro serialization, is small, fast, easy to modify, and enables simple data export and import. PFB can also house entire external reference ontologies, and it is easy to update the PFB references as changes are introduced. Conclusions: We have shown how the Gen3 data model, the use of external reference standards for clinical data, and the PFB export/import format enable the harmonization of clinical data across different data commons.
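A miniature, hypothetical illustration of the data-dictionary idea described above: each node type declares the properties it allows, and submitted records are validated against those rules. Real Gen3 dictionaries are JSON Schema documents; the node names and rules below are invented for illustration:

```python
# Hypothetical miniature of a data dictionary controlling node properties.
DICTIONARY = {
    "demographic": {
        "gender": {"enum": ["male", "female", "unknown"]},
        "year_of_birth": {"type": int},
    },
    "diagnosis": {
        "primary_site": {"type": str},
    },
}

def validate(node_type, record):
    """Accept a record only if its node type exists and every property
    satisfies the dictionary's enum/type rules for that node."""
    schema = DICTIONARY.get(node_type)
    if schema is None:
        return False
    for key, value in record.items():
        rule = schema.get(key)
        if rule is None:                       # property not in dictionary
            return False
        if "enum" in rule and value not in rule["enum"]:
            return False
        if "type" in rule and not isinstance(value, rule["type"]):
            return False
    return True
```

Pinning enum values to an external terminology such as NCIt is what lets two commons with different dictionaries still exchange interpretable records.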


2021 ◽  
Vol 20 (1) ◽  
pp. 4-14
Author(s):  
K. Azijli ◽  
A.W.E. Lieveld ◽  
S.F.B. van der Horst ◽  
N. de Graaf ◽  
...  

Background: A recent systematic review recommends against the use of any of the current COVID-19 prediction models in clinical practice. To enable clinicians to appropriately profile and treat suspected COVID-19 patients at the emergency department (ED), externally validated models that predict poor outcome are desperately needed. Objective: Our aims were to identify predictors of poor outcome, defined as mortality or ICU admission within 30 days, in patients presenting to the ED with a clinical suspicion of COVID-19, and to develop and externally validate a prediction model for poor outcome. Methods: In this prospective, multi-centre study, we enrolled suspected COVID-19 patients presenting at the EDs of two hospitals in the Netherlands. We used backward logistic regression to develop a prediction model. We used the area under the curve (AUC), Brier score and pseudo-R2 to assess model performance. The model was externally validated in an Italian cohort. Results: We included 1193 patients between March 12 and May 27, 2020, of whom 196 (16.4%) had a poor outcome. We identified 10 predictors of poor outcome: current malignancy (OR 2.774; 95%CI 1.682-4.576), systolic blood pressure (OR 0.981; 95%CI 0.964-0.998), heart rate (OR 1.001; 95%CI 0.97-1.028), respiratory rate (OR 1.078; 95%CI 1.046-1.111), oxygen saturation (OR 0.899; 95%CI 0.850-0.952), body temperature (OR 0.505; 95%CI 0.359-0.710), serum urea (OR 1.404; 95%CI 1.198-1.645), C-reactive protein (OR 1.013; 95%CI 1.001-1.024), lactate dehydrogenase (OR 1.007; 95%CI 1.002-1.013) and SARS-CoV-2 PCR result (OR 2.456; 95%CI 1.526-3.953). The AUC was 0.86 (95%CI 0.83-0.89), with a Brier score of 0.32 and an R2 of 0.41. The AUC in the external validation in 500 patients was 0.70 (95%CI 0.65-0.75). Conclusion: The COVERED risk score showed excellent discriminatory ability, also in external validation. It may aid clinical decision making and improve triage at the ED in health care environments with high patient throughput.
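The reported odds ratios translate into a logistic linear predictor via beta = ln(OR). The abstract does not report the model intercept, so the intercept below is a placeholder, and the function is an illustration of the mechanics, not the published COVERED score:

```python
import math

# ORs from the abstract; for continuous predictors the OR is per unit.
ODDS_RATIOS = {
    "malignancy": 2.774, "systolic_bp": 0.981, "heart_rate": 1.001,
    "resp_rate": 1.078, "spo2": 0.899, "temperature": 0.505,
    "urea": 1.404, "crp": 1.013, "ldh": 1.007, "pcr_positive": 2.456,
}
INTERCEPT = 0.0  # hypothetical placeholder; not reported in the abstract

def predicted_risk(values):
    """Logistic-regression risk: 1 / (1 + exp(-(intercept + sum(beta * x)))).
    `values` maps predictor names to covariate values."""
    lp = INTERCEPT + sum(math.log(ODDS_RATIOS[k]) * v for k, v in values.items())
    return 1.0 / (1.0 + math.exp(-lp))
```

With a single binary predictor the predicted odds equal the OR, so e.g. malignancy alone (at the placeholder intercept) gives a risk of 2.774 / (1 + 2.774).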


2018 ◽  
Vol 62 (9) ◽  
pp. 1301-1312
Author(s):  
Jinyong Wang ◽  
Xiaoping Mi

Abstract Software reliability assessment methods have shifted from closed source to open source software (OSS). Although numerous new approaches for improving OSS reliability have been formulated, they are not used in practice due to their inaccuracy. This study develops a new model that considers the decreasing trend of the fault detection rate to effectively improve OSS reliability assessment. We analyse changes in the instantaneous fault detection rate over time using real-world software fault count data from two actual OSS projects, namely Apache and GNOME, to validate the proposed model's performance. Results show that the proposed model, with its decreasing fault detection rate, has better fitting and predictive performance than traditional closed source software models and other OSS reliability models. The proposed model can more accurately fit and predict the failure process and can thus assist in improving the quality of OSS systems in real-world OSS projects.
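As a hedged illustration of the general idea (not the paper's exact model): in a nonhomogeneous Poisson process reliability model, a per-fault detection rate that decays over time, b(t) = b0 / (1 + c·t), turns the usual m'(t) = b(t)·(a − m(t)) with m(0) = 0 into the closed-form mean value function below:

```python
# Illustrative NHPP reliability model with a decaying fault detection rate.
# a  = total expected faults, b0 = initial detection rate, c = decay speed.

def detection_rate(t, b0, c):
    """Per-fault detection rate b(t) = b0 / (1 + c*t); decreases over time."""
    return b0 / (1.0 + c * t)

def mean_faults(t, a, b0, c):
    """Expected cumulative faults detected by time t:
    m(t) = a * (1 - (1 + c*t) ** (-b0 / c)),
    the solution of m'(t) = b(t) * (a - m(t)) with m(0) = 0."""
    return a * (1.0 - (1.0 + c * t) ** (-b0 / c))
```

The curve starts at zero, rises monotonically, and saturates at a as t grows, while the detection rate itself keeps falling — the qualitative behaviour the abstract attributes to real OSS fault data.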


2021 ◽  
pp. ASN.2020071077
Author(s):  
Chava L. Ramspek ◽  
Marie Evans ◽  
Christoph Wanner ◽  
Christiane Drechsler ◽  
Nicholas C. Chesnaye ◽  
...  

Background Various prediction models have been developed to predict the risk of kidney failure in patients with CKD. However, guideline-recommended models have yet to be compared head to head, their validation in patients with advanced CKD is lacking, and most do not account for competing risks. Methods To externally validate 11 existing models of kidney failure, taking the competing risk of death into account, we included patients with advanced CKD from two large cohorts: the European Quality Study (EQUAL), an ongoing European prospective, multicenter cohort study of older patients with advanced CKD, and the Swedish Renal Registry (SRR), an ongoing registry of nephrology-referred patients with CKD in Sweden. The outcome of the models was kidney failure (defined as RRT-treated ESKD). We assessed model performance with discrimination and calibration. Results The study included 1580 patients from EQUAL and 13,489 patients from SRR. The average c statistic over the 11 validated models was 0.74 in EQUAL and 0.80 in SRR, compared with 0.89 in previous validations. Most models with longer prediction horizons overestimated the risk of kidney failure considerably. The 5-year Kidney Failure Risk Equation (KFRE) overpredicted risk by 10%–18%. The four- and eight-variable 2-year KFRE and the 4-year Grams model showed excellent calibration and good discrimination in both cohorts. Conclusions Some existing models can accurately predict kidney failure in patients with advanced CKD. KFRE performed well for a shorter time frame (2 years), despite not accounting for competing events. Models predicting over a longer time frame (5 years) overestimated risk because of the competing risk of death. The Grams model, which accounts for the latter, is suitable for longer-term predictions (4 years).


2020 ◽  
Author(s):  
Janmajay Singh ◽  
Masahiro Sato ◽  
Tomoko Ohkuma

BACKGROUND Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models, but their effectiveness has not been comprehensively evaluated. OBJECTIVE The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes, and to explore the robustness of these features across patient subgroups and task settings. METHODS A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on discriminative ability and calibration, both overall and across population subgroups. RESULTS Including missingness features generally improved model performance in retrospective tasks. The extent of improvement depended on the outcome of interest (improvements in the area under the receiver operating characteristic curve [AUROC] ranged from 1.2% to 7.7%) and even the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features also led to a concomitant rise in false positive detections. CONCLUSIONS This study comprehensively evaluated the effectiveness of missingness features in machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
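The missingness-indicator encoding studied here can be sketched simply: for each variable, emit its value plus a 0/1 flag recording whether it was observed at all (variable names and the zero-fill convention below are illustrative choices, not the paper's exact preprocessing):

```python
# Illustrative "informative missingness" feature encoding.

def with_missingness_features(record, variables):
    """Return value features plus 0/1 observed-indicators for each variable.
    Unobserved values are zero-filled here; real pipelines may impute
    differently, but the indicator preserves the missingness signal."""
    features = {}
    for var in variables:
        value = record.get(var)                 # None / absent -> missing
        features[var] = 0.0 if value is None else value
        features[var + "_observed"] = 0 if value is None else 1
    return features
```

The indicator is what carries the health-care-process signal: a lactate that was never ordered tells the model something different from a lactate of zero.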


2020 ◽  
Vol 10 (13) ◽  
pp. 4624
Author(s):  
Mitja Gradišnik ◽  
Tina Beranič ◽  
Sašo Karakatič

Software maintenance is one of the key stages in the software lifecycle and includes a variety of activities that consume a significant portion of the costs of a software project. Previous research suggests that future software maintainability can be predicted based on various source code aspects, but most of the research focuses on prediction from the present state of the code and ignores its history. While taking the history into account in software maintainability prediction seems intuitive, this has not been empirically tested, and doing so is the main goal of this paper. This paper empirically evaluates the contribution of historical measurements of the Chidamber & Kemerer (C&K) software metrics to software maintainability prediction models. The main contribution of the paper is the building of prediction models with classification and regression trees and random forest learners in iterations, by gradually adding historical measurement data extracted from previous releases. The maintainability prediction models were built based on software metric measurements obtained from real-world open-source software projects. The analysis of the results shows that an additional amount of historical metric measurements contributes to the maintainability prediction. Additionally, the study evaluates the contribution of individual C&K software metrics to the performance of maintainability prediction models.
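The iterative feature construction described above — appending a class's C&K measurements from previous releases to its current feature vector — can be sketched as follows (the metric subset and the flat list-of-dicts release format are illustrative assumptions):

```python
# Illustrative construction of historical C&K metric features.
# wmc = weighted methods per class, cbo = coupling between objects,
# dit = depth of inheritance tree (a subset of the C&K suite).
METRICS = ("wmc", "cbo", "dit")

def historical_features(releases, cls, history=2):
    """Feature vector for class `cls`: metrics from the latest release,
    followed by metrics from up to `history` earlier releases (newest
    first). Missing metrics default to 0."""
    window = releases[-(history + 1):]          # current + history releases
    feats = []
    for release in reversed(window):            # current release first
        m = release.get(cls, {})
        feats.extend(m.get(name, 0) for name in METRICS)
    return feats
```

Growing `history` from 0 upward reproduces the paper's experimental setup: the same learner is retrained on progressively longer metric histories to see whether prediction improves.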

