Estimating Real World Performance of a Predictive Model: A Case-Study in Predicting End-of-Life

2019 ◽  
Author(s):  
Vincent J Major ◽  
Neil Jethani ◽  
Yindalon Aphinyanaphongs

Abstract
Objective: The main criterion for choosing how models are built is the subsequent effect on future (estimated) model performance. In this work, we evaluate the effects of experimental design choices on both estimated and actual model performance.
Materials and Methods: Four years of hospital admissions are used to develop a 1-year end-of-life prediction model. Two common methods to select appropriate prediction timepoints (backwards-from-outcome and forwards-from-admission) are introduced and combined with two ways of separating cohorts for training and testing (internal and temporal). Two models are trained under identical conditions, and their performances are compared. Finally, operating thresholds are selected in each test set and applied in a final, 'real-world' cohort consisting of one year of admissions.
Results: Backwards-from-outcome cohort selection discards 75% of candidate admissions (n = 23,579), whereas forwards-from-admission selection includes many more (n = 92,148). Both selection methods produce similar global performances when applied to an internal test set. However, when applied to the temporally defined 'real-world' set, forwards-from-admission yields higher areas under the ROC and precision-recall curves (88.3% and 56.5% vs. 83.2% and 41.6%).
Discussion: A backwards-from-outcome experiment effectively transforms the training data such that it no longer resembles real-world data. This results in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance.
Conclusion: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.
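
To make the internal-versus-temporal distinction concrete, here is a minimal sketch of the two split strategies, assuming a hypothetical admissions table with an admit_date column (the file and column names are illustrative, not the authors' data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical admissions table; file and column names are illustrative only.
admissions = pd.read_csv("admissions.csv", parse_dates=["admit_date"])

# Internal (random) validation: train and test drawn from the same years.
train_int, test_int = train_test_split(admissions, test_size=0.2, random_state=0)

# Temporal validation: train on earlier years, test on the most recent year,
# approximating "train on the past, deploy on the future".
cutoff = admissions["admit_date"].max() - pd.DateOffset(years=1)
train_temp = admissions[admissions["admit_date"] <= cutoff]
test_temp = admissions[admissions["admit_date"] > cutoff]
```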

JAMIA Open ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 243-251
Author(s):  
Vincent J Major ◽  
Neil Jethani ◽  
Yindalon Aphinyanaphongs

Abstract
Objective: One primary consideration when developing predictive models is downstream effects on future model performance. We conduct experiments to quantify the effects of experimental design choices, namely cohort selection and internal validation methods, on (estimated) real-world model performance.
Materials and Methods: Four years of hospitalizations are used to develop a 1-year mortality prediction model (composite of death or initiation of hospice care). Two common methods to select appropriate patient visits from their encounter history (backwards-from-outcome and forwards-from-admission) are combined with 2 testing cohorts (random and temporal validation). Two models are trained under otherwise identical conditions, and their performances compared. Operating thresholds are selected in each test set and applied to a "real-world" cohort of labeled admissions from another, unused year.
Results: Backwards-from-outcome cohort selection retains 25% of candidate admissions (n = 23,579), whereas forwards-from-admission selection includes many more (n = 92,148). Both selection methods produce similar performances when applied to a random test set. However, when applied to the temporally defined "real-world" set, forwards-from-admission yields higher areas under the ROC and precision-recall curves (88.3% and 56.5% vs. 83.2% and 41.6%).
Discussion: A backwards-from-outcome experiment manipulates raw training data, simplifying the experiment. This manipulated data no longer resembles real-world data, resulting in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance.
Conclusion: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.
Lay Summary: The routine care of patients stands to benefit greatly from assistive technologies, including data-driven risk assessment. Already, many different machine learning and artificial intelligence applications are being developed from complex electronic health record data. To overcome challenges that arise from such data, researchers often start with simple experimental approaches to test their work. One key component is how patients (and their healthcare visits) are selected for the study from the pool of all patients seen. Another is how the group of patients used to create the risk estimator differs from the group used to evaluate how well it works. These choices complicate how the experimental setting compares to the real-world application to patients. For example, different selection approaches that depend on each patient's future outcome can simplify the experiment but are impractical upon implementation as these data are unavailable. We show that this kind of "backwards" experiment optimistically estimates how well the model performs. Instead, our results advocate for experiments that select patients in a "forwards" manner and "temporal" validation that approximates training on past data and implementing on future data. More robust results help gauge the clinical utility of recent works and aid decision-making before implementation into practice.
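
The operating-threshold step generalizes beyond this study: a threshold is chosen on the test set and then frozen before being applied to the later "real-world" cohort. A minimal sketch under randomly generated scores and labels (purely illustrative, not the study's data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder scores/labels standing in for a trained model's test-set output.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 1000)
scores_test = rng.random(1000) + 0.3 * y_test

# Choose the lowest threshold reaching a target precision on the test set...
precision, recall, thresholds = precision_recall_curve(y_test, scores_test)
target = 0.5
idx = np.argmax(precision[:-1] >= target)  # precision[:-1] aligns with thresholds
threshold = thresholds[idx]

# ...then freeze it and apply it unchanged to the later "real-world" cohort.
y_rw = rng.integers(0, 2, 1000)
scores_rw = rng.random(1000) + 0.2 * y_rw
flagged = scores_rw >= threshold
print(f"threshold={threshold:.3f}, flagged {flagged.sum()} of {len(flagged)}")
```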


2020 ◽  
Author(s):  
Jenna M Reps ◽  
Peter Rijnbeek ◽  
Alana Cuthbert ◽  
Patrick B Ryan ◽  
Nicole Pratt ◽  
...  

Abstract
Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up.
Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study, we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: i) include all patients (including those lost to follow-up), ii) exclude all patients lost to follow-up, or iii) exclude only patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate the discrimination and calibration performance.
Results: The partially synthetic data study results show that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of how to address it had negligible impact on model performance. Our empirical real-world data results showed that the four design choices investigated to deal with loss to follow-up resulted in comparable performance when the time-at-risk was 1 year, but demonstrated differential bias at a 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome but keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided.
Conclusion: Based on this study we therefore recommend i) developing models using data that include patients who are lost to follow-up, and ii) evaluating the discrimination and calibration of models twice: on a test set including patients lost to follow-up and on a test set excluding patients lost to follow-up.
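
A minimal sketch of the three binary-classifier cohort strategies, using a toy table with illustrative column names (the fourth, survival-model strategy keeps all patients and instead models time to event):

```python
import pandas as pd

# Toy patient table; columns are illustrative stand-ins for the paper's cohorts.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "lost_to_followup": [False, True, True, False, True],
    "outcome": [True, False, True, False, False],  # outcome seen before censoring
})

# Strategy i: include all patients, lost to follow-up or not.
cohort_i = df

# Strategy ii: exclude every patient lost to follow-up.
cohort_ii = df[~df["lost_to_followup"]]

# Strategy iii: exclude only patients lost to follow-up without a prior outcome;
# censored patients who already had the outcome are kept. The paper warns this
# combination biases the model and should be avoided.
cohort_iii = df[~(df["lost_to_followup"] & ~df["outcome"])]
```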


2021 ◽  
Vol 39 (28_suppl) ◽  
pp. 253-253
Author(s):  
Maureen Canavan ◽  
Xiaoliang Wang ◽  
Mustafa Ascha ◽  
Rebecca A. Miksad ◽  
Timothy N Showalter ◽  
...  

Background: Among patients with cancer, receipt of systemic oncolytic therapy near the end of life (EOL) does not improve outcomes and worsens patient and caregiver experience. Accordingly, the ASCO/NQF measure, Proportion Receiving Chemotherapy in the Last 14 Days of Life, was published in 2012. Over the last decade there has been exponential growth in high-cost targeted and immune therapies, which may be perceived as less toxic than traditional chemotherapy. In this study, we identified rates and types of EOL systemic therapy in today's real-world practice; these can serve as benchmarks for cancer care organizations to drive improvement efforts.
Methods: Using data from the nationwide Flatiron Health electronic health record (EHR)-derived de-identified database, we included patients who died during 2015 through 2019, were diagnosed after 2011, and had documented cancer treatment. We identified the use of aggressive EOL systemic treatment (including chemotherapy, immunotherapy, and combinations thereof) at both 30 days and 14 days prior to death. We estimated standardized EOL rates using mixed-level logistic regression models adjusting for patient- and practice-level factors. Year-specific adjusted rates were estimated in annualized stratified analysis.
Results: We included 57,127 patients, 38% of whom had documentation of having received any type of systemic cancer treatment within 30 days of death (SD: 5%; range: 25%-56%), and 17% within 14 days of death (SD: 3%; range: 10%-30%). Chemotherapy alone was the most common EOL treatment received (18% at 30 days, 8% at 14 days), followed by immunotherapy (± other treatment) (11% at 30 days, 4% at 14 days). Overall rates of EOL treatment did not change over the study period: treatment within 30 days (39% in 2015 to 37% in 2019) and within 14 days (17% in 2015 to 17% in 2019) of death. However, the rates of chemotherapy alone within 30 days of death decreased from 24% to 14%, and within 14 days from 10% to 6%, during the study period. In comparison, rates for immunotherapy with chemotherapy (0%-6% at 30 days, 0%-2% at 14 days) and immunotherapy alone or with other treatment types (4%-13% at 30 days, 1%-4% at 14 days) increased over time for both windows.
Conclusions: End-of-life systemic cancer treatment rates have not substantively changed over time despite national efforts and expert guidance. While rates of traditional chemotherapy have decreased, rates of costly immunotherapy and targeted therapy have increased, which has been associated with higher total cost of care and overall healthcare utilization. Future work should examine the drivers of end-of-life care in the era of immuno-oncology.
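
A sketch of how the 30- and 14-day EOL treatment flags could be derived with simple date arithmetic; the toy rows and column names below are illustrative, not the Flatiron schema:

```python
import pandas as pd

# Toy rows; a real analysis would join treatment and death records per patient.
tx = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "treatment_date": pd.to_datetime(["2019-03-01", "2019-03-20", "2019-06-05"]),
    "death_date": pd.to_datetime(["2019-04-01", "2019-04-01", "2019-06-10"]),
})

days_before_death = (tx["death_date"] - tx["treatment_date"]).dt.days
tx["within_30d"] = days_before_death <= 30
tx["within_14d"] = days_before_death <= 14

# Patient-level flags: any systemic treatment inside each EOL window.
eol = tx.groupby("patient_id")[["within_30d", "within_14d"]].any()
print(eol.mean())  # crude (unadjusted) EOL treatment rates
```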


Author(s):  
Martyna Bogacz ◽  
Stephane Hess ◽  
Chiara Calastri ◽  
Charisma F. Choudhury ◽  
Alexander Erath ◽  
...  

The use of virtual reality (VR) in transport research offers the opportunity to collect behavioral data in a controlled dynamic setting. VR settings are useful in the context of hypothetical situations for which real-world data do not exist, or in situations involving risk and safety issues that make real-world data collection infeasible. Nevertheless, VR studies can contribute to transport-related research only if the behavior elicited in a virtual environment closely resembles real-world behavior. Importantly, as VR is a relatively new research tool, best practice with regard to experimental design is still to be established. In this paper, we contribute to a better understanding of the implications of the choice of experimental setup by comparing cycling behavior in VR between two groups of participants in similar immersive scenarios, the first group controlling the maneuvers using a keyboard and the other group riding an instrumented bicycle. We critically compare the speed, acceleration, braking and head movements of the participants in the two experiments. We also collect electroencephalography (EEG) data to compare the alpha wave amplitudes and assess the engagement levels of participants in the two settings. The results demonstrate the ability of VR to elicit behavioral patterns in line with those observed in the real world and indicate the importance of experimental design in a VR environment beyond the choice of audio-visual stimuli. The findings will be useful for researchers designing VR experimental setups for behavioral data collection.
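
As one example of how alpha-wave amplitudes are commonly quantified (not necessarily the authors' pipeline), alpha-band power can be estimated from a Welch power spectral density; the sampling rate and signal below are placeholders:

```python
import numpy as np
from scipy.signal import welch

fs = 256  # Hz; assumed sampling rate
eeg = np.random.randn(60 * fs)  # placeholder standing in for one EEG channel

# Welch power spectral density, then average power in the alpha band (8-12 Hz).
freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)
alpha_mask = (freqs >= 8) & (freqs <= 12)
alpha_power = psd[alpha_mask].mean()
print(f"mean alpha-band power: {alpha_power:.4f}")
```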


2019 ◽  
Vol 8 (1) ◽  
pp. 20-31 ◽  
Author(s):  
Kamran Iqbal ◽  
Kate Halsby ◽  
Robert D Murray ◽  
Paul V Carroll ◽  
Robert Petermann

Background and objectives: Glucocorticoids are used to manage adrenal insufficiency (AI). We describe treatments used in the United Kingdom and real-world clinical outcomes for each treatment.
Methods: We used 2010–2016 primary care data from The Health Improvement Network (THIN). Descriptive analyses were conducted, and differences in variables between patients prescribed immediate-release hydrocortisone (IR HC), prednisolone or modified-release hydrocortisone (MR HC) were assessed using Fisher's exact test.
Results: Overall, 2648 patients were included: 1912 on IR HC (72%), 691 on prednisolone (26%) and 45 (2%) on MR HC. A total of 1174 (44.3%) had primary and 1150 (43.4%) had secondary AI. Patients on prednisolone were older (P < 0.001) and had a greater history of smoking (292/691, P < 0.001) and CVD (275/691, P < 0.001). Patients on MR HC had more PCOS (3/45, P = 0.001) and diabetes (27/45, P = 0.004). The number of GP visits/patient/year was 6.50 in the IR HC, 9.54 in the prednisolone and 9.11 in the MR HC cohorts. The mean number of A&E visits and inpatient and outpatient hospital admissions ranged from 0.42 to 0.93 visits/patient/year. The mean number of adrenal crises (AC)/patient/year was between 0.02 and 0.03 for all cohorts.
Conclusion: IR HC is most commonly used for the management of AI in the United Kingdom, followed by prednisolone. Few patients receive MR HC. The prednisolone and MR HC cohorts displayed a greater prevalence of vascular risk factors compared with IR HC. The occurrence of AC and primary and secondary care resource use were similar between treatment cohorts, and they indicate significant resource utilisation. Improved treatment and management of patients with AI is needed.
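
For reference, cohort differences of the kind reported here can be tested with Fisher's exact test on a 2x2 contingency table; the counts below are illustrative, not taken from the paper:

```python
from scipy.stats import fisher_exact

# Illustrative 2x2 table: rows = smoking history (yes/no),
# columns = cohort (prednisolone / IR HC). Counts are made up.
table = [[292, 400], [399, 1500]]
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.3g}")
```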


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e20002-e20002
Author(s):  
Li Zhou ◽  
Rob Steen ◽  
Lynn Lu

Background: Identifying optimal therapy options can help maximize treatment outcomes, and finding ways to improve treatment decisions is of great value for better patient care. With the availability of robust patient real-world data and the application of state-of-the-art Artificial Intelligence and Machine Learning (AIML) technology, new opportunities have emerged for a broad spectrum of research needs, from oncology R&D to commercialization. To illustrate these advancements, this study identified patients diagnosed with CLL who may progress to the next line of treatment in the near future (e.g., within 3 months). More importantly, we can identify treatment patterns that are more effective in treating different types of CLL patients.
Methods: This study includes multiple steps, each already analyzed for feasibility:
1. Collect CLL patients. IQVIA's real-world data contain ~60,000 actively treated CLL patients, of whom ~2,000 progressed to the next line of treatment within 3 months.
2. Define positive and negative cohorts based on those who have/have not advanced to line L2+.
3. Determine patient profiles based on treatment regimens, symptoms, lab tests, doctor visits, hospital visits, comorbidities, etc.
4. Select patient and treatment features to fit an AIML predictive model.
5. Test different algorithms to achieve the best model results and validate model performance.
6. Score and classify CLL patients into high and low probability based on the predictive model.
7. Match patients based on feature importance and compare regimens between the positive and negative cohorts.
Results: Model accuracy is above 90%. Top clinical features are calculated for each patient. Optimal treatment patterns between high- and low-probability patients are identified, controlling for key patient features.
Conclusions: Conclusions from this study are expected to yield deeper insight into more tailored treatments by patient type. CLL patients started on oral (targeted) therapy have better responses than those on other treatments.
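
A minimal sketch of steps 4-6, assuming a generic gradient-boosting classifier on placeholder data; the real study uses IQVIA patient profiles, which are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder feature matrix standing in for patient profiles (regimens, labs,
# visits, comorbidities); labels mark progression to L2+ within 3 months.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = rng.integers(0, 2, size=5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

# Score patients into high/low progression probability, as in step 6.
prob = model.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, prob))  # ~0.5 here, since labels are random
print("high-probability patients:", (prob > 0.5).sum())
```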


Diabetes ◽  
2018 ◽  
Vol 67 (Supplement 1) ◽  
pp. 1264-P
Author(s):  
Reema Mody ◽ 
Qing Huang ◽ 
Maria Yu ◽ 
Ruizhi Zhao ◽ 
Hiren Patel ◽ 
...

Author(s):  
Boyang Li ◽  
Jinglu Hu ◽  
Kotaro Hirasawa

We propose an improved support vector machine (SVM) classifier that introduces a new offset for solving real-world unbalanced classification problems. The new offset is calculated from the unbalanced support vectors that result from unbalanced training data. We developed a weighted harmonic mean (WHM) algorithm to further reduce the effects of noise on the offset calculation. We apply the proposed approach to classify real-world data, and simulation results demonstrate its effectiveness.
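
A rough sketch of the general idea: shift the SVM decision threshold using the class-wise decision values of the support vectors. The plain means below are a stand-in for the paper's weighted harmonic mean, which additionally damps noisy support vectors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced toy data (~9:1), standing in for the paper's real-world sets.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Decision values of the support vectors, split by class.
d = clf.decision_function(clf.support_vectors_)
d_neg = d[y[clf.support_] == 0]
d_pos = d[y[clf.support_] == 1]

# Offset correction: move the boundary toward the midpoint of the two
# class-wise mean margins (the paper replaces these plain means with a WHM).
offset = 0.5 * (d_neg.mean() + d_pos.mean())
y_pred = (clf.decision_function(X) > offset).astype(int)
print("positives predicted:", y_pred.sum())
```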


2018 ◽  
Vol 210 ◽  
pp. 04019 ◽  
Author(s):  
Hyontai SUG

Recent Go matches between humans and artificial intelligence, namely AlphaGo, showed the great advances in machine learning technologies. While AlphaGo was trained using real-world data, AlphaGo Zero was trained on massive amounts of self-generated data starting from random play, and the fact that AlphaGo Zero beat AlphaGo decisively revealed that diversity and size of training data are important for better performance of machine learning algorithms, especially deep learning algorithms for neural networks. Artificial neural networks and decision trees, meanwhile, are widely accepted machine learning algorithms because of their robustness to errors and their comprehensibility, respectively. In this paper, to show empirically that diversity and size of data are important factors for better performance of machine learning algorithms, these two representative algorithms are used for experiments. A real-world dataset called breast tissue was chosen because it consists of real numbers, a property well suited to generating artificial random data. The results of the experiment confirm that diversity and size of data are very important factors for better performance.
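
To make the setup concrete, here is a hedged sketch of the kind of comparison described: enlarging a real-valued training set with jittered, artificially generated copies and retraining both model families. sklearn's breast cancer dataset stands in for the UCI breast tissue data used in the paper:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer  # stand-in for UCI "breast tissue"
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def augment(X, y, factor=4, noise=0.01):
    """Enlarge the training set by jittering real-valued features."""
    Xs = [X] + [X + np.random.normal(0, noise * X.std(axis=0), X.shape)
                for _ in range(factor)]
    return np.vstack(Xs), np.tile(y, factor + 1)

X_aug, y_aug = augment(X_tr, y_tr)

for model in (DecisionTreeClassifier(random_state=0),
              MLPClassifier(max_iter=2000, random_state=0)):
    small = model.fit(X_tr, y_tr).score(X_te, y_te)
    large = model.fit(X_aug, y_aug).score(X_te, y_te)
    print(type(model).__name__, f"original={small:.3f} augmented={large:.3f}")
```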

