Missing Data in Longitudinal Covariates in Building a Cox Prediction Model: Overview and Some Practical Guidance

Author(s):
Joseph Kang
2020
Vol 5
pp. 50

Author(s):
Luke Daines
Laura J. Bonnett
Andy Boyd
Steve Turner
Steff Lewis
...

Background: Accurately diagnosing asthma can be challenging. Uncertainty about the best combination of clinical features and investigations for asthma diagnosis is reflected in conflicting recommendations from international guidelines. One solution could be a clinical prediction model to help health professionals estimate the probability of an asthma diagnosis. However, systematic review evidence indicates that existing models for asthma diagnosis are at high risk of bias and unsuitable for clinical use. Mindful of these limitations, this protocol describes plans to derive and validate a prediction model for use by healthcare professionals to aid diagnostic decision making during assessment of a child or young person with symptoms suggestive of asthma in primary care.
Methods: The prediction model will be derived using data from the Avon Longitudinal Study of Parents and Children (ALSPAC) and linked primary care electronic health records (EHR). Data will be included from study participants up to 25 years of age where permissions exist to use their linked EHR. Participants will be identified as having asthma if they received at least three prescriptions for an inhaled corticosteroid within a one-year period and have an asthma code in their EHR. To deal with missing data, we will consider a complete case analysis; however, if excluding cases with missing data substantially reduces the total sample size, multiple imputation will be used. A multivariable logistic regression model will be fitted with backward stepwise selection of candidate predictors. Apparent model performance will be assessed before internal validation using bootstrapping techniques. The model will be adjusted for optimism before external validation in a dataset created from the Optimum Patient Care Research Database.
Discussion: This protocol describes a robust strategy for the derivation and validation of a prediction model to support the diagnosis of asthma in children and young people in primary care.
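The missing-data decision in the Methods above (complete case analysis unless exclusions substantially shrink the sample, else multiple imputation) can be sketched as a simple rule. The 10% loss threshold and the field names below are illustrative assumptions, not part of the protocol:

```python
def choose_missing_data_strategy(records, key_vars, max_loss=0.10):
    """Return 'complete_case' if dropping incomplete records loses at most
    max_loss of the sample, otherwise fall back to 'multiple_imputation'.
    The 10% threshold is a hypothetical cut-off, for illustration only."""
    complete = [r for r in records if all(r.get(v) is not None for v in key_vars)]
    loss = 1 - len(complete) / len(records)
    strategy = "complete_case" if loss <= max_loss else "multiple_imputation"
    return strategy, complete

# Hypothetical example: 1 of 4 records is missing a predictor value,
# so 25% of the sample would be lost, exceeding the 10% threshold
records = [
    {"wheeze": 1, "fev1": 1.8},
    {"wheeze": 0, "fev1": 2.1},
    {"wheeze": 1, "fev1": None},  # incomplete record
    {"wheeze": 0, "fev1": 2.4},
]
strategy, usable = choose_missing_data_strategy(records, ["wheeze", "fev1"])
# -> ("multiple_imputation", 3 complete records)
```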


2019
Vol 15 (1)
pp. 13-17
Author(s):
Nurul Latiffah Abd Rani
Azman Azid
Muhamad Shirwan Abdullah Sani
Mohd Saiful Samsudin
Ku Mohd Kalkausar Ku Yusof
...

Carbon monoxide (CO) is one of the most important pollutants since it is among the parameters used to calculate the Air Pollutant Index (API). It is therefore paramount to ensure there are no missing CO data during analysis. Several occurrences can cause missing data, such as the inability of an instrument to record certain parameters. In view of this, a CO prediction model needs to be developed to address the problem. A dataset of meteorological and air pollutant values was obtained from the Air Quality Division, Department of Environment Malaysia (DOE). A total of 113,112 records were used to develop the model using sensitivity analysis (SA) through an artificial neural network (ANN). SA showed that particulate matter (PM10) and ozone (O3) were the most significant input variables for the missing data prediction model of CO. Three hidden nodes were the optimum number for the ANN model, with an R2 of 0.5311. Both models (artificial neural network-carbon monoxide-all parameters (ANN-CO-AP) and artificial neural network-carbon monoxide-leave out (ANN-CO-LO)) showed high R2 values (0.7639 and 0.5311) and low RMSE values (0.2482 and 0.3506), respectively. These values indicate that a model may employ only the most significant input variables to represent CO rather than using all input variables.
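The R2 and RMSE figures quoted above follow their standard definitions. The sketch below is a generic implementation of those two metrics, not the authors' DOE/ANN pipeline; the toy concentrations are made up:

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# Toy observed vs predicted CO concentrations, for illustration only
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
# r_squared(obs, pred) -> 0.98; rmse(obs, pred) -> about 0.158
```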


2021
Vol 2021
pp. 1-9
Author(s):
Peyman Almasinejad
Amin Golabpour
Mohammad Reza Mollakhalili Meybodi
Kamal Mirzaie
Ahmad Khosravi

Missing data occur in all research, especially in medical studies. Missing data arise when part of the research data has not been reported, resulting in a mismatch between the sample and the population and in misguided conclusions; the extent of missingness determines how misleading those conclusions will be. Most methods of parameter estimation and prediction modelling assume the data are complete, and extensive missing data produce false predictions and increased bias. In the present study, a novel method is proposed for the imputation of missing medical data. The method determines which algorithm is suitable for imputing the missing data. To do so, a multiobjective particle swarm optimization algorithm was used. The algorithm imputes the missing data such that, if a prediction model is applied to the data, both specificity and sensitivity are optimized. The proposed model was evaluated using real data on gastric cancer and acute T-cell leukemia (ATLL). First, the proposed model was used to impute the missing data. Then, the missing data were imputed using deletion, average, expectation maximization, MICE, and missForest methods. Finally, the prediction model was applied to both imputed datasets. The accuracy of the prediction model for the first and the second imputation methods was 0.5 and 16.5, respectively. The novel imputation method was more accurate than similar algorithms such as expectation maximization and MICE.
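The two objectives the swarm optimises, sensitivity and specificity, have standard confusion-matrix definitions. The sketch below shows those definitions and one hypothetical way a particle's fitness could scalarise them; the equal weighting is an illustrative assumption, not the authors' MOPSO formulation:

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def imputation_fitness(tp, fn, tn, fp):
    """Scalarised fitness a particle could maximise when scoring a candidate
    imputation; the 50/50 weighting of the two objectives is hypothetical."""
    return 0.5 * sensitivity(tp, fn) + 0.5 * specificity(tn, fp)

# Toy confusion counts from a classifier run on one imputed dataset
fitness = imputation_fitness(tp=8, fn=2, tn=9, fp=1)  # -> 0.85
```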


2021
Vol ahead-of-print (ahead-of-print)
Author(s):
Ling Jiang
Tingsheng Zhao
Chuxuan Feng
Wei Zhang

Purpose: This research aims to predict tower crane accident phases with incomplete data.
Design/methodology/approach: Tower crane accident records are collected for prediction model training, and random forest (RF) is used for prediction. When new inputs contain missing values, these must be filled in advance; however, complete data are difficult to collect on construction sites, so the authors use a multiple imputation (MI) method to improve RF. Finally, the prediction model is applied to a case study.
Findings: The results show that multiple imputation random forest (MIRF) can effectively predict tower crane accidents when the data are incomplete. This research provides an importance ranking of tower crane safety factors. The critical factors should be monitored on site, because missing data for them seriously affect the prediction results; the values of the critical factors also influence tower crane safety.
Practical implications: This research promotes the application of machine learning methods for accident prediction in actual projects. From on-site data, the accident phase of a tower crane can be predicted, and the results can be used for accident prevention.
Originality/value: Previous studies have seldom predicted tower crane accidents, especially the accident phase. This research uses tower crane data collected on site to predict the phase of a tower crane accident, and incomplete data collection is considered according to the actual situation.
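One way to read the MIRF idea: fill each missing field several times by drawing from values observed on site, run the trained predictor on each completed input, and pool the predictions by majority vote. The hot-deck draw, the pooling scheme, and the wind-speed predictor below are illustrative assumptions, not the authors' exact method:

```python
import random
from collections import Counter

def mi_predict(record, observed_pools, predict, m=20, seed=0):
    """Multiply-impute missing fields (None) by sampling observed values,
    predict on each completed record, and return the majority-vote label."""
    rng = random.Random(seed)
    votes = []
    for _ in range(m):
        filled = {k: (v if v is not None else rng.choice(observed_pools[k]))
                  for k, v in record.items()}
        votes.append(predict(filled))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical classifier standing in for the trained random forest:
# flags high risk when wind speed is at least 10 m/s
predict = lambda r: "high_risk" if r["wind"] >= 10 else "low_risk"
pools = {"wind": [12, 11, 13], "load": [3, 4]}  # values observed on site
label = mi_predict({"wind": None, "load": 3}, pools, predict)
# every plausible wind draw is >= 10, so the pooled label is "high_risk"
```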


2017
Vol 25 (3)
pp. 951-959
Author(s):
Gregor Stiglic
Primoz Kocbek
Nino Fijacko
Aziz Sheikh
Majda Pajnkihar

The increasing availability of data stored in electronic health records brings substantial opportunities for advancing patient care and population health. This is, however, fundamentally dependent on the completeness and quality of the data in these records. We sought to use electronic health record data to populate a risk prediction model for identifying patients with undiagnosed type 2 diabetes mellitus. We, however, found substantial (up to 90%) amounts of missing data in some healthcare centres. Attempts at imputing these missing data, or at using a reduced dataset obtained by removing incomplete records, resulted in a major deterioration in the performance of the prediction model. This case study illustrates the substantial wasted opportunities resulting from incomplete records, demonstrated by simulating missing and incomplete records in the predictive modelling process. Government and professional bodies need to prioritise efforts to address these data shortcomings to ensure that electronic health record data are maximally exploited for patient and population benefit.
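The scale of the problem is easy to see with a back-of-the-envelope calculation: if each of k predictor fields is independently missing with probability p, only (1-p)^k of records survive a complete-case filter. Independence is a simplifying assumption for illustration; real EHR missingness is usually correlated across fields:

```python
def expected_complete_fraction(p_missing, n_fields):
    """Expected fraction of records with no missing fields, assuming each of
    n_fields is independently missing with probability p_missing."""
    return (1 - p_missing) ** n_fields

# Even modest 20% per-field missingness across 10 predictors leaves
# barely a tenth of the records usable for complete-case analysis
frac = expected_complete_fraction(0.2, 10)  # -> about 0.107
```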


2021
Vol 50 (Supplement_1)
Author(s):
Katherine Lee
James Carpenter
Roderick Little
Cattram Nguyen
Rosie Cornish

Abstract
Focus and outcomes for participants: Missing data are ubiquitous in observational studies, and the simple solution of restricting the analyses to the subset with complete records will often result in bias and loss of power. The seriousness of these issues for the resulting inferences depends on both the mechanism causing the missing data and the form of the substantive question and associated model. The methodological literature on the analysis of partially observed data has grown substantially over the last twenty years, and although there is increasing guidance on how to handle missing data, practice is changing slowly and misapprehensions abound, particularly in observational research. Importantly, the lack of transparency around methodological decisions is threatening the validity and reproducibility of modern research. In this symposium, leading researchers in missing data methodology will present practical guidance on how to select an appropriate method for handling missing data, how to report the results of such an analysis, and how to conduct sensitivity analyses within the multiple imputation framework.
Rationale for the symposium, including for its inclusion in the Congress: One of the sub-themes of WCE 2021 is "Translation from research to policy and practice". Although there is a growing body of literature on missing data methodology, evidence from systematic reviews suggests that missing data are still often not handled appropriately. If practice is to change, it is important to educate applied researchers about the available methodology and to provide practical guidance on determining the best method for handling missing data. An important part of this is guidance on reporting results from analyses with missing data, which is particularly pertinent given the current emphasis on the reproducibility of research findings. In this symposium we present some of the latest research from the Missing Data Topic Group of the STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative, whose aim is to provide accessible and accurate guidance on the design and analysis of observational studies in order to increase the reliability and validity of observational research.


2021
Vol 11 (14)
pp. 6364
Author(s):
Chun-Te Huang
Rong-Ching Chang
Yi-Lu Tsai
Kai-Chih Pai
Tsai-Jung Wang
...

Acute kidney injury (AKI) refers to a rapid decline of kidney function and is manifested by decreasing urine output or abnormal blood tests (elevated serum creatinine). Electronic health records (EHRs) are fundamental for clinicians, and for machine learning algorithms, in predicting the clinical outcomes of patients in the Intensive Care Unit (ICU). Early prediction of AKI can automatically warn clinicians to review possible risk factors and act in advance to prevent it. However, the enormous volume of patient data usually forms a relatively incomplete dataset, which is very challenging for supervised machine learning. In this paper, we propose an entropy-based feature engineering framework for vital signs based on their frequency of records. In particular, we address the missing at random (MAR) and missing not at random (MNAR) types of missing data according to different clinical scenarios. To demonstrate applicability, we applied the framework to establish a prediction model for future AKI in ICU patients using 4278 ICU admissions from a tertiary hospital. Our results show that the proposed entropy-based features are feasible for use in the AKI prediction model, and that its performance improves as data availability increases. In addition, we study the performance of the AKI prediction model by comparing different time gaps and feature windows with the proposed vital sign entropy features. This work can serve as guidance for feature window selection and missing data processing during the development of a prediction model in the ICU.
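One plausible reading of an entropy-based frequency feature is the Shannon entropy of how a vital sign's records distribute over time bins: even charting yields high entropy, while records clustered into a few bins (or absent) yield low entropy. The 6-hour binning below is an illustrative assumption, not the authors' exact framework:

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (bits) of a count distribution; empty bins contribute
    nothing, and an all-empty window is defined as zero entropy."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Records per 6-hour bin over one day for a hypothetical vital sign
uniform = [4, 4, 4, 4]     # charted evenly -> maximal entropy, 2 bits
clustered = [16, 0, 0, 0]  # all records in one bin -> 0 bits
```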


2020
Vol 2020
pp. 1-5
Author(s):
Ru Yandong
Lv Xingfeng
Guo Jikun
Zhang Hongquan
Chen Lijuan

Coal and gas outburst has been one of the main threats to coal mine safety, and accurate outburst prediction is the key to avoiding accidents. Existing prediction models assume by default that the data are accurate and complete; in practice, however, missing data and abnormal values often occur, resulting in poor prediction performance. Therefore, this paper proposes, for the first time, using the correlation coefficient to fill missing data in real time. Abnormal values are identified based on the Pauta criterion, and a random forest model is used for prediction. The model achieved a sensitivity of 100%, an accuracy of 97.5%, and a specificity of 84.6%. Experiments show that the model can predict coal and gas outburst in real time under conditions of missing data and abnormal values, and can serve as a new prediction model for coal and gas outburst.
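The Pauta criterion mentioned above is the familiar three-sigma rule: a value is abnormal if it lies more than three standard deviations from the mean. A minimal sketch (the sensor readings are made up for illustration):

```python
import statistics

def pauta_outliers(values):
    """Flag values outside mean +/- 3 * (population) standard deviation,
    i.e. abnormal under the Pauta (three-sigma) criterion."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [x for x in values if abs(x - mu) > 3 * sigma]

# Hypothetical gas-concentration readings with one spurious spike
readings = [10.0] * 30 + [100.0]
bad = pauta_outliers(readings)  # only the 100.0 spike is flagged
```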


Atmosphere
2019
Vol 10 (11)
pp. 718
Author(s):
Park
Kim
Lee
Kim
Song
...

In this paper, we propose a new temperature prediction model based on deep learning using real observed weather data. A huge amount of training data is needed for such a model, and the data should be free of defects. However, weather data collection is inherently limited because missed measurements cannot be taken again, so the collected data are apt to be incomplete, with random or extended gaps. The proposed temperature prediction model is therefore also used to refine missing data, restoring the missed weather records. Since temperature is seasonal, the model utilizes a long short-term memory (LSTM) neural network, a kind of recurrent neural network known to be suitable for time-series modeling. Different configurations of LSTMs are investigated so that the proposed LSTM-based model can reflect the time-series characteristics of the temperature data. In particular, when part of the data is detected as missing, it is restored using the proposed model's refinement function; after all missing data have been refined, the LSTM-based model is retrained on the refined data. The proposed model predicts temperature at three time steps (6, 12, and 24 h) and is extended to predict temperatures 7 and 14 days ahead. Performance is measured by root-mean-squared error (RMSE) and compared with the RMSEs of a feedforward deep neural network, a conventional LSTM neural network without a refinement function, and a mathematical model currently used by the meteorological office in Korea. The proposed LSTM-based model employing LSTM refinement achieves the lowest RMSEs for 6, 12, and 24 h temperature prediction as well as for 7 and 14 day temperature prediction, compared with other DNN-based and LSTM-based models using either no refinement or linear interpolation. Moreover, the prediction accuracy of the proposed model is higher than that of the Unified Model (UM) Local Data Assimilation and Prediction System (LDAPS) for 24 h temperature predictions.
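The refinement step can be pictured as walking the series and replacing each detected gap with a one-step-ahead forecast computed from the already-refined history. In the sketch below, a trivial persistence forecast stands in for the trained LSTM, purely for illustration:

```python
def refine_series(series, forecast_next):
    """Replace each missing value (None) with a forecast computed from the
    (already refined) preceding values, left to right."""
    refined = list(series)
    for i, value in enumerate(refined):
        if value is None:
            refined[i] = forecast_next(refined[:i])
    return refined

# Persistence forecast (repeat the last value) as a stand-in for the LSTM;
# assumes the series does not start with a gap
persistence = lambda history: history[-1]
temps = [21.5, None, None, 23.0]
filled = refine_series(temps, persistence)  # -> [21.5, 21.5, 21.5, 23.0]
```

In the full scheme described above, the refined series would then be fed back to retrain the forecaster, and the refine/retrain loop repeated.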

