Handling missing data in a rheumatoid arthritis registry using random forest approach

Background:Missing data in clinical epidemiological researches violate the intention to treat principle,reduce statistical power and can induce bias if they are related to patient’s response to treatment. In multiple imputation (MI), covariates are included in the imputation equation to predict the values of missing data.Objectives:To find the best approach to estimate and impute the missing values in Kuwait Registry for Rheumatic Diseases (KRRD) patients data.Methods:A number of methods were implemented for dealing with missing data. These includedMultivariate imputation by chained equations(MICE),K-Nearest Neighbors(KNN),Bayesian Principal Component Analysis(BPCA),EM with Bootstrapping(Amelia II),Sequential Random Forest(MissForest) and mean imputation. Choosing the best imputation method wasjudged by the minimum scores ofRoot Mean Square Error(RMSE),Mean Absolute Error(MAE) andKolmogorov–Smirnov D test statistic(KS) between the imputed datapoints and the original datapoints that were subsequently sat to missing.Results:A total of 1,685 rheumatoid arthritis (RA) patients and 10,613 hospital visits were included in the registry. Among them, we found a number of variables that had missing values exceeding 5% of the total values. These included duration of RA (13.0%), smoking history (26.3%), rheumatoid factor (7.93%), anti-citrullinated peptide antibodies (20.5%), anti-nuclear antibodies (20.4%), sicca symptoms (19.2%), family history of a rheumatic disease (28.5%), steroid therapy (5.94%), ESR (5.16%), CRP (22.9%) and SDAI (38.0%), The results showed that among the methods used, MissForest gave the highest level of accuracy to estimate the missing values. It had the least imputation errors for both continuous and categorical variables at each frequency of missingness and it had the smallest prediction differences when the models used imputed laboratory values. In both data sets, MICE had the second least imputation errors and prediction differences, followed by KNN and mean imputation.Conclusion:MissForest is a highly accurate method of imputation for missing data in KRRD and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in clinical predictive models. This approach can be used in registries to improve the accuracy of data, including the ones for rheumatoid arthritis patients.References:[1]Junninen, H.; Niska, H.; Tuppurainen, K.; Ruuskanen, J.; Kolehmainen, M. Methods for imputation ofmissing values in air quality data sets.Atmospheric Environment2004,38, 2895–2907.[2]Norazian, M.N.; Shukri, Y.A.; Azam, R.N.; Al Bakri, A.M.M. Estimation of missing values in air pollutiondata using single imputation techniques.ScienceAsia2008,34, 341–345.[3]Plaia, A.; Bondi, A. Single imputation method of missing values in environmental pollution data sets.Atmospheric Environment2006,40, 7316–7330.[4]Kabir, G.; Tesfamariam, S.; Hemsing, J.; Sadiq, R. Handling incomplete and missing data in water networkdatabase using imputation methods.Sustainable and Resilient Infrastructure2019, pp. 1–13.[5]Di Zio, M.; Guarnera, U.; Luzi, O. Imputation through finite Gaussian mixture models.ComputationalStatistics & Data Analysis2007,51, 5305–5316.Disclosure of Interests:None declared

Download Full-text

Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18031333 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1333

Author(s):

Ahmad R. Alsaber ◽

Jiazhu Pan ◽

Adeeba Al-Hurban

Keyword(s):

Air Quality ◽

Missing Data ◽

Random Forest ◽

Missing Values ◽

Imputation Method ◽

Environmental Data ◽

Environmental Research ◽

Quality Data ◽

Data Set ◽

Air Quality Data

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.

Download Full-text

Random Forest Missing Data Imputation Methods: Implications for Predicting At-Risk Students

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-030-49342-4_29 ◽

2020 ◽

pp. 298-308

Author(s):

Bevan I. Smith ◽

Charles Chimedza ◽

Jacoba H. Bührmann

Keyword(s):

At Risk ◽

Missing Data ◽

Random Forest ◽

At Risk Students ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods

Download Full-text

Investigating the Performance of cart- and Random Forest- Based Procedures for Dealing with Longitudinal Dropout in Small Sample Designs under mnar Missing Data

Longitudinal Multivariate Psychology ◽

10.4324/9781315160542-11 ◽

2018 ◽

pp. 212-239 ◽

Cited By ~ 1

Author(s):

Timothy Hayes

Keyword(s):

Missing Data ◽

Random Forest ◽

Small Sample

Download Full-text

Effect of the Temporal Gradient of Vegetation Indices on Early-Season Wheat Classification Using the Random Forest Classifier

Applied Sciences ◽

10.3390/app8081216 ◽

2018 ◽

Vol 8 (8) ◽

pp. 1216 ◽

Cited By ~ 5

Author(s):

Mousa Abad ◽

Ali Abkar ◽

Barat Mojaradi

Keyword(s):

Missing Data ◽

Random Forest ◽

Winter Wheat ◽

Temporal Resolution ◽

Decision Makers ◽

Wheat Crop ◽

Early Season ◽

Temporal Gradient ◽

Spectral Bands ◽

Multi Temporal

Early-season area estimation of the winter wheat crop as a strategic product is important for decision-makers. Multi-temporal images are the best tool to measure early-season winter wheat crops, but there are issues with classification. Classification of multi-temporal images is affected by factors such as training sample size, temporal resolution, vegetation index (VI) type, temporal gradient of spectral bands and VIs, classifiers, and values missed under cloudy conditions. This study addresses the effect of the temporal resolution and VIs, along with the spectral and VIs gradient on the random forest (RF) classifier when missing data occurs in multi-temporal images. To investigate the appropriate temporal resolution for image acquisition, a study area is selected on an overlapping area between two Landsat Data Continuity Mission (LDCM) paths. In the proposed method, the missing data from cloudy pixels are retrieved using the average of the k-nearest cloudless pixels in the feature space. Next, multi-temporal image analysis is performed by considering different scenarios provided by decision-makers for the desired crop types, which should be extracted early in the season in the study areas. The classification results obtained by RF improved by 2.2% when the temporally-missing data were retrieved using the proposed method. Moreover, the experimental results demonstrated that when the temporal resolution of Landsat-8 is increased to one week, the classification task can be conducted earlier with slightly better overall accuracy (OA) and kappa values. The effect of incorporating VIs along with the temporal gradients of spectral bands and VIs into the RF classifier improved the OA by 3.1% and the kappa value by 6.6%, on average. The results show that if only three optimum images from seasonal changes in crops are available, the temporal gradient of the VIs and spectral bands becomes the primary tool available for discriminating wheat from barley. The results also showed that if wheat and barley are considered as single class versus other classes, with the use of images associated with 162 and 163 paths, both crops can be classified in March (at the beginning of the growth stage) with an overall accuracy of 97.1% and kappa coefficient of 93.5%.

Download Full-text

Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

American Journal of Epidemiology ◽

10.1093/aje/kwt312 ◽

2014 ◽

Vol 179 (6) ◽

pp. 764-774 ◽

Cited By ~ 140

Author(s):

Anoop D. Shah ◽

Jonathan W. Bartlett ◽

James Carpenter ◽

Owen Nicholas ◽

Harry Hemingway

Keyword(s):

Missing Data ◽

Random Forest

Download Full-text

Missing data in randomised controlled trials of rheumatoid arthritis drug therapy are substantial and handled inappropriately

RMD Open ◽

10.1136/rmdopen-2021-001708 ◽

2021 ◽

Vol 7 (2) ◽

pp. e001708

Author(s):

Nasim A Khan ◽

Karina D Torralba ◽

Fawad Aslam

Keyword(s):

Rheumatoid Arthritis ◽

Sensitivity Analysis ◽

Drug Therapy ◽

Missing Data ◽

Primary Outcome ◽

Outcome Data ◽

Data Handling ◽

Significant Discrepancy ◽

Imputation Methods ◽

Simple Imputation

ObjectivesTo analyse the amount, reporting and handling of missing data, approach to intention-to-treat (ITT) principle application and sensitivity analysis utilisation in randomised clinical trials (RCTs) of rheumatoid arthritis (RA). To assess the trend in such reporting 10 years apart (2006 and 2016).MethodsParallel group drug therapy RA RCTs with a clinical primary endpoint.Results176 studies enrolling a median of 160 (IQR 62–339) patients were eligible. In terms of actual analysis: 81 (46%) RCTs conducted ITT, 42 (23.9%) conducted modified ITT while 53 (30.1%) conducted non-ITT analysis. Only 58 of 97 (59.8%) RCTs reporting an ITT analysis actually performed it. The median (IQR) numbers of participants completing the trial and included in analysis for primary outcome were 86% (74%–91%) and 100% (97.1%–100%), respectively. 53 (32.7%) and 65 (40.1%) RCTs had >20% and 10%–20% missing primary outcome data, respectively. Missing data handling was unreported by 58 of 171 (33.9%) RCTs. When reported, vast majority used simple imputation methods. No significant trend towards improved reporting was seen between 2006 and 2016. Sensitivity analysis numerically improved from 2006 to 2016 (14.7% vs 21.4%).ConclusionsThere is significant discrepancy in the reported and the actual performed analysis in RA drug therapy RCTs. Nearly one-third of RCTs had >20% missing data. The reporting and methods of missing data handling remain inadequate with high usage of non-preferred simple imputation methods. Sensitivity analysis utilisation was low. No trend towards better missing data reporting and handling was seen.

Download Full-text

MISSING DATA IMPUTATION ON IOT SENSOR NETWORKS: IMPLICATIONS FOR ON-SITE SENSOR CALIBRATION

10.36227/techrxiv.13633529 ◽

2021 ◽

Author(s):

Nwamaka Okafor

Keyword(s):

Sensor Networks ◽

Missing Data ◽

Random Forest ◽

Data Collection ◽

Missing Values ◽

Imputation Accuracy ◽

Sensor Calibration ◽

Concentration Data ◽

Missing Data Imputation

IoT sensors are gaining more popularity in the environmental monitoring space due to their relatively small size, cost of acquisition and ease of installation and operation. They are becoming increasingly important<br>supplement to traditional monitoring systems, particularly for in-situ based monitoring. However, data collection based on IoT sensors are often plagued with missing values usually occurring as a result of sensor faults, network failures, drifts and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques including Multiple Imputation by Chain Equations (MICE), Random forest based imputation (missForest) and K-Nearest Neighbour (KNN) for handling missing values on sensor networks deployed for the quantification of Green House Gases(GHGs). Two tasks were conducted: first, Ozone (O3) and NO2/O3 concentration data collected using Aeroqual and Cairclip sensors respectively over a six months data collection period were corrupted by removing data intervals at different missing periods (p) where p 2 f1day; 1week; 2weeks; 1monthg and also at random points on the dataset at varying proportion (r) where r 2 f5%; 10%; 30%; 50%; 70%g. The missing data were then filled using the different imputation strategies and their imputation accuracy calculated. Second, the performance of sensor calibration by different regression models including Multi Linear Regression (MLR), Decision Tree (DT), Random Forest (RF) and XGBoost (XGB) trained on the different imputed datasets were evaluated. The analysis showed the MICE technique to outperform the others in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occuring point errors. While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model on the imputed dataset can achieve state of-the-art result on in-situ sensor calibration, improving the data quality of the sensor.

Download Full-text

Missing data completion method based on KNN and random forest

10.1117/12.2622876 ◽

2021 ◽

Author(s):

Songyu Zhang ◽

Yuchen Zhou ◽

Jinghua Yan ◽

Fanliang Bu

Keyword(s):

Missing Data ◽

Random Forest ◽

Data Completion

Download Full-text

Improvement of random forest by multiple imputation applied to tower crane accident prediction with missing data

Engineering Construction & Architectural Management ◽

10.1108/ecam-07-2021-0606 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Ling Jiang ◽

Tingsheng Zhao ◽

Chuxuan Feng ◽

Wei Zhang

Keyword(s):

Missing Data ◽

Random Forest ◽

Prediction Model ◽

Multiple Imputation ◽

Incomplete Data ◽

Missing Values ◽

Critical Factors ◽

Content Type ◽

Accident Prediction ◽

Tower Crane

PurposeThis research is aimed at predicting tower crane accident phases with incomplete data.Design/methodology/approachThe tower crane accidents are collected for prediction model training. Random forest (RF) is used to conduct prediction. When there are missing values in the new inputs, they should be filled in advance. Nevertheless, it is difficult to collect complete data on construction site. Thus, the authors use multiple imputation (MI) method to improve RF. Finally the prediction model is applied to a case study.FindingsThe results show that multiple imputation RF (MIRF) can effectively predict tower crane accident when the data are incomplete. This research provides the importance rank of tower crane safety factors. The critical factors should be focused on site, because the missing data affect the prediction results seriously. Also the value of critical factors influences the safety of tower crane.Practical implicationThis research promotes the application of machine learning methods for accident prediction in actual projects. According to the onsite data, the authors can predict the accident phase of tower crane. The results can be used for tower crane accident prevention.Originality/valuePrevious studies have seldom predicted tower crane accidents, especially the phase of accident. This research uses tower crane data collected on site to predict the phase of the tower crane accident. The incomplete data collection is considered in this research according to the actual situation.

Download Full-text