Events per variable (EPV) and the relative performance of different strategies for estimating the out-of-sample validity of logistic regression models

2014
Vol 26 (2)
pp. 796-808
Author(s):
Peter C Austin
Ewout W Steyerberg

We conducted an extensive set of empirical analyses to examine the effect of the number of events per variable (EPV) on the relative performance of three different methods for assessing the predictive accuracy of a logistic regression model: apparent performance in the analysis sample, split-sample validation, and optimism correction using bootstrap methods. Using a single dataset of patients hospitalized with heart failure, we compared the estimates of discriminatory performance from these methods to those for a very large independent validation sample arising from the same population. As anticipated, the apparent performance was optimistically biased, with the degree of optimism diminishing as the number of events per variable increased. Differences between the bootstrap-corrected approach and the use of an independent validation sample were minimal once the number of events per variable was at least 20. Split-sample assessment yielded overly pessimistic and highly uncertain estimates of model performance. Apparent performance estimates had lower mean squared error than split-sample estimates, but the lowest mean squared error was obtained by bootstrap-corrected optimism estimates. For the bias, variance, and mean squared error of the performance estimates, the penalty incurred by split-sample validation was equivalent to reducing the sample size by the proportion of the sample withheld for model validation. In conclusion, split-sample validation is inefficient, and apparent performance is too optimistic for internal validation of regression-based prediction models. Modern validation methods, such as bootstrap-based optimism correction, are preferable. While these findings may be unsurprising to many statisticians, the results of the current study reinforce what should be considered good statistical practice in the development and validation of clinical prediction models.
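A minimal sketch of the bootstrap optimism correction compared in this study, assuming generic predictor and binary-outcome arrays `X` and `y` and using scikit-learn; the 200-replicate setting and variable names are illustrative, not taken from the paper.

```python
# Sketch of Harrell-style bootstrap optimism correction for the c-statistic
# (area under the ROC curve) of a logistic regression model. Assumes numpy
# arrays X (predictors) and y (binary outcome); settings are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])  # optimistic

    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample with replacement
        Xb, yb = X[idx], y[idx]
        if len(np.unique(yb)) < 2:               # skip degenerate resamples
            continue
        mb = LogisticRegression(max_iter=1000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, mb.predict_proba(Xb)[:, 1])  # apparent on bootstrap sample
        auc_orig = roc_auc_score(y, mb.predict_proba(X)[:, 1])    # same model on original sample
        optimism.append(auc_boot - auc_orig)

    return apparent - np.mean(optimism)          # optimism-corrected estimate
```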

2013
Vol 17 (11)
pp. 4713-4728
Author(s):
S. Terzer
L. I. Wassenaar
L. J. Araguás-Araguás
P. K. Aggarwal

Abstract. A regionalized cluster-based water isotope prediction (RCWIP) approach, based on the Global Network of Isotopes in Precipitation (GNIP), was demonstrated for the purpose of predicting point- and large-scale spatio-temporal patterns of the stable isotope composition (δ2H, δ18O) of precipitation around the world. Unlike earlier global-domain, fixed-regressor models, RCWIP predefined 36 climatic cluster domains and tested all model combinations from an array of climatic and spatial regressor variables to obtain the best predictive approach for each cluster domain, as indicated by root-mean-squared error (RMSE) and variogram analysis. Fuzzy membership fractions were thereafter used as weights to seamlessly amalgamate the results of the optimized climatic-zone prediction models into a single predictive mapping product, such as global or regional amount-weighted mean annual, mean monthly, or growing-season δ18O/δ2H in precipitation. Comparative tests revealed that the RCWIP approach outperformed classical global fixed-regression, interpolation-based models more than 67% of the time and clearly improved predictive accuracy and precision. All RCWIP isotope mapping products are available as gridded GeoTIFF files from the IAEA website (www.iaea.org/water) and are intended for use in hydrology, climatology, food authenticity, ecology, and forensics.
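The fuzzy-membership weighting step can be illustrated with a small sketch; the two-cluster setup, regressors and data below are hypothetical placeholders rather than the RCWIP model itself.

```python
# Sketch of fuzzy-membership weighting: cluster-specific regression predictions
# are blended into a single value per site using that site's membership
# fractions. Two clusters and two regressors are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# hypothetical training data per climatic cluster: [latitude, altitude] -> d18O
X1, y1 = rng.normal(size=(50, 2)), rng.normal(-8, 2, 50)
X2, y2 = rng.normal(size=(50, 2)), rng.normal(-12, 3, 50)
models = [LinearRegression().fit(X1, y1), LinearRegression().fit(X2, y2)]

X_new = rng.normal(size=(4, 2))                   # sites to predict
membership = np.array([[0.7, 0.3],                # fuzzy membership fractions,
                       [0.2, 0.8],                # rows sum to 1
                       [0.5, 0.5],
                       [0.9, 0.1]])

per_cluster = np.column_stack([m.predict(X_new) for m in models])
blended = (membership * per_cluster).sum(axis=1)  # weighted amalgamation
print(blended)
```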


2020
Vol 74 (2)
pp. 159-191
Author(s):
Rok Blagus
Jelle J. Goeman

2009
Vol 24 (5)
pp. 1401-1415
Author(s):
Elizabeth E. Ebert
William A. Gallus

Abstract The contiguous rain area (CRA) method for spatial forecast verification is a features-based approach that evaluates the properties of forecast rain systems, namely, their location, size, intensity, and finescale pattern. It is one of many recently developed spatial verification approaches being evaluated as part of the Spatial Forecast Verification Methods Intercomparison Project. To better understand the strengths and weaknesses of the CRA method, it has been tested here on a set of idealized geometric and perturbed forecasts with known errors, as well as nine precipitation forecasts from three high-resolution numerical weather prediction models. The CRA method was able to identify the known errors for the geometric forecasts, but only after a modification was introduced to allow nonoverlapping forecast and observed features to be matched. For the perturbed cases, in which a radar rain field was spatially translated and amplified to simulate forecast errors, the CRA method also reproduced the known errors, except when a high-intensity threshold was used to define the CRA (≥10 mm h−1) and a large translation error was imposed (>200 km). The decomposition of total error into displacement, volume, and pattern components reflected the source of the error almost all of the time when a mean squared error formulation was used, but not necessarily when a correlation-based formulation was used. When applied to real forecasts, the CRA method gave similar results whether the best-fit criterion for matching forecast and observed features was minimization of the mean squared error or maximization of the correlation coefficient. The diagnosed displacement error was somewhat sensitive to the choice of search distance. Of the many diagnostics produced by this method, the errors in the mean and peak rain rate between the forecast and observed features showed the best correspondence with subjective evaluations of the forecasts, while the spatial correlation coefficient (after matching) did not reflect the subjective judgments.
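A toy sketch of the displacement/volume/pattern decomposition under the mean-squared-error formulation; the wrap-around shifting and the small search window are simplifications, and the field values are made up.

```python
# Sketch of a CRA-style error decomposition: the forecast field is shifted to
# best match the observations (minimum MSE), then the total MSE is split into
# displacement, volume and pattern components. np.roll wrap-around and the
# small search window are simplifications for illustration.
import numpy as np

def cra_decompose(fcst, obs, max_shift=5):
    def mse(a, b):
        return np.mean((a - b) ** 2)

    total = mse(fcst, obs)
    best = (0, 0, total)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            m = mse(np.roll(fcst, (dy, dx), axis=(0, 1)), obs)
            if m < best[2]:
                best = (dy, dx, m)
    dy, dx, mse_shifted = best
    shifted = np.roll(fcst, (dy, dx), axis=(0, 1))

    displacement = total - mse_shifted                  # error removed by best shift
    volume = (shifted.mean() - obs.mean()) ** 2         # mean (volume) difference
    pattern = mse_shifted - volume                      # remaining finescale error
    return {"shift": (dy, dx), "total": total,
            "displacement": displacement, "volume": volume, "pattern": pattern}

# toy example: observed blob vs. the same blob translated by 3 grid points
obs = np.zeros((40, 40)); obs[15:20, 15:20] = 10.0
fcst = np.roll(obs, (3, 3), axis=(0, 1))
print(cra_decompose(fcst, obs))
```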


2020
Author(s):
Rafael Massahiro Yassue
José Felipe Gonzaga Sabadin
Giovanni Galli
Filipe Couto Alves
Roberto Fritsche-Neto

Abstract Usually, the comparison among genomic prediction models is based on validation schemes such as repeated random subsampling (RRS) or K-fold cross-validation. Nevertheless, the design of the training and validation sets strongly affects how models are compared and introduces subjectivity into that comparison. The procedures cited above overlap across replicates, which can inflate estimates and violate residual independence due to resampling, potentially yielding less accurate results. Furthermore, post hoc tests such as ANOVA are not recommended because the assumption of residual independence is not fulfilled. Thus, we propose a new way to sample observations for building training and validation sets based on a cross-validation alpha-based design (CV-α). CV-α is designed to create several validation scenarios (replicates x folds), regardless of the number of treatments. Using CV-α, the number of genotypes assigned to the same fold across replicates was much lower than with K-fold, indicating higher residual independence. Therefore, based on the CV-α results, as proof of concept, we could compare the proposed methodology to RRS and K-fold via ANOVA, applying four genomic prediction models to a simulated and a real dataset. Concerning predictive ability and bias, all validation methods showed similar performance. However, regarding the mean squared error and coefficient of variation, the CV-α method presented the best performance under the evaluated scenarios. Moreover, as it has no additional cost or complexity, it is more reliable and allows the use of non-subjective methods to compare models and factors. Therefore, CV-α can be considered a more precise validation methodology for model selection.
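The fold-overlap quantity mentioned above (how often genotypes share a fold across replicates) can be computed with a short sketch; the assignment below uses ordinary repeated K-fold rather than the actual alpha-design allocation, and all counts are illustrative.

```python
# Sketch of a fold co-occurrence diagnostic: for repeated K-fold cross-validation,
# count how often each pair of genotypes lands in the same fold across replicates.
# CV-alpha aims to keep this co-occurrence low; the alpha-design fold assignment
# itself is not reproduced here.
import numpy as np

def pairwise_cooccurrence(fold_labels):
    """fold_labels: (n_replicates, n_genotypes) array of fold indices."""
    reps, n = fold_labels.shape
    co = np.zeros((n, n))
    for r in range(reps):
        same = fold_labels[r][:, None] == fold_labels[r][None, :]
        co += same
    return co - reps * np.eye(n)          # drop self-pairs from the counts

rng = np.random.default_rng(0)
n_geno, k, reps = 100, 5, 10
labels = np.array([rng.permutation(np.repeat(np.arange(k), n_geno // k))
                   for _ in range(reps)])
co = pairwise_cooccurrence(labels)
print("mean times a pair shares a fold:", co[np.triu_indices(n_geno, 1)].mean())
```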


2021
Vol 5 (3)
pp. 439-445
Author(s):
Dwi Marlina
Fatchul Arifin

The number of tourists fluctuates every month, as happens at Kaliadem Merapi, Sleman. The purpose of this research is to develop a system for predicting the number of tourists based on artificial neural networks. This study uses an artificial neural network with the backpropagation algorithm as the data processing method. The work comprises two processes, namely training and testing, with the following stages: (1) collecting input and target data; (2) normalizing input and target data; (3) creating the artificial neural network architecture using the Matlab GUI (Graphical User Interface) facilities; (4) conducting the training and testing processes; (5) normalizing the predicted data; (6) analyzing the predicted data. In the data analysis, the MSE (Mean Squared Error) value is 0.0091528 in the training process and 0.0051424 in the testing process. In addition, the prediction accuracy in the testing process is around 91.32%. Because the resulting MSE values are relatively small and the prediction accuracy is relatively high, this system can be used to predict the number of tourists at Kaliadem Merapi, Sleman.
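A rough sketch of the normalize/train/test workflow; the study used Matlab's neural network GUI, so scikit-learn's MLPRegressor (a backpropagation-trained network) stands in here, and the monthly visitor series is synthetic.

```python
# Sketch of the normalize / train / test workflow described above, with a
# synthetic monthly visitor-count series and MLPRegressor as a stand-in for
# the Matlab backpropagation network.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
visitors = 5000 + 2000 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(0, 300, 60)

# use the previous 12 months as inputs, the next month as the target
X = np.array([visitors[i:i + 12] for i in range(len(visitors) - 12)])
y = visitors[12:]

x_scaler, y_scaler = MinMaxScaler(), MinMaxScaler()
Xn = x_scaler.fit_transform(X)
yn = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

split = 36                                            # training vs. testing rows
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
net.fit(Xn[:split], yn[:split])

print("training MSE:", mean_squared_error(yn[:split], net.predict(Xn[:split])))
print("testing  MSE:", mean_squared_error(yn[split:], net.predict(Xn[split:])))
```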


2017
Vol 57 (2)
pp. 229
Author(s):
Farhad Ghafouri-Kesbi
Ghodratollah Rahimi-Mianji
Mahmood Honarvar
Ardeshir Nejati-Javaremi

Three machine learning algorithms, Random Forests (RF), Boosting and Support Vector Machines (SVM), as well as Genomic Best Linear Unbiased Prediction (GBLUP), were used to predict genomic breeding values (GBV), and their predictive performance was compared across different combinations of heritability (0.1, 0.3, and 0.5), number of quantitative trait loci (QTL) (100, 1000) and distribution of QTL effects (normal, uniform and gamma). To this end, a genome comprising five chromosomes of one Morgan each was simulated, on which 10000 bi-allelic single nucleotide polymorphisms were distributed. Pearson's correlation between the true and predicted GBV and the mean squared error of GBV prediction were used, respectively, as measures of the predictive accuracy and the overall fit achieved with each method. In all methods, prediction accuracy increased with increasing heritability and with a decreasing number of QTL. GBLUP had better predictive accuracy than the machine learning methods, in particular in the scenarios with a higher number of QTL and normal or uniform distributions of QTL effects, although in most cases the differences were non-significant. In the scenarios with a small number of QTL and a gamma distribution of QTL effects, Boosting outperformed the other methods. Regarding the mean squared error of GBV prediction, in most cases Boosting outperformed the other methods, although the estimates were close to those of GBLUP. Among the methods studied, SVM was the most efficient user of memory at 0.6 gigabytes (GB), followed by RF, GBLUP and Boosting with memory requirements of 1.2 GB, 1.3 GB and 2.3 GB, respectively. Regarding computational time, GBLUP, SVM, RF and Boosting ranked first, second, third and last with 10 min, 15 min, 75 min and 600 min, respectively. It was concluded that although stochastic gradient Boosting can predict GBV with high prediction accuracy, its significantly longer computational time and memory requirement can be a serious limitation. Therefore, the use of other variants of Boosting, such as Random Boosting, is recommended for genomic evaluation.
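A toy version of this kind of comparison, with GBLUP represented by ridge regression on the markers (the RR-BLUP equivalent); the marker counts, QTL numbers and hyperparameters are illustrative, not those of the simulation described above.

```python
# Sketch of comparing GBLUP-style ridge regression with RF, Boosting and SVM
# for genomic prediction on a toy simulated SNP dataset. All sizes and
# hyperparameters are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
n, p, n_qtl, h2 = 600, 1000, 100, 0.3
M = rng.binomial(2, 0.5, size=(n, p)).astype(float)      # SNP genotypes (0/1/2)
qtl = rng.choice(p, n_qtl, replace=False)
effects = rng.normal(0, 1, n_qtl)
g = M[:, qtl] @ effects                                   # true breeding values
e = rng.normal(0, np.sqrt(g.var() * (1 - h2) / h2), n)    # noise for target h2
y = g + e

train, test = np.arange(500), np.arange(500, n)
models = {
    "GBLUP (ridge)": Ridge(alpha=p * (1 - h2) / h2),      # heuristic shrinkage
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "Boosting": GradientBoostingRegressor(random_state=0),
    "SVM": SVR(kernel="rbf", C=10.0),
}
for name, m in models.items():
    m.fit(M[train], y[train])
    pred = m.predict(M[test])
    acc = np.corrcoef(pred, g[test])[0, 1]                # accuracy vs. true GBV
    print(f"{name}: accuracy={acc:.2f}, MSE={mean_squared_error(g[test], pred):.2f}")
```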


2006
Vol 3 (1)
pp. 123
Author(s):
You Hoo Tew
Enylina Nordin

This study attempts to construct and test a financial distress prediction model for Malaysian companies. The sample for this study consists of 84 companies listed on Bursa Malaysia that became financially distressed in 2001 and 2002 and a matched (by industry and firm size) sample of 84 financially healthy companies. The model is constructed by employing logistic regression analysis based on pooled data of 5 years prior to financial distress. The model is first derived using the estimation sample and then tested using the validation sample. Adding to the existing research on financial distress prediction models, the current model utilizes measures of shareholders' equity to total liabilities, shareholders' equity to total assets, current liabilities to total assets, total borrowings to total assets and inventory turnover. The results are encouraging, as the model developed for predicting corporate financial distress in Malaysia is reliable up to 5 years prior to financial distress. It is also believed that the prediction model can be useful to different groups of users such as policy makers, financial institutions, creditors, managers, bankers, investors and shareholders.
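A sketch of the modelling strategy described: a logistic regression on the five ratios, fitted on an estimation sample and scored on a validation sample; the data below are random placeholders, not the Bursa Malaysia sample.

```python
# Sketch of fitting a distress-prediction logit on five financial ratios,
# with separate estimation and validation samples. Data are random placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
cols = ["equity_to_liabilities", "equity_to_assets", "current_liab_to_assets",
        "borrowings_to_assets", "inventory_turnover"]
X = pd.DataFrame(rng.normal(size=(168, 5)), columns=cols)
y = rng.integers(0, 2, 168)                    # 1 = financially distressed

est = X.index < 120                            # estimation vs. validation split
model = LogisticRegression(max_iter=1000).fit(X[est], y[est])
print("validation accuracy:", accuracy_score(y[~est], model.predict(X[~est])))
print(dict(zip(cols, model.coef_[0].round(3))))
```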


2021
Vol 2021
pp. 1-9
Author(s):
Humera Batool
Lixin Tian

Infectious diseases like COVID-19 spread rapidly and have led to substantial economic loss worldwide, including in Pakistan. The effect of weather on the spread of COVID-19 needs more detailed examination, as some studies have claimed that weather can mitigate its spread. COVID-19 was declared a pandemic by the WHO and has been reported in about 210 countries and territories worldwide, across regions including Asia, Europe, the USA, and North America. Person-to-person contact and international air travel between nations were the leading causes of the spread of SARS-CoV-2 from its point of origin, besides natural forces. However, further spread and infection within a community or country can be aided by natural elements, such as the weather. Therefore, the correlation between COVID-19 and temperature can be better elucidated in countries like Pakistan, where SARS-CoV-2 has affected at least 0.37 million people. This study collected Pakistan's COVID-19 infection and mortality data for ten months (March–December 2020). Related weather parameters, temperature and humidity, were also obtained for the same period. The collected data were processed and used to compare the performance of various time series prediction models in terms of mean squared error (MSE), root-mean-squared error (RMSE), and mean absolute percentage error (MAPE). Using these time series models, this paper estimates the effect of humidity, temperature, and other weather parameters on COVID-19 transmission by computing the correlation between the weather variables and the total numbers of infected cases and deaths in a particular region. The results show that weather parameters hold more influence in explaining the total number of cases and deaths than other factors such as community, age, and total population. Therefore, temperature and humidity are salient parameters for predicting COVID-19 cases. Moreover, it is concluded that the higher the temperature, the lower the mortality due to COVID-19 infection.
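A short sketch of the evaluation metrics (MSE, RMSE, MAPE) and the weather-case correlation referred to above; the series are synthetic placeholders, not the Pakistan data.

```python
# Sketch of the error metrics used to compare time series models (MSE, RMSE,
# MAPE) plus a weather-case correlation, on synthetic placeholder series.
import numpy as np

def mse(y, yhat):  return np.mean((y - yhat) ** 2)
def rmse(y, yhat): return np.sqrt(mse(y, yhat))
def mape(y, yhat): return np.mean(np.abs((y - yhat) / y)) * 100

rng = np.random.default_rng(0)
temperature = 20 + 10 * np.sin(np.linspace(0, 3, 300)) + rng.normal(0, 1, 300)
cases = 800 - 15 * temperature + rng.normal(0, 40, 300)   # toy inverse relation

print("corr(temperature, cases):", np.corrcoef(temperature, cases)[0, 1].round(2))

forecast = cases + rng.normal(0, 60, 300)                  # stand-in model output
print("MSE:", mse(cases, forecast).round(1),
      "RMSE:", rmse(cases, forecast).round(1),
      "MAPE:", mape(cases, forecast).round(2), "%")
```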


2018
Vol 121 (3)
pp. 285-290
Author(s):
Jami L. Josefson
Michael Nodzenski
Octavious Talbot
Denise M. Scholtens
Patrick Catalano

Abstract Newborn adiposity, a nutritional measure of the maternal–fetal intra-uterine environment, is representative of future metabolic health. An anthropometric model using weight, length and flank skinfold to estimate neonatal fat mass has been used in numerous epidemiological studies. Air displacement plethysmography (ADP), a non-invasive technology for measuring body composition, is impractical for large epidemiological studies. The study objective was to determine the consistency of the original anthropometric fat mass estimation equation with ADP. Full-term neonates were studied at 12–72 h of life with weight, length, head circumference, flank skinfold thickness and ADP measurements. Statistical analyses evaluated three models for predicting neonatal fat mass. Lin's concordance correlation coefficient, mean prediction error and root mean squared error between the predicted and observed ADP fat mass values were used to evaluate the models, with ADP considered the gold standard. A multi-ethnic cohort of 468 neonates was studied. Models (M) for predicting fat mass were developed using 349 neonates from site 1 and then independently evaluated in 119 neonates from site 2. M0 was the original anthropometric model, M1 used the same variables as M0 but with updated parameters, and M2 additionally included head circumference. In the independent validation cohort, Lin's concordance correlation estimates demonstrated reasonable accuracy (model 0: 0·843, 1: 0·732, 2: 0·747). Mean prediction error and root mean squared error in the independent validation were much smaller for M0 than for M1 and M2. The original anthropometric model for estimating neonatal fat mass is reasonable for predicting ADP fat mass, thus we advocate its continued use in epidemiological studies.
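Lin's concordance correlation coefficient used to evaluate the models can be computed directly from paired measurements; the values below are hypothetical, with ADP treated as the gold standard.

```python
# Sketch of Lin's concordance correlation coefficient between predicted fat
# mass and the ADP measurement (gold standard). Paired values are made-up.
import numpy as np

def lins_ccc(x, y):
    """Concordance between paired measurements x (predicted) and y (observed)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

predicted = [0.42, 0.35, 0.51, 0.29, 0.47, 0.38]   # kg, hypothetical
observed  = [0.40, 0.33, 0.55, 0.30, 0.45, 0.41]   # kg, ADP "gold standard"
print(round(lins_ccc(predicted, observed), 3))
```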


Author(s):
Mohammad Hossein Ahmadi
Alireza Baghban
Ely Salwana
Milad Sadeghzadeh
Mohammad Zamen
...

Solar energy is a renewable energy resource that is broadly utilized and has the lowest pollution impact among the available alternatives to fossil fuels. In this investigation, the machine learning approaches of neural networks (NN), neuro-fuzzy modelling and the least squares support vector machine (LSSVM) are used to build models for predicting the thermal performance of a photovoltaic-thermal solar collector (PV/T), estimating its efficiency as the model output, while inlet temperature, flow rate, heat, solar radiation, and heat of sun are the model inputs. Experimental measurements were prepared by designing a solar collector system, from which 100 data points were extracted. Different analyses were also performed to examine the credibility of the introduced approaches, revealing great performance. The suggested LSSVM model showed the best performance, with a mean squared error (MSE) of 0.003 and a correlation coefficient (R2) value of 0.99.
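A minimal LSSVM regression sketch with an RBF kernel, of the general kind used above to map collector inputs to efficiency; the data, gamma and sigma values are illustrative.

```python
# Minimal sketch of least squares support vector machine (LSSVM) regression
# with an RBF kernel: solve the LSSVM linear system for the bias b and dual
# coefficients alpha, then predict with the kernel expansion. Data are synthetic.
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(y)
    K = rbf(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma           # regularized kernel block
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    b, alpha = sol[0], sol[1:]
    return lambda Xq: rbf(Xq, X, sigma) @ alpha + b

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 4))                 # stand-ins for inlet temp, flow, etc.
y = np.sin(X @ np.array([3, 2, 1, 0.5])) + rng.normal(0, 0.05, 100)

predict = lssvm_fit(X[:80], y[:80])
resid = y[80:] - predict(X[80:])
mse = np.mean(resid ** 2)
r2 = 1 - mse / np.var(y[80:])
print(f"MSE={mse:.4f}, R2={r2:.3f}")
```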

