Comparison of some correlation measures for continuous and categorical data

2019 ◽  
Vol 56 (2) ◽  
pp. 253-261
Author(s):  
Ewa Skotarczak ◽  
Anita Dobek ◽  
Krzysztof Moliński

Summary: In the literature a wide collection of correlation and association coefficients can be found for different structures of data. Generally, some of the correlation coefficients are conventionally used for continuous data and others for categorical or ordinal observations. The aim of this paper is to verify the performance of various approaches to correlation coefficient estimation for several types of observations. Both simulated and real data were analysed. For continuous variables, Pearson’s r² and MIC were determined, whereas for categorized data three approaches were compared: Cramér’s V, Joe’s estimator, and the regression-based estimator. Two methods of discretization for continuous data were used. The following conclusions were drawn: the regression-based approach yielded the best results for data with the highest assumed r² coefficient, whereas Joe’s estimator was the better approximation of the true correlation when the assumed r² was small; and the MIC estimator detected the maximal level of dependency for data having a quadratic relation. Moreover, the discretization method applied to data with a non-linear dependency can cause a loss of dependency information. The calculations were supported by the R packages arules and minerva.
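The comparison described above can be reproduced in outline with standard tooling. The sketch below is not the authors' code (they worked in R with arules and minerva); it assumes Python with numpy, pandas and scipy, simulates one linearly related pair of variables, and computes Pearson's r² on the continuous values and Cramér's V after an equal-frequency discretisation. MIC is omitted, as it needs a dedicated package (e.g. minepy, the Python counterpart of minerva).

```python
# Minimal sketch (not the authors' code): one correlation measure for the
# continuous data and one for an equal-frequency discretisation of it.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.6, size=1000)   # linear dependence

# Pearson r^2 on the continuous observations
r, _ = pearsonr(x, y)
print("Pearson r^2:", r**2)

# Equal-frequency discretisation into 4 categories (cf. arules::discretize)
xc = pd.qcut(x, 4, labels=False)
yc = pd.qcut(y, 4, labels=False)

# Cramér's V from the contingency table of the categorised variables
table = pd.crosstab(xc, yc)
chi2, _, _, _ = chi2_contingency(table)
n = table.to_numpy().sum()
k = min(table.shape) - 1
print("Cramér's V:", np.sqrt(chi2 / (n * k)))
```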

2015 ◽  
Vol 26 (6) ◽  
pp. 2586-2602 ◽  
Author(s):  
Irantzu Barrio ◽  
Inmaculada Arostegui ◽  
María-Xosé Rodríguez-Álvarez ◽  
José-María Quintana

When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, due to the loss of information and power, it is a common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method could be a relevant issue when developing prediction models. Without recommending categorisation of continuous predictors, our aim is to propose a valid way to do it whenever it is considered necessary by clinical researchers. This paper focuses on categorising a continuous predictor within a logistic regression model in such a way that the best discriminative ability is obtained, in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated in settings where the optimal location of the cut points is known in theory or in practice. In addition, the proposed method is applied to a real dataset of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study, where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in both a univariable and a multivariable setting.
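As a rough illustration of the idea (not the published algorithm, which also handles multiple cut points and their validation), the sketch below searches a grid of candidate cut points for a single continuous predictor and keeps the one whose dichotomised version yields the highest AUC in a univariable logistic regression. The variable names (pco2, severe) and the simulated data are purely illustrative.

```python
# Hedged sketch: pick the cut point that maximises the AUC of a logistic
# regression fitted on the dichotomised predictor.  Data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pco2 = rng.normal(45, 10, size=500)                            # continuous predictor
severe = rng.binomial(1, 1 / (1 + np.exp(-(pco2 - 50) / 5)))   # binary outcome

def auc_for_cut(cut):
    x = (pco2 >= cut).astype(float).reshape(-1, 1)             # dichotomised predictor
    model = LogisticRegression().fit(x, severe)
    return roc_auc_score(severe, model.predict_proba(x)[:, 1])

candidates = np.percentile(pco2, np.arange(5, 96))             # candidate cut points
best = max(candidates, key=auc_for_cut)
print("best cut point:", round(best, 1), "AUC:", round(auc_for_cut(best), 3))
```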


2021 ◽  
Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

Environmental data often present missing values or a lack of information that makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for an Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020 and present two issues:

- In the Guadarranque River, the dam level variable has no data from May to August 2020, probably because of sensor damage.
- No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.

In order to avoid removing the dam variable from the entire model (or removing those missing months), or even rejecting the modelling of one river system, this abstract aims to provide modelling solutions based on Bayesian networks (BNs) that overcome this limitation.

Guadarranque River. Missing values.

The dataset contains 75687 observations for 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt using the complete dataset (until September 2019) with the aim of predicting the dam level variable as accurately as possible. A scenario was run with data from October 2019 to March 2020, comparing the prediction made for the target variable with the real data. Results show that both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.

In addition, a BN based on expert structural learning was learnt with the real data and with both datasets containing values imputed by NB and TAN. Results show that the models learnt with imputed data (NB: 3.33; TAN: 3.07) improve the error rate obtained with the real data (4.26).

Guadiaro River. Lack of information.

The dataset contains 73636 observations with 14 continuous variables. Since the rainfall variables present a high percentage of zero values (over 94%), they were discretised by the equal-frequency method with 4 intervals. The aim is to predict flooding risk in the coastal area, but no data are collected from this area. Thus, an unsupervised classification based on hybrid BNs was performed. Here, the target variable classifies all observations into a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Results show a total of 3 groups:

- Group 0, “Normal situation”: rainfall values equal to 0 and a very low mean river level.
- Group 1, “Storm situation”: mean rainfall values are over 0.3 mm and all river level variables double their mean with respect to group 0.
- Group 2, “Extreme situation”: both rainfall and river level mean values are the highest, far from both previous groups.

Even though validation shows that this methodology is able to identify extreme events, further work is needed. In this sense, data from the autumn-winter season (from October 2020 to March 2021) will be used. By including this new information it would be possible to check whether the latest extreme events (the flooding event during December and the Filomena storm during January) are identified.
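The unsupervised classification step for the Guadiaro dataset relies on hybrid Bayesian networks, which are not reproduced here; as a loose stand-in, the example below discretises a simulated rainfall variable by equal frequency and groups observations into three clusters with a Gaussian mixture, which likewise yields a membership probability per observation. All column names and data are hypothetical.

```python
# Illustrative sketch only: the study uses hybrid Bayesian networks; a Gaussian
# mixture stands in for the unsupervised classification step.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "rain":  np.where(rng.random(5000) < 0.94, 0.0, rng.exponential(2, 5000)),
    "level": rng.gamma(2.0, 0.5, 5000),
})

# Equal-frequency discretisation of rainfall into 4 intervals, as in the abstract;
# the many tied zeros collapse some bins, so duplicate edges must be dropped.
df["rain_cat"] = pd.qcut(df["rain"], 4, labels=False, duplicates="drop")

# Unsupervised classification into 3 groups with soft membership probabilities
gm = GaussianMixture(n_components=3, random_state=0).fit(df[["rain", "level"]])
df["group"] = gm.predict(df[["rain", "level"]])
print(df.groupby("group")[["rain", "level"]].mean())   # group profiles
```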


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but since the values of categorical data are unordered, these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. Further, the data object having the largest support is chosen as an initial center, followed by finding the other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
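A minimal reading of the abstract can be turned into code as follows; this is an illustrative sketch, not the authors' implementation, and it assumes plain value frequency as the support measure and Hamming distance between categorical rows, picking each further center as the row farthest from those already chosen.

```python
# Sketch of support-based initial center selection for categorical data.
import numpy as np
import pandas as pd

def initial_centers(df: pd.DataFrame, k: int) -> list:
    # support of every unique value in every attribute
    supports = {col: df[col].value_counts() for col in df.columns}
    # integrate the value supports along each row
    row_support = sum(df[col].map(supports[col]) for col in df.columns)

    centers = [row_support.idxmax()]                    # most "supported" row first
    while len(centers) < k:
        # Hamming distance of every row to its nearest already-chosen center
        dist = np.min(
            [(df != df.loc[c]).sum(axis=1).to_numpy() for c in centers], axis=0
        )
        centers.append(df.index[int(dist.argmax())])
    return centers

toy = pd.DataFrame({"colour": list("rrgbb"), "size": list("ssmml")})
print(initial_centers(toy, 2))
```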


Entropy ◽  
2018 ◽  
Vol 21 (1) ◽  
pp. 22 ◽  
Author(s):  
Jordi Belda ◽  
Luis Vergara ◽  
Gonzalo Safont ◽  
Addisson Salazar

Conventional partial correlation coefficients (PCC) were extended to the non-Gaussian case, in particular to independent component analysis (ICA) models of the observed multivariate samples. Thus, the usual methods that define the pairwise connections of a graph from the precision matrix were correspondingly extended. The basic concept involved replacing the implicit linear estimation of conventional PCC with a nonlinear estimation (the conditional mean) assuming ICA. In this way, the correlation between a given pair of nodes induced by the rest of the nodes is better removed, and hence the specific connectivity weights can be better estimated. Some synthetic and real data examples illustrate the approach in a graph signal processing context.
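For reference, the conventional Gaussian baseline that the paper extends is easy to state: partial correlations are read off the precision matrix P as rho_ij = -P_ij / sqrt(P_ii P_jj). The sketch below shows only this baseline on simulated data, not the ICA-based nonlinear extension.

```python
# Baseline only: conventional partial correlations from the precision matrix,
# i.e. the Gaussian case that the paper generalises to ICA models.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(2000, 4))
x[:, 1] += 0.7 * x[:, 0]          # induce some direct dependencies
x[:, 2] += 0.5 * x[:, 1]

precision = np.linalg.inv(np.cov(x, rowvar=False))
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 2))   # edge weights for a conditional-dependence graph
```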


2015 ◽  
Vol 23 (4) ◽  
pp. 550-563 ◽  
Author(s):  
Daniel L. Oberski ◽  
Jeroen K. Vermunt ◽  
Guy B. D. Moors

Many variables crucial to the social sciences are not directly observed but instead are latent and measured indirectly. When an external variable of interest affects this measurement, estimates of its relationship with the latent variable will then be biased. Such violations of “measurement invariance” may, for example, confound true differences across countries in postmaterialism with measurement differences. To deal with this problem, researchers commonly aim at “partial measurement invariance”, that is, they account for those differences that may be present and important. To evaluate this importance directly through sensitivity analysis, the “EPC-interest” was recently introduced for continuous data. However, latent variable models in the social sciences often use categorical data. The current paper therefore extends the EPC-interest to latent variable models for categorical data and demonstrates its use in example analyses of U.S. Senate votes as well as respondent rankings of postmaterialism values in the World Values Study.


1977 ◽  
Vol 2 (3) ◽  
pp. 171-186 ◽  
Author(s):  
Thomas R. Knapp

This paper summarizes the interrelationships among within-aggregate, between-aggregate, and total-group correlation coefficients, with artificial and “real-data” examples. It also discusses the relevance of correlation analyses at various levels of aggregation and some of the difficulties encountered in cross-level inference.
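The three levels of analysis can be made concrete with a small simulation. The sketch below (illustrative only, with made-up data) computes the total correlation, the between-aggregate correlation of group means, and the pooled within-aggregate correlation of deviations from group means, on data where the two levels deliberately disagree in sign, which is exactly the situation that makes cross-level inference hazardous.

```python
# Total, between-aggregate, and within-aggregate correlations on simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
group = np.repeat(np.arange(20), 30)               # 20 aggregates of 30 cases each
gx = rng.normal(size=20)[group]                    # group-level component of x
x = gx + rng.normal(scale=0.5, size=600)
y = -0.8 * gx + 0.4 * (x - gx) + rng.normal(scale=0.5, size=600)  # opposite signs
df = pd.DataFrame({"group": group, "x": x, "y": y})

total = df["x"].corr(df["y"])                                       # total-group
means = df.groupby("group")[["x", "y"]].mean()
between = means["x"].corr(means["y"])                               # between-aggregate
dev = df[["x", "y"]] - df.groupby("group")[["x", "y"]].transform("mean")
within = dev["x"].corr(dev["y"])                                    # within-aggregate
print(f"total={total:.2f}  between={between:.2f}  within={within:.2f}")
```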


2020 ◽  
Vol 2 (2) ◽  
pp. 1-28
Author(s):  
Tao Li ◽  
Cheng Meng

Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods have been developed for more effective coefficient estimation and model prediction. This review presents some cutting-edge subsampling methods based on large-scale least squares estimation. Two major families of subsampling methods are introduced: the randomized subsampling approach and the optimal subsampling approach. The former aims to develop a more effective data-dependent sampling probability, while the latter aims to select a deterministic subsample in accordance with certain optimality criteria. Real data examples are provided to compare these methods empirically with respect to both estimation accuracy and computing time.
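As one concrete instance of the randomized family, the sketch below samples rows with probability proportional to their statistical leverage scores, reweights them, and solves the reduced least squares problem; this is a generic illustration on simulated data, not a method singled out by the review.

```python
# Leverage-score subsampling for large-scale least squares (illustrative sketch).
import numpy as np

rng = np.random.default_rng(11)
n, p, m = 100_000, 10, 2_000                  # full size and subsample size
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

Q, _ = np.linalg.qr(X)                        # thin QR: leverage_i = ||Q_i||^2
lev = np.sum(Q**2, axis=1)
prob = lev / lev.sum()

idx = rng.choice(n, size=m, replace=True, p=prob)
w = 1.0 / np.sqrt(m * prob[idx])              # importance-sampling reweighting
beta_sub, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("max abs difference from full-data estimate:", np.abs(beta_sub - beta_full).max())
```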


Author(s):  
О. V. Matsyura ◽  
М. V. Matsyura ◽  
А. А. Zimaroyeva

For the analysis of long-term observation data on the dynamics of bird populations, stochastic-process methods are among the most suitable. The abundance (density) of birds is calculated over the integrated area of the studied habitats. Using autocorrelation, a correlogram of the changes in bird numbers over the study period is drawn for the whole area. After that, the autocorrelation and partial autocorrelation coefficients are calculated. The most appropriate model is the autoregressive integrated moving average (ARIMA) model. The ecological significance of the autoregressive parameters is that they reflect the periodicity of changes in bird numbers in the seasonal and long-term aspects. The moving average is one of the simplest methods for smoothing out random fluctuations of the empirical regression line. Validation of the model can be conducted on a truncated data series (10 years). The forecast is calculated for the next two years and compared with the empirical data. The correlation between the real data and the forecast is assessed using the non-parametric Spearman correlation coefficient. The residual series of the selected models are assessed by residual correlograms. The constructed model can be used to analyse and forecast the number of birds in breeding biotopes.

Keywords: analysis, density, indirect methods, birds, Simply Tagging.
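The described workflow can be sketched with simulated monthly counts; the example below (illustrative only, using statsmodels and scipy rather than the authors' software) fits a seasonal ARIMA model to a truncated series, forecasts the final two years, and compares the forecast with the held-out observations via the Spearman coefficient.

```python
# Hedged sketch of the workflow on simulated monthly bird counts.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
months = pd.date_range("2005-01", periods=144, freq="MS")      # 12 years of data
season = 20 + 10 * np.sin(2 * np.pi * months.month / 12)
counts = pd.Series(season + rng.normal(scale=3, size=144), index=months)

train, test = counts[:-24], counts[-24:]                       # hold out 2 years
model = ARIMA(train, order=(1, 0, 1),
              seasonal_order=(1, 0, 1, 12)).fit()
forecast = model.forecast(steps=24)

rho, p = spearmanr(forecast, test)                             # forecast vs empirical
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```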


2021 ◽  
Author(s):  
Yoichiro Ogino ◽  
Natsue Fujikawa ◽  
Sayuri Koga ◽  
Ryoji Moroi ◽  
Kiyoshi Koyano

Abstract
Purpose: To investigate the profiles of swallowing and tongue functions, and to identify factors influencing swallowing in maxillectomy patients.
Methods: Maxillectomy patients whose swallowing function, defined by the Eating Assessment Tool (EAT-10) score, and tongue functions (oral diadochokinesis: ODK; maximum tongue pressure: MTP) had been evaluated with or without maxillofacial prostheses were enrolled in this study. The effects of a history of radiotherapy or of soft palate resection on swallowing function were evaluated. The effect of radiotherapy on oral dryness was also evaluated. To examine correlations of swallowing function with continuous variables, Spearman correlation coefficients were calculated.
Results: A total of 47 maxillectomy patients (23 males and 24 females, median age: 71 [IQR: 63–76]) were registered. The median EAT-10 score was 3 [IQR: 0–14]. Patients with a history of radiotherapy, but not those with soft palate resection, showed significantly poorer swallowing function. The ODK and MTP of patients wearing maxillofacial prostheses were significantly improved. No significant effect of radiotherapy on oral dryness was found. A significant correlation was found between the EAT-10 score and MTP (P = 0.04).
Conclusions: Swallowing function in maxillectomy patients was relatively impaired, and patients with a history of radiotherapy showed poorer swallowing function. Maxillofacial prostheses could contribute to the improvement of MTP and ODK (/ta/). MTP may play a crucial role in swallowing in maxillectomy patients.

