dummy coding
Recently Published Documents


TOTAL DOCUMENTS: 11 (FIVE YEARS: 0)

H-INDEX: 3 (FIVE YEARS: 0)

PeerJ, 2019, Vol 7, pp. e6339
Author(s): Marvin N. Wright, Inke R. König

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical and machine learning methods, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k−1) − 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it has been shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose a heuristic that orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal throughout the entire RF procedure, speeding up computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding, and simply ignoring the nominal nature of the predictors in several simulation settings and on real data, in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions on all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.
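The ordering trick the abstract describes can be illustrated for the regression case, where sorting categories by their mean response provably yields the same best split as enumerating all 2-partitions. The following is a minimal sketch with made-up data (the category labels and values are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical data: a nominal predictor with k = 3 categories and a
# continuous response (regression setting).
x = np.array(["red", "blue", "red", "green", "blue", "green", "red"])
y = np.array([3.0, 1.0, 4.0, 7.0, 2.0, 8.0, 3.5])

# Order the categories by their mean response. For regression (and
# binary classification), splitting this ordered recoding finds the
# same best split as evaluating all 2^(k-1) - 1 partitions.
cats = np.unique(x)
order = sorted(cats, key=lambda c: y[x == c].mean())
rank = {c: i for i, c in enumerate(order)}
x_ord = np.array([rank[c] for c in x])  # nominal -> ordinal recoding

# Only k - 1 threshold splits remain to be evaluated:
candidate_splits = [(order[:i + 1], order[i + 1:])
                    for i in range(len(order) - 1)]
```

Doing this once a priori, rather than within every split, is the variant the authors recommend as the RF default.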





2016, Vol 21, pp. 36-41
Author(s): Andrew Daly, Thijs Dekker, Stephane Hess


2016
Author(s): Rense Nieuwenhuis

To include nominal and ordinal variables as predictors in regression models, their categories first have to be transformed into so-called 'dummy variables'. Many transformations are available; a popular one is 'dummy coding', in which the estimates represent deviations from a preselected 'reference category'. A way to avoid choosing a reference category is effect coding, where the resulting estimates are deviations from a grand (unweighted) mean. An alternative to effect coding was given by Sweeney and Ulveling in 1972, which provides estimates representing deviations from the sample mean and is especially useful when the data are unbalanced (i.e., the categories hold different numbers of observations). Despite its elegance, this weighted effect coding has been cited only 35 times in the past 40 years, according to Google Scholar (more recent references include Hirschberg and Lye 2001 and Gober and Freeman 2005). Furthermore, it did not become a standard option in statistical packages such as SPSS and R. The aim of this paper is to revive weighted effect coding, illustrated by recent research on the body mass index (BMI), and to provide easy-to-use syntax for SPSS, R, and Stata on http://www.ru.nl/sociology/mt/wec/downloads. For didactical reasons we apply OLS regression models, but it will be shown that weighted effect coding can be used in any generalized linear model.
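The coding scheme the abstract describes can be sketched directly: in weighted effect coding, each non-reference category gets an indicator column, and rows in the reference category get −n_j / n_ref in column j. The intercept then equals the sample mean, and each coefficient is that category's deviation from it. A minimal illustration with made-up, unbalanced data (labels and values are hypothetical, not from the paper):

```python
import numpy as np

# Hypothetical unbalanced data: group labels and an outcome.
groups = np.array(["A", "A", "A", "B", "B", "C"])
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 9.0])

cats = ["A", "B", "C"]
ref = "C"  # reference category (its column is dropped)
counts = {c: int((groups == c).sum()) for c in cats}

# Build the weighted-effect-coded design matrix: one column per
# non-reference category; rows in the reference category get -n_j/n_ref.
X = np.ones((len(y), 1))  # intercept column
for c in cats:
    if c == ref:
        continue
    col = np.where(groups == c, 1.0,
                   np.where(groups == ref,
                            -counts[c] / counts[ref], 0.0))
    X = np.column_stack([X, col])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] is the sample mean of y; beta[1], beta[2] are the deviations
# of the group means of A and B from that sample mean.
```

Unlike unweighted effect coding, the "grand mean" recovered here is the observed sample mean, which is why the scheme suits unbalanced categories.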














