dummy coding
Recently Published Documents


TOTAL DOCUMENTS: 11 (FIVE YEARS: 0)

H-INDEX: 3 (FIVE YEARS: 0)

PeerJ, 2019, Vol 7, pp. e6339
Author(s): Marvin N. Wright, Inke R. König

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical and machine learning methods, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k−1) − 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it has been shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose a heuristic that orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal throughout the entire RF procedure, speeding up computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding, and simply ignoring the nominal nature of the predictors in several simulation settings and on real data, in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions on all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.
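The ordering trick the abstract describes can be illustrated for the regression case, where sorting categories by their mean response provably yields the same best split as enumerating all 2-partitions. The following is a minimal sketch with made-up data (the category labels and values are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical data: a nominal predictor with k = 3 categories and a
# continuous response (regression setting).
x = np.array(["red", "blue", "red", "green", "blue", "green", "red"])
y = np.array([3.0, 1.0, 4.0, 7.0, 2.0, 8.0, 3.5])

# Order the categories by their mean response. For regression (and
# binary classification), splitting this ordered recoding finds the
# same best split as evaluating all 2^(k-1) - 1 partitions.
cats = np.unique(x)
order = sorted(cats, key=lambda c: y[x == c].mean())
rank = {c: i for i, c in enumerate(order)}
x_ord = np.array([rank[c] for c in x])  # nominal -> ordinal recoding

# Only k - 1 threshold splits remain to be evaluated:
candidate_splits = [(order[:i + 1], order[i + 1:])
                    for i in range(len(order) - 1)]
```

Doing this once a priori, rather than within every split, is the variant the authors recommend as the RF default.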





2016, Vol 21, pp. 36-41
Author(s): Andrew Daly, Thijs Dekker, Stephane Hess


2016
Author(s): Rense Nieuwenhuis

To include nominal and ordinal variables as predictors in regression models, their categories first have to be transformed into so-called 'dummy variables'. Many transformations are available; a popular one is 'dummy coding', in which the estimates represent deviations from a preselected 'reference category'. A way to avoid choosing a reference category is effect coding, where the resulting estimates are deviations from a grand (unweighted) mean. An alternative to effect coding was given by Sweeney and Ulveling in 1972, which provides estimates representing deviations from the sample mean and is especially useful when the data are unbalanced (i.e., the categories hold different numbers of observations). Despite its elegance, this weighted effect coding has been cited only 35 times in the past 40 years, according to Google Scholar (more recent references include Hirschberg and Lye 2001 and Gober and Freeman 2005). Furthermore, it did not become a standard option in statistical packages such as SPSS and R. The aim of this paper is to revive weighted effect coding, illustrated by recent research on the body mass index (BMI), and to provide easy-to-use syntax for SPSS, R, and Stata on http://www.ru.nl/sociology/mt/wec/downloads. For didactical reasons we apply OLS regression models, but it will be shown that weighted effect coding can be used in any generalized linear model.
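The coding scheme the abstract describes can be sketched directly: in weighted effect coding, each non-reference category gets an indicator column, and rows in the reference category get −n_j / n_ref in column j. The intercept then equals the sample mean, and each coefficient is that category's deviation from it. A minimal illustration with made-up, unbalanced data (labels and values are hypothetical, not from the paper):

```python
import numpy as np

# Hypothetical unbalanced data: group labels and an outcome.
groups = np.array(["A", "A", "A", "B", "B", "C"])
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 9.0])

cats = ["A", "B", "C"]
ref = "C"  # reference category (its column is dropped)
counts = {c: int((groups == c).sum()) for c in cats}

# Build the weighted-effect-coded design matrix: one column per
# non-reference category; rows in the reference category get -n_j/n_ref.
X = np.ones((len(y), 1))  # intercept column
for c in cats:
    if c == ref:
        continue
    col = np.where(groups == c, 1.0,
                   np.where(groups == ref,
                            -counts[c] / counts[ref], 0.0))
    X = np.column_stack([X, col])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] is the sample mean of y; beta[1], beta[2] are the deviations
# of the group means of A and B from that sample mean.
```

Unlike unweighted effect coding, the "grand mean" recovered here is the observed sample mean, which is why the scheme suits unbalanced categories.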














