Does the choice of nucleotide substitution models matter topologically?

2016 ◽  
Author(s):  
Michael Hoff ◽  
Stefan Peter Orf ◽  
Benedikt Johannes Riehm ◽  
Diego Darriba ◽  
Alexandros Stamatakis

Background: In the context of a master's-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed, and make available, open-source code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike, corrected Akaike, and Bayesian information criteria. We address the question whether model selection matters topologically, that is, whether conducting ML inferences under the optimal model, instead of a standard General Time Reversible (GTR) model, yields different tree topologies. We also assess to what degree the models selected and the trees inferred under the three standard criteria (AIC, AICc, BIC) differ. Finally, we assess whether the definition of the sample size (#sites versus #sites x #taxa) yields different models and, as a consequence, different tree topologies. Results: We find that all three factors (in order of impact: nucleotide model selection, information criterion used, sample size definition) can yield substantially different final tree topologies (topological difference exceeding 10%) for approximately 5% of the tree inferences conducted on the 39 empirical datasets used in our study. Conclusions: Using the best-fit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.
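For readers unfamiliar with how the sample-size definition enters the criteria, the following minimal sketch (not the authors' tool; the log-likelihood and parameter count are invented) scores one hypothetical fit under AICc and BIC with both definitions of n. AIC itself is unaffected, because its penalty does not involve n.

```python
import math

def aic(lnL, k):
    """Akaike information criterion for a fitted model."""
    return 2 * k - 2 * lnL

def aicc(lnL, k, n):
    """Corrected AIC; unlike AIC, the extra penalty depends on the sample size n."""
    return aic(lnL, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnL, k, n):
    """Bayesian information criterion; also depends on n."""
    return k * math.log(n) - 2 * lnL

# Hypothetical ML fit: 1,500 alignment sites, 40 taxa, and a GTR-like model with
# 8 free rate/frequency parameters (branch lengths ignored here for brevity).
lnL, k, sites, taxa = -25432.7, 8, 1500, 40

for label, n in (("n = #sites", sites), ("n = #sites x #taxa", sites * taxa)):
    print(f"{label:20s}  AICc = {aicc(lnL, k, n):10.1f}  BIC = {bic(lnL, k, n):10.1f}")
```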

2015 ◽  
Author(s):  
John J. Dziak ◽  
Donna L. Coffman ◽  
Stephanie T. Lanza ◽  
Runze Li

Choosing a model with too few parameters can involve making unrealistically simple assumptions and lead to high bias, poor prediction, and missed opportunities for insight. Such models are not flexible enough to describe the sample or the population well. A model with too many parameters can fit the observed data very well, but be too closely tailored to it. Such models may generalize poorly. Penalized-likelihood information criteria, such as Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Consistent AIC, and the Adjusted BIC, are widely used for model selection. However, different criteria sometimes support different models, leading to uncertainty about which criterion is the most trustworthy. In some simple cases the comparison of two models using information criteria can be viewed as equivalent to a likelihood ratio test, with the different criteria representing different alpha levels (i.e., different emphases on sensitivity or specificity; Lin & Dayton, 1997). This perspective may lead to insights about how to interpret the criteria in less simple situations. For example, AIC or BIC could be preferable, depending on sample size and on the relative importance one assigns to sensitivity versus specificity. Understanding the differences among the criteria may make it easier to compare their results and to use them to make informed decisions.
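As a concrete illustration of the criteria discussed here, the sketch below (with invented log-likelihoods and sample size) computes AIC, BIC, CAIC, and the sample-size adjusted BIC for two nested models, and notes the likelihood-ratio-test reading of the AIC comparison. The adjusted-BIC penalty with (n + 2)/24 is the usual Sclove-style version and is an assumption made here, not taken from the article.

```python
import math

def criteria(lnL, k, n):
    """Common penalized-likelihood criteria for a fitted model.
    lnL: maximized log-likelihood, k: number of free parameters, n: sample size."""
    return {
        "AIC":  -2 * lnL + 2 * k,
        "BIC":  -2 * lnL + k * math.log(n),
        "CAIC": -2 * lnL + k * (math.log(n) + 1),
        # Sample-size adjusted BIC: replaces n with (n + 2) / 24 in the penalty.
        "aBIC": -2 * lnL + k * math.log((n + 2) / 24),
    }

# Two nested models fitted to the same hypothetical data set of n = 300 cases.
n = 300
small = criteria(lnL=-1250.0, k=4, n=n)
big = criteria(lnL=-1245.5, k=6, n=n)

# Preferring the larger model whenever its AIC is lower is equivalent to a likelihood
# ratio test that rejects when 2 * (lnL_big - lnL_small) exceeds 2 per extra parameter;
# BIC raises that per-parameter threshold to log(n), hence its greater "specificity".
for name in small:
    print(name, "prefers", "big" if big[name] < small[name] else "small")
```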


Economies ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 49 ◽  
Author(s):  
Waqar Badshah ◽  
Mehmet Bulut

The Bounds test of cointegration relies only on unstructured, single-path model selection techniques, i.e., information criteria. The aim of this paper was twofold: first, to evaluate the performance of five routinely used information criteria {Akaike Information Criterion (AIC), Akaike Information Criterion Corrected (AICC), Schwarz/Bayesian Information Criterion (SIC/BIC), Schwarz/Bayesian Information Criterion Corrected (SICC/BICC), and Hannan and Quinn Information Criterion (HQC)} and three structured approaches (Forward Selection, Backward Elimination, and Stepwise) by assessing their size and power properties at different sample sizes in Monte Carlo simulations; and second, to make the same assessment on real economic data. The second aim was achieved by evaluating the long-run relationship between three pairs of macroeconomic variables, i.e., Energy Consumption and GDP, Oil Price and GDP, and Broad Money and GDP, for the BRICS (Brazil, Russia, India, China and South Africa) countries using the Bounds cointegration test. It was found that the information criteria and the structured procedures have the same power for sample sizes of 50 or greater; however, BICC and Stepwise are better at small sample sizes. In the light of the simulation and real-data results, a modified Bounds test with a Stepwise model selection procedure may be used, as it is strongly theoretically supported and avoids noise in the model selection process.
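To make the contrast between scoring candidate models with a single criterion and running a structured search concrete, here is a minimal forward-selection-by-BIC sketch (toy data and variable names are hypothetical; this is not the authors' modified Bounds test, which applies the idea to lag selection in an ARDL/Bounds setting).

```python
import numpy as np
import statsmodels.api as sm

def forward_select_bic(y, candidates):
    """Greedy forward selection: add the regressor that lowers BIC most; stop when none does.
    candidates: dict mapping a name to a 1-D regressor array. Returns the chosen names."""
    chosen = []
    current_bic = sm.OLS(y, np.ones_like(y)).fit().bic  # intercept-only baseline
    improved = True
    while improved:
        improved = False
        for name in [c for c in candidates if c not in chosen]:
            X = sm.add_constant(np.column_stack([candidates[c] for c in chosen + [name]]))
            trial_bic = sm.OLS(y, X).fit().bic
            if trial_bic < current_bic:
                current_bic, best, improved = trial_bic, name, True
        if improved:
            chosen.append(best)
    return chosen

# Toy data: GDP growth explained by (hypothetical) energy-consumption regressors.
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
gdp = 0.6 * energy + rng.normal(scale=0.5, size=200)
lags = {"energy_l0": energy,
        "energy_l1": np.roll(energy, 1),   # toy "lag"; wrap-around ignored for brevity
        "noise": rng.normal(size=200)}
print(forward_select_bic(gdp, lags))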


2021 ◽  
Vol 20 (3) ◽  
pp. 450-461
Author(s):  
Stanley L. Sclove

The use of information criteria, especially AIC (Akaike's information criterion) and BIC (Bayesian information criterion), for choosing an adequate number of principal components is illustrated.
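A rough sketch of the idea (not Sclove's exact formulation): fit probabilistic PCA with an increasing number of components and score each fit with AIC and BIC. The parameter count below follows the Tipping–Bishop probabilistic PCA model and is an assumption made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def ppca_criteria(X, q):
    """AIC and BIC for a probabilistic PCA fit with q components (rough sketch).
    Assumed free parameters: loadings + mean + noise variance, minus rotational redundancy."""
    n, p = X.shape
    lnL = PCA(n_components=q).fit(X).score(X) * n      # score() = mean log-likelihood per sample
    k = p * q - q * (q - 1) / 2 + p + 1                # assumed parameter count
    return 2 * k - 2 * lnL, k * np.log(n) - 2 * lnL    # (AIC, BIC)

# Simulated data with three underlying components plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
for q in range(1, 7):
    aic, bic = ppca_criteria(X, q)
    print(q, round(aic, 1), round(bic, 1))
```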


2019 ◽  
Vol 37 (2) ◽  
pp. 549-562 ◽  
Author(s):  
Edward Susko ◽  
Andrew J Roger

The information criteria Akaike information criterion (AIC), AICc, and Bayesian information criterion (BIC) are widely used for model selection in phylogenetics; however, their theoretical justification and performance have not been carefully examined in this setting. Here, we investigate these methods under simple and complex phylogenetic models. We show that AIC can give a biased estimate of its intended target, the expected predictive log-likelihood (EPLnL) or, equivalently, the expected Kullback–Leibler divergence between the estimated model and the true distribution of the data. Reasons for bias include commonly occurring issues such as small edge lengths or, in mixture models, small weights. The use of partitioned models is another issue that can cause problems for information criteria. We show that for partitioned models a different BIC correction is required for it to be a valid approximation to a Bayes factor. The commonly used AICc correction is not clearly defined in partitioned models and can actually create a substantial bias when the number of parameters gets large, as is the case with larger trees and partitioned models. Bias-corrected cross-validation is shown to provide better approximations to EPLnL than AIC. We also illustrate how EPLnL, the estimation target of AIC, can sometimes favor an incorrect model, and give reasons why selection of incorrectly under-partitioned models might be desirable in partitioned-model settings.
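The AICc point is easy to see numerically: its extra penalty term 2k(k + 1)/(n − k − 1) is negligible when n is much larger than k but grows without bound as the parameter count of a large tree or partitioned model approaches the number of sites. A tiny illustration with invented numbers:

```python
def aicc_extra_penalty(k, n):
    """Extra term that AICc adds on top of AIC's 2k penalty."""
    return 2 * k * (k + 1) / (n - k - 1)

n = 1000  # e.g., number of alignment sites (illustrative)
for k in (10, 100, 500, 900, 990):
    print(f"k={k:4d}  extra penalty = {aicc_extra_penalty(k, n):12.1f}")
```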


Mathematics ◽  
2021 ◽  
Vol 9 (19) ◽  
pp. 2474
Author(s):  
Nitzan Cohen ◽  
Yakir Berchenko

Information criteria such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are commonly used for model selection. However, the current theory does not support unconventional data, so naive use of these criteria is not suitable for data with missing values. Imputation, at the core of most alternative methods, is both distorted and computationally demanding. We propose a new approach that enables the use of classic, well-known information criteria for model selection when there are missing data. We adapt the current theory of information criteria through normalization, accounting for the different sample sizes used for each candidate model (focusing on AIC and BIC). Interestingly, when the sample sizes are different, our theoretical analysis finds that AICj/nj is the proper correction of AICj that we need to optimize (where nj is the sample size available to the jth model), while −(BICj − BICi)/(nj − ni) is the corresponding correction of BIC. Furthermore, we find that the computational complexity of normalized information criteria methods is exponentially better than that of imputation methods. In a series of simulation studies, we find that normalized-AIC and normalized-BIC outperform previous methods (i.e., normalized-AIC is more efficient, and normalized-BIC includes only important variables, although it tends to exclude some of them in cases of large correlation). We propose three additional methods aimed at increasing the statistical efficiency of normalized-AIC: post-selection imputation, Akaike sub-model averaging, and minimum-variance averaging. The latter succeeds in increasing efficiency further.
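A minimal sketch of the normalization described above, assuming an ordinary least-squares setting with hypothetical variables: each candidate model is fit on the complete cases available for its own columns, so the models see different sample sizes n_j, and they are compared by AIC_j/n_j rather than by raw AIC.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def normalized_aic(df, response, predictors):
    """Fit OLS on the complete cases available for this model's own columns and
    return AIC_j / n_j (the normalization described in the abstract) and n_j."""
    sub = df[[response] + list(predictors)].dropna()
    X = sm.add_constant(sub[list(predictors)])
    fit = sm.OLS(sub[response], X).fit()
    return fit.aic / len(sub), len(sub)

# Hypothetical data: x2 has many missing values, so candidate models see different n_j.
rng = np.random.default_rng(2)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=n)
x2[rng.random(n) < 0.4] = np.nan
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

for preds in (["x1"], ["x1", "x2"]):
    score, nj = normalized_aic(df, "y", preds)
    print(preds, "n_j =", nj, "AIC_j/n_j =", round(score, 3))
```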


2018 ◽  
Vol 43 (4) ◽  
pp. 272-289 ◽  
Author(s):  
Sedat Sen ◽  
Allan S. Cohen ◽  
Seock-Ho Kim

Mixture item response theory (MixIRT) models can be used to model heterogeneity among individuals from different subpopulations, but these models do not account for the multilevel structure that is common in educational and psychological data. Multilevel extensions of MixIRT models have been proposed to address this shortcoming. Successful application of multilevel MixIRT models depends in part on detecting the best-fitting model. In this study, the performance of four information criteria, the Akaike information criterion (AIC), Bayesian information criterion (BIC), consistent Akaike information criterion (CAIC), and sample-size adjusted Bayesian information criterion (SABIC), was compared for model selection with a two-level mixture Rasch model in the context of a real-data example and a simulation study. Level 1 consisted of students and Level 2 consisted of schools. The simulation study investigated the performance of the model selection criteria under different sample sizes: both the total sample size (number of students) and the Level 2 sample size (number of schools) were used in calculating the information criteria, to examine how this choice affects detection. Simulation results indicated that CAIC and BIC performed better than the other indices at detecting the true (i.e., generating) model. Furthermore, information criteria based on the total sample size yielded more accurate detections than criteria based on the Level 2 sample size.
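The sample-size question studied here is easy to illustrate numerically: the same hypothetical fit can be scored with BIC, CAIC, and SABIC using either the total number of students or the number of schools as n (AIC is omitted because its penalty does not involve n). The log-likelihood and parameter count below are invented.

```python
import math

def bic(lnL, k, n):   return -2 * lnL + k * math.log(n)
def caic(lnL, k, n):  return -2 * lnL + k * (math.log(n) + 1)
def sabic(lnL, k, n): return -2 * lnL + k * math.log((n + 2) / 24)

# Hypothetical two-level mixture Rasch fit: 2,000 students nested in 50 schools.
lnL, k = -10480.0, 25
for label, n in (("total n (students)", 2000), ("Level-2 n (schools)", 50)):
    print(label,
          "BIC:", round(bic(lnL, k, n), 1),
          "CAIC:", round(caic(lnL, k, n), 1),
          "SABIC:", round(sabic(lnL, k, n), 1))
```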


SAGE Open ◽  
2017 ◽  
Vol 7 (1) ◽  
pp. 215824401770045 ◽  
Author(s):  
Qi Chen ◽  
Wen Luo ◽  
Gregory J. Palardy ◽  
Ryan Glaman ◽  
Amber McEnturff

The growth mixture model (GMM) is a flexible statistical technique for analyzing longitudinal data when there are unknown heterogeneous subpopulations with different growth trajectories. When individuals are nested within clusters, a multilevel growth mixture model (MGMM) should be used to account for the clustering effect. A review of recent literature shows that a higher level of nesting was described in 43% of articles using GMM, none of which used MGMM to account for the clustered data. We conjecture that researchers sometimes ignore the higher level to reduce analytical complexity, but in other situations ignoring the nesting is unavoidable. This Monte Carlo study investigated whether the correct number of classes can still be retrieved when the higher level of nesting in MGMM is ignored. We investigated six commonly used model selection indices: the Akaike information criterion (AIC), consistent AIC (CAIC), Bayesian information criterion (BIC), sample size–adjusted BIC (SABIC), Vuong–Lo–Mendell–Rubin likelihood ratio test (VLMR), and adjusted Lo–Mendell–Rubin likelihood ratio test (ALMR). Results showed that the accuracy of class enumeration decreased for all six indices when the higher level was ignored. BIC, CAIC, and SABIC were the most effective model selection indices under the misspecified model. BIC and CAIC were preferable when the sample size was large and/or the intraclass correlation (ICC) was small, whereas SABIC performed better when the sample size was small and/or the ICC was large. In addition, SABIC and VLMR/ALMR tended to over-extract the number of classes when there were more than two subpopulations and the sample size was large.
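For the bare mechanics of class enumeration with information criteria, here is a minimal single-level sketch using scikit-learn's GaussianMixture, which exposes aic() and bic(). It deliberately ignores the growth-model and multilevel structure examined in the article and uses simulated two-class data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two latent classes with different means (hypothetical "trajectories" collapsed to 2-D).
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
               rng.normal(3.0, 1.0, size=(300, 2))])

# Fit candidate models with 1-4 classes and compare criteria; lower is better.
for g in range(1, 5):
    gm = GaussianMixture(n_components=g, random_state=0).fit(X)
    print(g, "AIC:", round(gm.aic(X), 1), "BIC:", round(gm.bic(X), 1))
```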

