Két pszichológiai populáció sztochasztikus egyenlőségének ellenőrzésére alkalmas statisztikai próbák összehasonlító vizsgálata [A comparative study of statistical tests for the stochastic equality of two psychological populations]

2000 ◽  
Vol 55 (2-3) ◽  
pp. 253-281 ◽  
Author(s):  
András Vargha

In this paper, six statistical tests of stochastic equality are compared by Monte Carlo simulation with respect to Type I error and power. Two populations are said to be stochastically equal with respect to a variable X if, for two observations X1 and X2 drawn independently and at random from the two populations, P(X1 > X2) = P(X1 < X2).

In the simulation, the skewness and kurtosis levels as well as the extent of variance heterogeneity of the two parent distributions were varied across a wide range. The sample sizes were either small or moderate, and equal or unequal. The tests of stochastic equality compared were the rank t test, the rank Welch test, the Fligner-Policello test, Cliff's modified Fligner-Policello test, and two newly proposed modifications of the last two tests, denoted FPW and FPCW, that use adjusted degrees of freedom.

The two new test variants, FPW and FPCW, proved substantially more accurate with regard to Type I error than the others while maintaining a similar power level. Specifically, the estimated Type I error of FPW at the .05 nominal level always fell in the range .043-.063, even when the variance ratio of the two distributions was as large as 1:16. The corresponding ranges were .049-.068 for FPCW, but .029-.160 for the rank t test, .049-.096 for the rank Welch test, .035-.075 for the Fligner-Policello test, and .040-.078 for Cliff's test. With moderate sample sizes, FPCW appeared equivalent to FPW.
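
As an illustration of the quantities involved (not the authors' exact FPW or FPCW procedures), the sketch below estimates the probability of superiority that defines stochastic equality and runs a rank Welch test, i.e., Welch's t test applied to the ranks of the pooled sample; the data are simulated and all names are illustrative.

    import numpy as np
    from scipy import stats

    def probability_of_superiority(x, y):
        """Estimate A = P(X1 > X2) + 0.5 * P(X1 = X2); A = 0.5 under stochastic equality."""
        diffs = np.subtract.outer(np.asarray(x, float), np.asarray(y, float))
        return np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

    def rank_welch_test(x, y):
        """Welch's t test applied to the ranks of the pooled sample."""
        ranks = stats.rankdata(np.concatenate([x, y]))
        return stats.ttest_ind(ranks[:len(x)], ranks[len(x):], equal_var=False)

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1, 30)
    y = rng.normal(0, 4, 20)        # equal means, strongly unequal variances
    print(probability_of_superiority(x, y))
    print(rank_welch_test(x, y))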

2015 ◽  
Vol 46 (3) ◽  
pp. 586-603 ◽  
Author(s):  
Ma Dolores Hidalgo ◽  
Isabel Benítez ◽  
Jose-Luis Padilla ◽  
Juana Gómez-Benito

The growing use of scales in survey questionnaires makes it necessary to address how polytomous differential item functioning (DIF) affects observed scale score comparisons. The aim of this study is to investigate the impact of DIF on the Type I error and effect size of the independent samples t-test on the observed total scale scores. A simulation study was conducted, focusing on variables potentially related to DIF in polytomous items, such as DIF pattern, sample size, DIF magnitude, and percentage of DIF items. The results showed that the DIF pattern and the number of DIF items affected the Type I error rates and effect sizes of the t-test. The results highlight the need to analyze DIF before making comparative group interpretations.
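
A hedged sketch of the kind of simulation described here: two groups with identical latent trait distributions answer 20 five-category items, a subset of which carry uniform DIF, and the observed total scores are compared with an independent-samples t test. The response model and every parameter value below are illustrative assumptions, not the authors' design.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_items, n_dif, dif_shift = 20, 4, 0.5        # assumed: 20% of items carry uniform DIF
    cuts = np.array([-1.5, -0.5, 0.5, 1.5])       # thresholds -> five ordered categories (0-4)
    n_per_group, n_reps, alpha = 100, 2000, 0.05

    def total_score(theta, with_dif):
        """Sum of polytomous item scores for respondents with latent trait theta."""
        score = np.zeros(len(theta), dtype=int)
        for j in range(n_items):
            shift = dif_shift if (with_dif and j < n_dif) else 0.0
            propensity = theta + rng.normal(0, 1, len(theta))       # noisy item propensity
            score += np.searchsorted(cuts + shift, propensity)      # category = thresholds passed
        return score

    rejections = 0
    for _ in range(n_reps):
        theta_ref = rng.normal(0, 1, n_per_group)   # identical latent distributions in both groups
        theta_foc = rng.normal(0, 1, n_per_group)
        _, p = stats.ttest_ind(total_score(theta_ref, False), total_score(theta_foc, True))
        rejections += p < alpha

    print("empirical Type I error of the total-score t test:", rejections / n_reps)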


1994 ◽  
Vol 19 (3) ◽  
pp. 275-291 ◽  
Author(s):  
James Algina ◽  
T. C. Oshima ◽  
Wen-Ying Lin

Type I error rates were estimated for three tests that compare means by using data from two independent samples: the independent samples t test, Welch's approximate degrees of freedom test, and James's second-order test. Type I error rates were estimated for skewed distributions, equal and unequal variances, equal and unequal sample sizes, and a range of total sample sizes. Welch's test and James's test have very similar Type I error rates and tend to control the Type I error rate as well as or better than the independent samples t test does. The results provide guidance about the total sample sizes required for controlling Type I error rates.
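
A minimal Monte Carlo sketch in the same spirit, estimating the Type I error of the pooled-variance t test and Welch's test under skewed (lognormal) populations with equal means but unequal variances and unequal sample sizes. James's second-order test is omitted because it has no standard SciPy implementation; all distributional choices are illustrative.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n1, n2, n_reps, alpha = 20, 40, 5000, 0.05

    def skewed_mean_zero(sigma, size):
        """Lognormal draws recentred so the population mean is 0 (H0 of equal means holds)."""
        return rng.lognormal(0.0, sigma, size) - np.exp(sigma ** 2 / 2)

    reject_student = reject_welch = 0
    for _ in range(n_reps):
        x = skewed_mean_zero(0.5, n1)             # smaller, less variable group
        y = 2.0 * skewed_mean_zero(1.0, n2)       # larger group with inflated variance
        reject_student += stats.ttest_ind(x, y, equal_var=True).pvalue < alpha
        reject_welch += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha

    print("Student t empirical Type I error:", reject_student / n_reps)
    print("Welch     empirical Type I error:", reject_welch / n_reps)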


Author(s):  
Patrick J. Rosopa ◽  
Alice M. Brawley ◽  
Theresa P. Atkinson ◽  
Stephen A. Robertson

Preliminary tests for homoscedasticity may be unnecessary in general linear models. Based on Monte Carlo simulations, results suggest that when testing for differences between independent slopes, the unconditional use of weighted least squares regression and HC4 regression performed the best across a wide range of conditions.
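
A minimal sketch, on simulated data, of testing whether two independent groups share a regression slope via a group-by-predictor interaction with an HC4 heteroscedasticity-consistent covariance computed by hand (the Cribari-Neto HC4 estimator; it is not assumed to be exposed directly by common regression packages). Variable names and data are illustrative.

    import numpy as np
    from scipy import stats

    def hc4_cov(X, resid):
        """HC4 sandwich covariance: (X'X)^-1 X' diag(e_i^2 / (1-h_i)^delta_i) X (X'X)^-1."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # leverages h_ii
        delta = np.minimum(4.0, n * h / p)
        omega = resid ** 2 / (1.0 - h) ** delta
        return XtX_inv @ (X.T * omega) @ X @ XtX_inv

    rng = np.random.default_rng(3)
    n = 120
    group = rng.integers(0, 2, n)                        # two independent groups
    x = rng.normal(0, 1, n)
    y = 1.0 + 0.5 * x + rng.normal(0, 1 + group, n)      # heteroscedastic errors, equal slopes

    X = np.column_stack([np.ones(n), x, group, x * group])   # interaction term codes the slope gap
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    se = np.sqrt(np.diag(hc4_cov(X, resid)))
    t_slope_diff = beta[3] / se[3]
    p = 2 * stats.t.sf(abs(t_slope_diff), df=n - X.shape[1])
    print("slope difference: t =", t_slope_diff, "p =", p)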


1998 ◽  
Vol 10 (7) ◽  
pp. 1895-1923 ◽  
Author(s):  
Thomas G. Dietterich

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 × 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 × 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 × 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.
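
A short sketch of the 5 × 2 cv t statistic as described above: five replications of twofold cross-validation yield ten error-rate differences, the first difference forms the numerator, and the five per-replication variance estimates form the denominator, referred to a t distribution with 5 degrees of freedom. The diffs array is assumed input (observed error differences), not the output of an actual learning experiment.

    import numpy as np
    from scipy import stats

    def five_by_two_cv_t(diffs):
        """diffs[i, j]: error(A) - error(B) on fold j of replication i (5 x 2 array)."""
        diffs = np.asarray(diffs, float)
        mean_i = diffs.mean(axis=1)                            # per-replication mean difference
        s2_i = ((diffs - mean_i[:, None]) ** 2).sum(axis=1)    # per-replication variance estimate
        t = diffs[0, 0] / np.sqrt(s2_i.mean())
        return t, 2 * stats.t.sf(abs(t), df=5)

    diffs = np.array([[0.03, 0.01], [0.02, 0.04],              # illustrative numbers only
                      [0.01, 0.02], [0.05, 0.00], [0.02, 0.03]])
    print(five_by_two_cv_t(diffs))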


2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Keisuke Ejima ◽  
Andrew Brown ◽  
Daniel Smith ◽  
Ufuk Beyaztas ◽  
David Allison

Objectives: Awareness of rigor, reproducibility and transparency (RRT) has expanded over the last decade. Although RRT can be improved in many ways, we focused on the Type I error rates and power of statistical analyses commonly used to test mean differences between two groups with small (n ≤ 5) to moderate sample sizes.

Methods: We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether the Type I error rate could be affected by the choice of statistical test, we adjusted the empirical weight distributions to enforce the null hypothesis (i.e., no mean difference) in two ways: Case 1) center both weight distributions on the same mean weight; Case 2) combine data from the control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a 'plasmode' dataset. We performed five common tests (Student's t-test, Welch's t-test, Wilcoxon test, permutation test and bootstrap test) on the plasmodes and computed Type I error rates. Power was assessed using plasmodes in which the distribution of the control group was shifted by adding a constant value, as in Case 1, but so as to realize nominal effect sizes.

Results: Type I error rates were substantially higher than the nominal significance level (Type I error rate inflation) for Student's t-test, Welch's t-test and the permutation test, especially when sample sizes were small in Case 1, whereas inflation was observed only for the permutation test in Case 2. Deflation was noted for the bootstrap test with small samples. Increasing the sample size mitigated both inflation and deflation, except for the Wilcoxon test in Case 1, because heterogeneity of the weight distributions between groups violated its assumptions for the purpose of testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, the bootstrap test was underpowered with small samples as a tradeoff for maintaining Type I error rates.

Conclusions: With small samples (n ≤ 5), the bootstrap test avoided Type I error rate inflation, but often at the cost of lower power. To avoid Type I error rate inflation with the other tests, sample size should be increased. The Wilcoxon test should be avoided because of the heterogeneity of weight distributions between mutant and control mice.

Funding Sources: This study was supported in part by NIH and a Japan Society for the Promotion of Science (JSPS) KAKENHI grant.
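
Two of the five tests, implemented as minimal stand-alone sketches rather than the authors' plasmode pipeline: a label-shuffling permutation test and a bootstrap test that resamples from null-recentred groups. The weight values are illustrative, not the murine data.

    import numpy as np

    rng = np.random.default_rng(4)

    def permutation_test_mean_diff(x, y, n_resamples=10000):
        """Two-sided permutation test for a mean difference via random label shuffles."""
        pooled = np.concatenate([x, y])
        observed = abs(x.mean() - y.mean())
        hits = 0
        for _ in range(n_resamples):
            perm = rng.permutation(pooled)
            hits += abs(perm[:len(x)].mean() - perm[len(x):].mean()) >= observed
        return hits / n_resamples

    def bootstrap_test_mean_diff(x, y, n_resamples=10000):
        """Bootstrap test: recentre both groups on the pooled mean, then resample under H0."""
        grand_mean = np.concatenate([x, y]).mean()
        x0, y0 = x - x.mean() + grand_mean, y - y.mean() + grand_mean
        observed = abs(x.mean() - y.mean())
        hits = 0
        for _ in range(n_resamples):
            xb = rng.choice(x0, size=len(x), replace=True)
            yb = rng.choice(y0, size=len(y), replace=True)
            hits += abs(xb.mean() - yb.mean()) >= observed
        return hits / n_resamples

    x = np.array([22.1, 24.5, 23.0, 21.7, 25.2])   # illustrative weights, n = 5 per group
    y = np.array([23.4, 22.8, 24.9, 21.9, 23.7])
    print("permutation p:", permutation_test_mean_diff(x, y))
    print("bootstrap p  :", bootstrap_test_mean_diff(x, y))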


1999 ◽  
Vol 11 (8) ◽  
pp. 1885-1892 ◽  
Author(s):  
Ethem Alpaydın

Dietterich (1998) reviews five statistical tests and proposes the 5 × 2 cv t test for determining whether there is a significant difference between the error rates of two classifiers. In our experiments, we noticed that the 5 × 2 cv t test result may vary depending on factors that should not affect the test, and we propose a variant, the combined 5 × 2 cv F test, that combines multiple statistics to obtain a more robust test. Simulation results show that this combined version of the test has lower type I error and higher power than the 5 × 2 cv t test proper.
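
A companion sketch of the combined 5 × 2 cv F statistic proposed here, which pools all ten squared fold differences in the numerator instead of singling one out, and refers the ratio to an F distribution with (10, 5) degrees of freedom. As before, the diffs array is assumed input.

    import numpy as np
    from scipy import stats

    def five_by_two_cv_f(diffs):
        """Combined 5 x 2 cv F statistic: pools all ten squared fold differences."""
        diffs = np.asarray(diffs, float)
        mean_i = diffs.mean(axis=1)
        s2_i = ((diffs - mean_i[:, None]) ** 2).sum(axis=1)
        f = (diffs ** 2).sum() / (2.0 * s2_i.sum())
        return f, stats.f.sf(f, dfn=10, dfd=5)

    diffs = np.array([[0.03, 0.01], [0.02, 0.04],              # illustrative numbers only
                      [0.01, 0.02], [0.05, 0.00], [0.02, 0.03]])
    print(five_by_two_cv_f(diffs))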


Methodology ◽  
2012 ◽  
Vol 8 (1) ◽  
pp. 1-11 ◽  
Author(s):  
John Ruscio ◽  
Brendan Roche

Parametric assumptions for statistical tests include normality and equal variances. Micceri (1989) found that data frequently violate the normality assumption; variances have received less attention. We recorded within-group variances of dependent variables for 455 studies published in leading psychology journals. Sample variances differed, often substantially, suggesting frequent violation of the assumption of equal population variances. Parallel analyses of equal-variance artificial data otherwise matched to the characteristics of the empirical data show that unequal sample variances in the empirical data exceed expectations from normal sampling error and can adversely affect Type I error rates of parametric statistical tests. Variance heterogeneity was unrelated to relative group sizes or total sample size and observed across subdisciplines of psychology in experimental and correlational research. These results underscore the value of examining variances and, when appropriate, using data-analytic methods robust to unequal variances. We provide a standardized index for examining and reporting variance heterogeneity.
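
A small sketch of the practice recommended here: inspect within-group variances before testing and fall back to a heteroscedasticity-robust comparison when they differ. The simple max/min variance ratio below is a generic descriptive, not the standardized index the authors propose; the data are simulated.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    g1 = rng.normal(50, 5, 40)
    g2 = rng.normal(50, 15, 40)       # same population mean, three times the SD

    v1, v2 = g1.var(ddof=1), g2.var(ddof=1)
    print("variance ratio (max/min):", round(max(v1, v2) / min(v1, v2), 2))
    print("Levene's test:", stats.levene(g1, g2))                      # equal-variance check
    print("Welch's t test:", stats.ttest_ind(g1, g2, equal_var=False)) # robust alternative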


2020 ◽  
Vol 10 (18) ◽  
pp. 6247
Author(s):  
Hanan M. Hammouri ◽  
Roy T. Sabo ◽  
Rasha Alsaadawi ◽  
Khalid A. Kheirallah

Scientists in biomedical and psychosocial research routinely have to deal with skewed data. When comparing means from two groups, the log transformation is traditionally used to normalize skewed data before applying the two-group t-test. An alternative method that does not assume normality is the generalized linear model (GLM) combined with an appropriate link function. In this work, the two techniques are compared using Monte Carlo simulations, each consisting of many iterations that simulate two groups of skewed data for three sampling distributions: gamma, exponential, and beta. The methods are then compared with respect to Type I error rates, power, and the estimates of the mean differences. We conclude that the t-test with log transformation outperformed the GLM method for non-normal data following beta or gamma distributions, whereas for exponentially distributed data the GLM method outperformed the t-test with log transformation.
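
A minimal sketch of the two routes being compared, on simulated gamma data under the null: a t-test on log-transformed values versus a Gamma GLM with a log link fitted to the raw values. The statsmodels family and link class names assume a recent statsmodels release.

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    y1 = rng.gamma(shape=2.0, scale=3.0, size=50)   # skewed outcome, group 0
    y2 = rng.gamma(shape=2.0, scale=3.0, size=50)   # same distribution, so H0 holds

    # Route 1: log-transform, then the ordinary two-group t-test
    print(stats.ttest_ind(np.log(y1), np.log(y2)))

    # Route 2: Gamma GLM with a log link on the raw data; the group coefficient
    # estimates the log ratio of group means (class names assume a recent statsmodels)
    y = np.concatenate([y1, y2])
    group = np.repeat([0, 1], 50)
    X = sm.add_constant(group)
    fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
    print(fit.params, fit.pvalues)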


1980 ◽  
Vol 5 (4) ◽  
pp. 337-349 ◽  
Author(s):  
Philip H. Ramsey

It is noted that disagreements have arisen in the literature about the robustness of the t test in normal populations with unequal variances. Hsu's procedure is applied to determine exact Type I error rates for t. Employing fairly liberal but objective standards for assessing robustness, it is shown that the t test is not always robust to violation of the assumption of equal population variances, even when sample sizes are equal. Several guidelines are suggested, including that applying t at α = .05 without regard for unequal variances would require equal sample sizes of at least 15 under one of the standards considered. In many cases, especially those with unequal sample sizes, an alternative such as Welch's procedure is recommended.


2021 ◽  
Vol 11 (2) ◽  
pp. 62
Author(s):  
I-Shiang Tzeng

Significance analysis of microarrays (SAM) provides researchers with a non-parametric score for each gene based on repeated measurements. However, like general statistical tests, it may lose power to correctly detect differentially expressed genes (DEGs) when the homogeneity assumption is violated. Monte Carlo simulation shows that the "half SAM score" can maintain Type I error rates of about 0.05 under both normal and non-normal distributions. The author found 265 DEGs using half SAM scoring, more than the 119 DEGs detected by SAM, with the false discovery rate controlled at 0.05. In conclusion, the author recommends the half SAM scoring method for detecting DEGs in data that show heterogeneity.
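
For context, a sketch of the standard SAM-style relative difference score (not the author's half SAM variant, whose details are specific to this paper): per-gene mean difference divided by the pooled standard error plus a small positive constant s0; here s0 is chosen crudely as the median standard error.

    import numpy as np

    def sam_scores(expr_a, expr_b, s0=None):
        """SAM-style relative difference per gene: d = (mean_a - mean_b) / (s + s0),
        with expr_a and expr_b as genes x samples matrices for the two conditions."""
        na, nb = expr_a.shape[1], expr_b.shape[1]
        diff = expr_a.mean(axis=1) - expr_b.mean(axis=1)
        pooled_var = ((na - 1) * expr_a.var(axis=1, ddof=1) +
                      (nb - 1) * expr_b.var(axis=1, ddof=1)) / (na + nb - 2)
        s = np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
        if s0 is None:
            s0 = np.median(s)          # crude choice; SAM itself tunes s0 from the data
        return diff / (s + s0)

    rng = np.random.default_rng(7)
    a = rng.normal(0, 1, (1000, 4))    # 1000 genes, 4 replicates per condition
    b = rng.normal(0, 1, (1000, 4))
    print(sam_scores(a, b)[:5])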

