Underreporting in Political Science Survey Experiments: Comparing Questionnaires to Published Results

2015 ◽  
Vol 23 (2) ◽  
pp. 306-312 ◽  
Author(s):  
Annie Franco ◽  
Neil Malhotra ◽  
Gabor Simonovits

The accuracy of published findings is compromised when researchers fail to report and adjust for multiple testing. Preregistration of studies and the requirement of preanalysis plans for publication are two proposed solutions to combat this problem. Some have raised concerns that such changes in research practice may hinder inductive learning. However, without knowing the extent of underreporting, it is difficult to assess the costs and benefits of institutional reforms. This paper examines published survey experiments conducted as part of the Time-sharing Experiments in the Social Sciences program, where the questionnaires are made publicly available, allowing us to compare planned design features against what is reported in published research. We find that: (1) 30% of papers report fewer experimental conditions in the published paper than in the questionnaire; (2) roughly 60% of papers report fewer outcome variables than are listed in the questionnaire; and (3) about 80% of papers fail to report all experimental conditions and outcomes. These findings suggest that published statistical tests understate the probability of type I errors.
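The closing claim about understated type I error rates follows from the usual family-wise calculation. A minimal sketch, assuming independent tests at a nominal per-test level (the alpha and test counts are illustrative, not the paper's data):

```python
# Illustrative only: how the chance of at least one false positive grows when
# extra, unreported tests are run. Assumes independent tests at a common alpha.
alpha = 0.05

for n_tests in (1, 2, 5, 10):
    fwer = 1 - (1 - alpha) ** n_tests  # P(at least one Type I error)
    print(f"{n_tests:2d} independent tests at alpha={alpha}: FWER = {fwer:.3f}")
```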

2021 ◽  
Author(s):  
Quentin André

When researchers choose to identify and exclude outliers from their data, should they do so across all the data, or within experimental conditions? A survey of recent papers published in the Journal of Experimental Psychology: General shows that both methods are widely used, and common data visualization techniques suggest that outliers should be excluded at the condition level. However, I highlight in the present paper that removing outliers by condition runs against the logic of hypothesis testing, and that this practice leads to unacceptable increases in false-positive rates. I demonstrate that this conclusion holds true across a variety of statistical tests, exclusion criteria and cutoffs, sample sizes, and data types, and show in simulated experiments and in a re-analysis of existing data that by-condition exclusions can result in false-positive rates as high as 43%. Finally, I demonstrate that by-condition exclusions are a specific case of a more general issue: Any outlier exclusion procedure that is not blind to the hypothesis that researchers want to test may result in inflated Type I errors. I conclude by offering best practices and recommendations for excluding outliers.
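A rough sketch of the kind of null simulation described here: two conditions drawn from the same distribution, outliers excluded either from the pooled data (blind to condition) or within each condition, followed by a t-test. The sample size, z-cutoff, and simulation count are illustrative assumptions, not the paper's exact settings:

```python
# Null simulation: both conditions come from the same normal distribution, so any
# rejection is a false positive. Compares pooled vs. by-condition exclusion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_cell, z_cut = 5000, 50, 2.0

def exclude(x, mean, sd):
    return x[np.abs(x - mean) <= z_cut * sd]

fp_overall = fp_by_condition = 0
for _ in range(n_sims):
    a, b = rng.normal(size=n_per_cell), rng.normal(size=n_per_cell)

    # Exclusion based on the pooled sample (blind to condition).
    pooled = np.concatenate([a, b])
    m, s = pooled.mean(), pooled.std(ddof=1)
    p1 = stats.ttest_ind(exclude(a, m, s), exclude(b, m, s)).pvalue

    # Exclusion within each condition separately.
    p2 = stats.ttest_ind(exclude(a, a.mean(), a.std(ddof=1)),
                         exclude(b, b.mean(), b.std(ddof=1))).pvalue

    fp_overall += p1 < 0.05
    fp_by_condition += p2 < 0.05

print("False-positive rate, pooled exclusion:      ", fp_overall / n_sims)
print("False-positive rate, by-condition exclusion:", fp_by_condition / n_sims)
```

Inflation is expected to be larger for skewed data and stricter cutoffs, per the paper; the normal data here is only the simplest case.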


2015 ◽  
Vol 2015 ◽  
pp. 1-5
Author(s):  
Wararit Panichkitkosolkul

An asymptotic test and an approximate test for the reciprocal of a normal mean with a known coefficient of variation were proposed in this paper. The asymptotic test was based on the expectation and variance of the estimator of the reciprocal of a normal mean. The approximate test used the expectation and variance of the estimator approximated by Taylor series expansion. A Monte Carlo simulation study was conducted to compare the performance of the two statistical tests. Simulation results showed that the two proposed tests performed well in terms of empirical type I error rates and power. Nevertheless, the approximate test was easier to compute than the asymptotic test.
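For intuition, the sketch below runs a generic Monte Carlo check of empirical Type I error for a Wald-type statistic whose variance comes from a first-order Taylor (delta-method) approximation of 1/(sample mean) with known coefficient of variation. It illustrates the general approach only; it is not necessarily either of the paper's two tests, and mu0, tau, and n are made-up values:

```python
# Generic Monte Carlo check of the empirical level of a delta-method Wald test
# of H0: 1/mu = theta0 when the coefficient of variation tau = sigma/mu is known.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mu0, tau, n, alpha, n_sims = 2.0, 0.1, 30, 0.05, 20000
theta0 = 1.0 / mu0

rejections = 0
for _ in range(n_sims):
    x = rng.normal(mu0, tau * mu0, size=n)      # data generated under H0
    theta_hat = 1.0 / x.mean()                  # estimator of 1/mu
    se = tau * abs(theta0) / np.sqrt(n)         # delta-method SE under H0
    z = (theta_hat - theta0) / se
    rejections += abs(z) > stats.norm.ppf(1 - alpha / 2)

print("Empirical Type I error:", rejections / n_sims)
```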


Author(s):  
Jinsong Chen ◽  
Mark J. van der Laan ◽  
Martyn T. Smith ◽  
Alan E. Hubbard

Microarray studies often need to simultaneously examine thousands of genes to determine which are differentially expressed. One main challenge in these studies is to find suitable multiple testing procedures that provide accurate control of the error rates of interest while remaining maximally powerful, that is, returning the longest list of truly interesting genes among competitors. Many multiple testing methods have been developed recently for microarray data analysis, especially resampling-based methods such as permutation methods, the null-centered and scaled bootstrap (NCSB) method, and the quantile-transformed-bootstrap-distribution (QTBD) method. Each of these methods has its own merits and limitations. Theoretically, permutation methods can fail to provide accurate control of Type I errors when the so-called subset pivotality condition is violated. The NCSB method does not suffer from that limitation, but an impractically large number of bootstrap samples is often needed to obtain proper control of Type I errors. The newly developed QTBD method has the virtue of providing accurate control of Type I errors under few restrictions. However, the relative practical performance of these three types of multiple testing methods remains unresolved. This paper compares the three resampling-based methods with respect to control of the family-wise error rate (FWER) through data simulations. Results show that, among the three resampling-based methods, the QTBD method provides relatively accurate and powerful control in more general circumstances.
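As a point of reference for the permutation family of methods mentioned above, here is a minimal Westfall-Young-style maxT sketch on simulated two-group expression data. Gene counts, group sizes, and permutation counts are illustrative, and this is not the NCSB or QTBD procedure:

```python
# Permutation-based maxT adjustment for two-group differential expression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_genes, n_per_group, n_perm = 200, 10, 2000

# Simulated expression matrix: genes x samples, first 10 samples = group A.
expr = rng.normal(size=(n_genes, 2 * n_per_group))
labels = np.array([0] * n_per_group + [1] * n_per_group)

def t_stats(matrix, lab):
    a, b = matrix[:, lab == 0], matrix[:, lab == 1]
    return stats.ttest_ind(a, b, axis=1).statistic

obs = np.abs(t_stats(expr, labels))

# Null distribution of the maximum |t| over genes under label permutation.
max_null = np.empty(n_perm)
for i in range(n_perm):
    max_null[i] = np.abs(t_stats(expr, rng.permutation(labels))).max()

# maxT-adjusted p-values control the FWER (given subset pivotality).
adj_p = np.array([(max_null >= t).mean() for t in obs])
print("Genes significant at FWER 0.05:", int((adj_p <= 0.05).sum()))
```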


2004 ◽  
Vol 3 (1) ◽  
pp. 1-25 ◽  
Author(s):  
Mark J. van der Laan ◽  
Sandrine Dudoit ◽  
Katherine S. Pollard

This article shows that any single-step or stepwise multiple testing procedure (asymptotically) controlling the family-wise error rate (FWER) can be augmented into procedures that (asymptotically) control tail probabilities for the number of false positives and the proportion of false positives among the rejected hypotheses. Specifically, given any procedure that (asymptotically) controls the FWER at level alpha, we propose simple augmentation procedures that provide (asymptotic) level-alpha control of: (i) the generalized family-wise error rate, i.e., the tail probability, gFWER(k), that the number of Type I errors exceeds a user-supplied integer k, and (ii) the tail probability, TPPFP(q), that the proportion of Type I errors among the rejected hypotheses exceeds a user-supplied value q, where 0 < q < 1.
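The augmentation idea fits in a few lines: start from any FWER-controlling rejection set (Bonferroni is used below purely as a stand-in), then reject k additional hypotheses for gFWER(k) control, or floor(q*r0/(1-q)) additional hypotheses for TPPFP(q) control. A sketch with simulated placeholder p-values:

```python
# Augmenting an FWER-level rejection set to control gFWER(k) and TPPFP(q).
import numpy as np

rng = np.random.default_rng(11)
pvals = rng.uniform(size=100) ** 2      # illustrative p-values, some small
alpha, k, q = 0.05, 2, 0.10

order = np.argsort(pvals)
r0 = int((pvals <= alpha / len(pvals)).sum())   # initial Bonferroni rejections

# gFWER(k): additionally reject the k next most significant hypotheses.
gfwer_rejections = order[: r0 + k]

# TPPFP(q): add the largest a with a / (r0 + a) <= q, i.e. a = floor(q * r0 / (1 - q)).
a = int(np.floor(q * r0 / (1 - q)))
tppfp_rejections = order[: r0 + a]

print("Initial FWER rejections:", r0)
print("gFWER(k=2) rejections:  ", len(gfwer_rejections))
print("TPPFP(q=0.1) rejections:", len(tppfp_rejections))
```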


2022 ◽  
Vol 29 (1) ◽  
pp. 1-70
Author(s):  
Radu-Daniel Vatavu ◽  
Jacob O. Wobbrock

We clarify fundamental aspects of end-user elicitation, enabling such studies to be run and analyzed with confidence, correctness, and scientific rigor. To this end, our contributions are multifold. We introduce a formal model of end-user elicitation in HCI and identify three types of agreement analysis: expert, codebook, and computer. We show that agreement is a mathematical tolerance relation generating a tolerance space over the set of elicited proposals. We review current measures of agreement and show that all can be computed from an agreement graph. In response to recent criticisms, we show that chance agreement represents an issue solely for inter-rater reliability studies and not for end-user elicitation, where it is opposed by chance disagreement. We conduct extensive simulations of 16 statistical tests for agreement rates, and report Type I errors and power. Based on our findings, we provide recommendations for practitioners and introduce a five-level hierarchy for elicitation studies.
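As a concrete illustration of computing agreement from an agreement graph, the sketch below treats each pair of equivalent proposals for a referent as an edge and divides by the number of possible pairs, which coincides with the agreement rate measure of Vatavu and Wobbrock (2015). The proposal labels are invented:

```python
# Agreement for one referent, computed from the agreement graph and, equivalently,
# from the sizes of the groups of identical proposals.
from itertools import combinations
from collections import Counter

# Elicited proposals from 8 participants for one referent (illustrative labels).
proposals = ["swipe", "swipe", "swipe", "tap", "tap", "pinch", "swipe", "tap"]

n = len(proposals)
pairs_total = n * (n - 1) // 2

# Agreement graph edges: every pair of identical proposals agrees.
edges = sum(1 for a, b in combinations(proposals, 2) if a == b)
print(f"Agreeing pairs: {edges}/{pairs_total}  AR = {edges / pairs_total:.3f}")

# Equivalent group-based form: sum over groups of |P_i|(|P_i|-1) / (|P|(|P|-1)).
groups = Counter(proposals)
ar_groups = sum(c * (c - 1) for c in groups.values()) / (n * (n - 1))
print(f"Group form AR = {ar_groups:.3f}")
```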


Author(s):  
Thomas Verron ◽  
Xavier Cahours ◽  
Stéphane Colard

Summary
During the last two decades, tobacco product reporting requirements from regulators, such as those in Europe, Canada, and the USA, have increased. However, the capacity to accurately compare and discriminate between two products is affected by the number of constituents used for the comparison. Indeed, performing a large number of simultaneous independent hypothesis tests increases the probability of rejecting the null hypothesis when it should not be rejected, virtually guaranteeing the presence of type I errors among the findings. Correction methods, such as the Bonferroni and Benjamini & Hochberg procedures, have been developed to overcome this issue. The performance of these methods was assessed by comparing identical tobacco products using data sets of different sizes. Results showed that multiple comparisons lead to erroneous conclusions if the risk of type I error is not corrected. Unfortunately, reducing the type I error rate reduces the statistical power of the tests. Consequently, strategies for dealing with data multiplicity should strike a reasonable balance between testing requirements and the statistical power of differentiation. Multiple testing for product comparison is less of a problem if studies are restricted to the most relevant parameters for comparison.
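A minimal sketch of the two corrections named in the summary, applied to simulated p-values from comparisons of two identical products (so all nulls are true and every rejection is a type I error); the number of constituents and the alpha level are illustrative:

```python
# Uncorrected vs. Bonferroni vs. Benjamini-Hochberg on all-null p-values.
import numpy as np

rng = np.random.default_rng(5)
m, alpha = 100, 0.05
pvals = rng.uniform(size=m)             # identical products: all nulls true

uncorrected = pvals < alpha
bonferroni = pvals < alpha / m          # controls FWER, conservative

# Benjamini-Hochberg step-up procedure (controls the false discovery rate).
order = np.argsort(pvals)
ranked = pvals[order]
below = ranked <= alpha * np.arange(1, m + 1) / m
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
bh = np.zeros(m, dtype=bool)
bh[order[:k]] = True

print("Rejections without correction:", int(uncorrected.sum()))
print("Rejections with Bonferroni:   ", int(bonferroni.sum()))
print("Rejections with BH:           ", int(bh.sum()))
```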


2021 ◽  
Vol 17 (12) ◽  
pp. e1009036
Author(s):  
Jack Kuipers ◽  
Ariane L. Moore ◽  
Katharina Jahn ◽  
Peter Schraml ◽  
Feng Wang ◽  
...  

Tumour progression is an evolutionary process in which different clones evolve over time, leading to intra-tumour heterogeneity. Interactions between clones can affect tumour evolution and hence disease progression and treatment outcome. Intra-tumoural pairs of mutations that are overrepresented in a co-occurring or clonally exclusive fashion over a cohort of patient samples may be suggestive of a synergistic effect between the different clones carrying these mutations. We therefore developed a novel statistical testing framework, called GeneAccord, to identify such gene pairs that are altered in distinct subclones of the same tumour. We analysed our framework for calibration and power. By comparing its performance to baseline methods, we demonstrate that to control type I errors, it is essential to account for the evolutionary dependencies among clones. In applying GeneAccord to the single-cell sequencing of a cohort of 123 acute myeloid leukaemia patients, we find 1 clonally co-occurring and 8 clonally exclusive gene pairs. The clonally exclusive pairs mostly involve genes of the key signalling pathways.


1978 ◽  
Vol 46 (1) ◽  
pp. 211-218
Author(s):  
Louis M. Hsu

The problem of controlling the risk of occurrence of at least one Type I Error in a family of n statistical tests has been discussed extensively in the psychological literature. However, the more general problem of controlling the probability of more than some maximum (not necessarily zero) tolerable number x_m of Type I Errors in such a family appears to have received little attention. The present paper presents a simple Poisson approximation to the significance level P(E_I) that should be used per test to achieve this goal in a family of n independent tests. The cases of equal and unequal significance levels for the n tests are discussed. Relative merits and limitations of the Poisson and Bonferroni methods of controlling the number of Type I Errors are examined, and applications of the Poisson method to tests of orthogonal contrasts in analysis of variance, multiple tests of hypotheses in single studies, and multiple tests of hypotheses in literature reviews are discussed.
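A sketch of the general idea under the Poisson approximation: the number of Type I Errors among n independent tests at per-test level p is roughly Poisson(np), so one can search for the largest p that keeps P(more than x_m errors) below the family tolerance. The numbers are illustrative, and this is a reconstruction of the approach rather than the paper's tables:

```python
# Choose a per-test significance level so that at most x_m Type I errors are
# tolerated with high probability across n independent tests.
from scipy import stats

n, x_m, family_alpha = 50, 2, 0.05    # n tests, tolerate at most 2 Type I errors

# Number of Type I errors ~ Binomial(n, p), approximated by Poisson(n * p).
def excess_prob(p):
    return stats.poisson.sf(x_m, n * p)   # P(more than x_m errors)

lo, hi = 0.0, 1.0
for _ in range(60):                       # bisection for the largest admissible p
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if excess_prob(mid) <= family_alpha else (lo, mid)

p_per_test = lo
print(f"Per-test level: {p_per_test:.4f}")
print(f"Exact binomial check: P(>{x_m} errors) = {stats.binom.sf(x_m, n, p_per_test):.4f}")
print(f"Bonferroni-style level for zero tolerance: {family_alpha / n:.4f}")
```

Allowing a small, nonzero number of Type I errors yields a much less stringent per-test level than the Bonferroni bound, which is the trade-off the paper examines.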


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 962 ◽  
Author(s):  
Judith ter Schure ◽  
Peter Grünwald

Studies accumulate over time and meta-analyses are mainly retrospective. These two characteristics introduce dependencies between the analysis time, at which a series of studies is up for meta-analysis, and results within the series. Dependencies introduce bias — Accumulation Bias — and invalidate the sampling distribution assumed for p-value tests, thus inflating type-I errors. But dependencies are also inevitable, since for science to accumulate efficiently, new research needs to be informed by past results. Here, we investigate various ways in which time influences error control in meta-analysis testing. We introduce an Accumulation Bias Framework that allows us to model a wide variety of practically occurring dependencies including study series accumulation, meta-analysis timing, and approaches to multiple testing in living systematic reviews. The strength of this framework is that it shows how all dependencies affect p-value-based tests in a similar manner. This leads to two main conclusions. First, Accumulation Bias is inevitable, and even if it can be approximated and accounted for, no valid p-value tests can be constructed. Second, tests based on likelihood ratios withstand Accumulation Bias: they provide bounds on error probabilities that remain valid despite the bias. We leave the reader with a choice between two proposals to consider time in error control: either treat individual (primary) studies and meta-analyses as two separate worlds — each with their own timing — or integrate individual studies in the meta-analysis world. Taking up likelihood ratios in either approach allows for valid tests that relate well to the accumulating nature of scientific knowledge. Likelihood ratios can be interpreted as betting profits, earned in previous studies and invested in new ones, while the meta-analyst is allowed to cash out at any time and advise against future studies.
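A sketch of the likelihood-ratios-as-bets argument: under the null, the product of per-study likelihood ratios is a nonnegative martingale with mean one, so rejecting whenever it exceeds 1/alpha bounds the Type I error at alpha even when the decision to run another study depends on the results so far. The accumulation rule and effect size below are illustrative assumptions, not the paper's full framework:

```python
# Null simulation: studies accumulate only while the running evidence looks
# promising, yet thresholding the accumulated likelihood ratio at 1/alpha keeps
# the overall Type I error below alpha (Ville's inequality).
import numpy as np

rng = np.random.default_rng(13)
alpha, delta, n_per_study, max_studies, n_sims = 0.05, 0.3, 25, 10, 20000

false_positives = 0
for _ in range(n_sims):
    lr_product, rejected = 1.0, False
    for _ in range(max_studies):
        x = rng.normal(0.0, 1.0, size=n_per_study)   # data generated under H0
        # Likelihood ratio of H1: mu = delta versus H0: mu = 0 for this study.
        lr = np.exp(delta * x.sum() - n_per_study * delta**2 / 2)
        lr_product *= lr
        if lr_product >= 1 / alpha:
            rejected = True
            break
        # Accumulation-bias-style rule: a new study only happens if the running
        # evidence looks promising; otherwise the series stops.
        if lr_product < 0.5:
            break
    false_positives += rejected

print("Type I error with data-dependent accumulation:", false_positives / n_sims)
```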

