Are Difficult-to-Study Populations Too Difficult to Study in a Reliable Way?

2020 ◽  
Vol 25 (1) ◽  
pp. 41-50 ◽  
Author(s):  
Florian Lange

Abstract. Replication studies, pre-registration, and increases in statistical power will likely improve the reliability of scientific evidence. However, these measures face critical limitations in populations that are inherently difficult to study. Members of difficult-to-study populations (e.g., patients, children, non-human animals) are less accessible to researchers, which typically results in small-sample studies that are infeasible to replicate. Nevertheless, meta-analyses on clinical neuropsychological data suggest that difficult-to-study populations can be studied in a reliable way. These analyses often produce unbiased effect-size estimates despite aggregating across severely underpowered original studies. This finding can be attributed to a neuropsychological research culture involving the non-selective reporting of results from standardized and validated test procedures. Consensus guidelines, test manuals, and psychometric evidence constrain the methodological choices made by neuropsychologists, who regularly report the results from neuropsychological test batteries irrespective of their statistical significance or novelty. Comparable shifts toward more standardization and validation, complete result reports, and between-lab collaborations can allow for a meaningful and reliable study of psychological phenomena in other difficult-to-study populations.
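
To make the aggregation claim concrete, here is a minimal Python simulation sketch (illustrative only; the true effect size, sample size, and the fixed-effect pooling are assumptions, not taken from the article). It shows that when every small-sample study is reported, regardless of significance, the pooled effect-size estimate stays close to the true value even though each individual study is severely underpowered.

```python
# Illustrative sketch (assumed parameters): pooling many underpowered,
# non-selectively reported studies still recovers the true effect size.
import numpy as np

rng = np.random.default_rng(1)
true_d, n_per_group, n_studies = 0.3, 15, 200  # small-sample studies

estimates, weights = [], []
for _ in range(n_studies):
    treat = rng.normal(true_d, 1.0, n_per_group)
    ctrl = rng.normal(0.0, 1.0, n_per_group)
    sd_pooled = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    d = (treat.mean() - ctrl.mean()) / sd_pooled          # Cohen's d
    var_d = 2 / n_per_group + d**2 / (4 * n_per_group)    # approximate sampling variance
    estimates.append(d)
    weights.append(1 / var_d)

pooled = np.average(estimates, weights=weights)           # fixed-effect pooled estimate
print(f"true d = {true_d}, pooled estimate = {pooled:.3f}")  # close to 0.3
```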

2019 ◽  
Author(s):  
Francesco Margoni ◽  
Martin Shepperd

Infant research is making considerable progress. However, among infant researchers there is growing concern regarding the widespread habit of undertaking studies that have small sample sizes and employ tests with low statistical power (to detect a wide range of possible effects). For many researchers, issues of confidence may be partially resolved by relying on replications. Here, we provide further evidence that the classical logic of confirmation, according to which the result of a replication study confirms the original finding when it reaches statistical significance, could usefully be abandoned. With real examples taken from the infant literature and Monte Carlo simulations, we show that a very wide range of possible replication results would, in a formal statistical sense, constitute confirmation, as they can be explained simply by sampling error. Thus, often no useful conclusion can be derived from a single replication study or a small number of them. We suggest that, in order to accumulate and generate new knowledge, the dichotomous view of replication as confirmatory/disconfirmatory can be replaced by an approach that emphasizes the estimation of effect sizes via meta-analysis. Moreover, we discuss possible solutions for reducing problems affecting the validity of conclusions drawn from meta-analyses in infant research.
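
The role of sampling error can be illustrated with a short Monte Carlo sketch in Python (the effect size and sample size below are hypothetical and not taken from the paper): with sample sizes typical of infant studies, replication effect sizes scatter so widely around the true effect that almost any single result is formally compatible with the original finding.

```python
# Hypothetical Monte Carlo sketch: spread of replication effect sizes
# that arises from sampling error alone when the true effect is fixed.
import numpy as np

rng = np.random.default_rng(0)
true_d, n_per_group, n_replications = 0.4, 16, 10_000

rep_ds = []
for _ in range(n_replications):
    a = rng.normal(true_d, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    rep_ds.append((a.mean() - b.mean()) / sd)

lo, hi = np.percentile(rep_ds, [2.5, 97.5])
print(f"95% of replication effect sizes fall between d = {lo:.2f} and d = {hi:.2f}")
# With n = 16 per group this band is roughly -0.3 to 1.1: results that look
# very different from the original are still explainable by sampling error.
```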


2021 ◽  
pp. bjophthalmol-2021-319067
Author(s):  
Felix Friedrich Reichel ◽  
Stylianos Michalakis ◽  
Barbara Wilhelm ◽  
Ditta Zobor ◽  
Regine Muehlfriedel ◽  
...  

Aims: To determine long-term safety and efficacy outcomes of a subretinal gene therapy for CNGA3-associated achromatopsia. We present data from an open-label, nonrandomised controlled trial (NCT02610582). Methods: Details of the study design have been previously described. Briefly, nine patients were treated in three escalating dose groups with subretinal AAV8.CNGA3 gene therapy between November 2015 and October 2016. After the first year, patients were seen on a yearly basis. Safety assessment constituted the primary endpoint. On a secondary level, multiple functional tests were carried out to determine efficacy of the therapy. Results: No adverse or serious adverse events deemed related to the study drug occurred after year 1. Safety of the therapy, as the primary endpoint of this trial, can therefore be confirmed. The functional benefits noted in the treated eye at year 1 persisted throughout the following visits at years 2 and 3. While functional improvement in the treated eye reached statistical significance for some secondary endpoints, for most endpoints this was not the case when the treated eye was compared with the untreated fellow eye. Conclusion: The results demonstrate a very good safety profile of the therapy even at the highest dose administered. The small sample size limits the statistical power of efficacy analyses. However, the trial results inform the most promising design and endpoints for future clinical trials. Such trials will have to determine whether treatment of younger patients results in greater functional gains by avoiding amblyopia as a potential limiting factor.


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Ranganathan Natarajan ◽  
Bohdan Pechenyak ◽  
Usha Vyas ◽  
Pari Ranganathan ◽  
Alan Weinberg ◽  
...  

Background. The primary goal of this randomized, double-blind, placebo-controlled crossover study of Renadyl in end-stage renal disease (ESRD) patients was to assess the safety and efficacy of Renadyl, measured through improvement in quality of life or reduction in levels of known uremic toxins. A secondary goal was to investigate the effects on several biomarkers of inflammation and oxidative stress. Methods. Two 2-month treatment periods separated by a 2-month washout and crossover, with physical examinations, venous blood testing, and quality-of-life questionnaires completed at each visit. Data were analyzed with SAS V9.2. Results. 22 subjects (79%) completed the study. Observed trends were as follows (none reaching statistical significance): a decline in WBC count (−0.51 × 10⁹/L, P = 0.057) and reductions in levels of C-reactive protein (−8.61 mg/L, P = 0.071) and total indoxyl glucuronide (−0.11 mg%, P = 0.058). No statistically significant changes were observed in other uremic toxin levels or measures of QOL. Conclusions. Renadyl appeared to be safe to administer to ESRD patients on hemodialysis. Stability in QOL assessment is an encouraging result for a patient cohort at such an advanced stage of kidney disease. Efficacy could not be confirmed definitively, primarily due to small sample size and low statistical power; further studies are warranted.


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10131
Author(s):  
Jonas Tebbe ◽  
Emily Humble ◽  
Martin Adam Stoffel ◽  
Lisa Johanna Tewes ◽  
Caroline Müller ◽  
...  

Replication studies are essential for evaluating the validity of previous research findings. However, it has proven challenging to reproduce the results of ecological and evolutionary studies, partly because of the complexity and lability of many of the phenomena being investigated, but also due to small sample sizes, low statistical power and publication bias. Additionally, replication is often considered too difficult in field settings where many factors are beyond the investigator’s control and where spatial and temporal dependencies may be strong. We investigated the feasibility of reproducing original research findings in the field of chemical ecology by performing an exact replication of a previous study of Antarctic fur seals (Arctocephalus gazella). In the original study, skin swabs from 41 mother-offspring pairs from two adjacent breeding colonies on Bird Island, South Georgia, were analyzed using gas chromatography-mass spectrometry. Seals from the two colonies differed significantly in their chemical fingerprints, suggesting that colony membership may be chemically encoded, and mothers were also chemically similar to their pups, hinting at the possible involvement of phenotype matching in mother-offspring recognition. In the current study, we generated and analyzed chemical data from a non-overlapping sample of 50 mother-offspring pairs from the same two colonies 5 years later. The original results were corroborated in both hypothesis testing and estimation contexts, with p-values remaining highly significant and effect sizes, standardized between studies by bootstrapping the chemical data over individuals, being of comparable magnitude. However, exact replication studies are only capable of showing whether a given effect can be replicated in a specific setting. We therefore investigated whether chemical signatures are colony-specific in general by expanding the geographic coverage of our study to include pups from a total of six colonies around Bird Island. We detected significant chemical differences in all but a handful of pairwise comparisons between colonies. This finding adds weight to our original conclusion that colony membership is chemically encoded, and suggests that chemical patterns of colony membership not only persist over time but can also be generalized over space. Our study systematically confirms and extends our previous findings, while also implying more broadly that spatial and temporal heterogeneity need not necessarily negate the reproduction and generalization of ecological research findings.
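
For readers unfamiliar with the standardization step, the sketch below shows the general idea of bootstrapping over individuals to attach a sampling distribution to an effect-size estimate so that two studies can be compared on a common footing; the data and the univariate statistic are hypothetical simplifications, not the authors' multivariate chemical pipeline.

```python
# Simplified, hypothetical sketch of effect-size bootstrapping over individuals;
# the real analysis works on multivariate chemical fingerprints.
import numpy as np

def bootstrap_effect(group_a, group_b, n_boot=2000, seed=0):
    """Bootstrap distribution of a standardized mean difference."""
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        a = rng.choice(group_a, size=len(group_a), replace=True)
        b = rng.choice(group_b, size=len(group_b), replace=True)
        sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        boots.append((a.mean() - b.mean()) / sd)
    return np.mean(boots), np.percentile(boots, [2.5, 97.5])

# Hypothetical univariate chemical summaries for two colonies in two studies
rng = np.random.default_rng(1)
original    = bootstrap_effect(rng.normal(0.6, 1, 41), rng.normal(0.0, 1, 41))
replication = bootstrap_effect(rng.normal(0.5, 1, 50), rng.normal(0.0, 1, 50))
print("original study:   ", original)     # estimate and 95% bootstrap interval
print("replication study:", replication)  # comparable magnitude => corroboration
```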


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3544 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
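
The "one third of the cases" figure is simple arithmetic: if two independent studies each have 80% power against a true effect, the probability that exactly one of them reaches significance is 2 × 0.8 × 0.2. A two-line check (illustrative, not taken from the article):

```python
# Chance that exactly one of two independent studies at 80% power is significant
power = 0.80
p_conflict = 2 * power * (1 - power)
print(p_conflict)  # 0.32, i.e. roughly one third of study pairs will 'conflict'
```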


2020 ◽  
Author(s):  
Jonas Tebbe ◽  
Emily Humble ◽  
Martin A. Stoffel ◽  
Lisa J. Tewes ◽  
Caroline Müller ◽  
...  

Abstract. Replication studies are essential for assessing the validity of previous research findings and for probing their generality. However, it has proven challenging to reproduce the results of ecological and evolutionary studies, partly because of the complexity and lability of many of the phenomena being investigated, but also due to small sample sizes, low statistical power and publication bias. Additionally, replication is often considered too difficult in field settings where many factors are beyond the investigator’s control and where spatial and temporal dependencies may be strong. We investigated the feasibility of reproducing original research findings in the field of chemical ecology by attempting to replicate a previous study by our team on Antarctic fur seals (Arctocephalus gazella). In the original study, skin swabs from 41 mother-offspring pairs from two adjacent breeding colonies on Bird Island, South Georgia, were analysed using gas chromatography-mass spectrometry. Seals from the two colonies differed significantly in their chemical fingerprints, suggesting that colony membership may be chemically encoded, and mothers were also chemically similar to their pups, implying that phenotype matching may be involved in mother-offspring recognition. Here, we generated and analysed comparable chemical data from a non-overlapping sample of 50 mother-offspring pairs from the same two colonies five years later. The original results were corroborated in both hypothesis testing and estimation contexts, with p-values remaining highly significant and effect sizes, standardized between studies by bootstrapping the chemical data over individuals, being of comparable magnitude. We furthermore expanded the geographic coverage of our study to include pups from a total of six colonies around Bird Island. Significant chemical differences were observed in the majority of pairwise comparisons, indicating not only that patterns of colony membership persist over time, but also that chemical signatures are colony-specific in general. Our study systematically confirms and extends our previous findings, while also implying that temporal and spatial heterogeneity need not necessarily negate the reproduction and generalization of ecological research findings.


2016 ◽  
Vol 54 (2) ◽  
pp. 260-280 ◽  
Author(s):  
Charlotte Sonne ◽  
Jessica Carlsson ◽  
Per Bech ◽  
Erik Lykke Mortensen

There is a dearth of evidence on the effectiveness of pharmacological treatment for refugees with trauma-related disorders. The present paper provides an overview of available literature on the subject and discusses the transferability of results from studies on other groups of patients with posttraumatic stress disorder (PTSD). We conducted a systematic review of published treatment outcome studies on PTSD and depression among refugees. Fifteen studies were identified and reviewed. Most studies focused on the use of antidepressants. The included studies differed widely in method and quality. The majority were observational studies and case studies. Small sample sizes limited the statistical power. Few studies reported effect sizes, confidence intervals, and the statistical significance of findings. No specific pharmacological treatment for PTSD among refugees can be recommended on the basis of the available literature. There is a need for well-designed clinical trials, especially with newer antidepressants and antipsychotics. Until such studies are available, clinical practice and the design of trials can be guided by results from studies of other groups of PTSD patients, although differences in pharmacogenetics, compliance, and trauma reactions may affect the direct transferability of results from studies on nonrefugee populations.


2020 ◽  
Vol 228 (1) ◽  
pp. 43-49 ◽  
Author(s):  
Michael Kossmeier ◽  
Ulrich S. Tran ◽  
Martin Voracek

Abstract. Currently, dedicated graphical displays to depict study-level statistical power in the context of meta-analysis are unavailable. Here, we introduce the sunset (power-enhanced) funnel plot to visualize this relevant information for assessing the credibility, or evidential value, of a set of studies. The sunset funnel plot highlights the statistical power of primary studies to detect an underlying true effect of interest in the well-known funnel display with color-coded power regions and a second power axis. This graphical display allows meta-analysts to incorporate power considerations into classic funnel plot assessments of small-study effects. Nominally significant, but low-powered, studies might be seen as less credible and as more likely to be affected by selective reporting. We exemplify the application of the sunset funnel plot with two published meta-analyses from medicine and psychology. Software to create this variation of the funnel plot is provided via a tailored R function. In conclusion, the sunset (power-enhanced) funnel plot is a novel and useful graphical display to critically examine and to present study-level power in the context of meta-analysis.
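
The power regions in such a display follow directly from each study's standard error and an assumed true effect. The authors provide a tailored R function for this; as a language-neutral illustration only (this is not the authors' function, and the effect size and standard errors below are made up), per-study power for a two-sided z-test can be computed as follows:

```python
# Illustrative power calculation underlying a power-enhanced funnel plot:
# power of each study's two-sided z-test against an assumed true effect.
from scipy.stats import norm

def study_power(se, delta, alpha=0.05):
    """Power of a two-sided z-test for a study with standard error `se`."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_effect = delta / se
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

# Example: studies of increasing precision, assumed true effect delta = 0.3
for se in (0.30, 0.20, 0.10, 0.05):
    print(f"SE = {se:.2f} -> power = {study_power(se, 0.3):.2f}")
```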

