Vankeusaikaisten kuntoutusohjelmien vaikutusarvioinnit ja arvioitavuus [Impact evaluations and evaluability of prison-based rehabilitation programs]

Kriminologia ◽  
2021 ◽  
Vol 1 (1) ◽  
pp. 39-59
Author(s):  
Sasu Tyni ◽  
Mikko Aaltonen

Systematic rehabilitation program activity has been carried out in Finnish prisons for roughly 20 years. The main goal of this activity is to promote a crime-free way of life and thereby reduce recidivism after release. Verifying effects on recidivism requires evidence from impact evaluations, and such evidence has gradually begun to accumulate in Finland as well. This article reviews the key results of the domestic evaluation studies on the recidivism effects of individual programs and considers the prospects for evaluating program effects from the perspectives of statistical methods, available research data, and the volume of program activity. So far, no clear evidence of the effects of program activity on recidivism has been obtained in Finland. Although the data needed for studying program effects have clearly improved during the 2000s, credible evaluation is still hampered by the lack of strong research designs and the small numbers of participants in individual programs. There is room for improvement in the methodological quality of impact evaluations. Because the effect of an individual program on recidivism is, in light of the available research, probably quite modest, it is not worthwhile to attempt statistical evaluation with very small datasets in the future. Small effects do not, however, mean that programs cannot be worthwhile and cost-effective. A substantial share of current program activity is so small in terms of participant numbers that it is difficult to obtain reliable information on whether it works or not. This situation challenges the evidence-based development of program activity.

Sasu Tyni and Mikko Aaltonen: Evaluation research on rehabilitation programs in prison. Rehabilitation programs have been used systematically in Finnish prisons for two decades. The main aim of these programs is to promote desistance and reduce recidivism after release. Several Finnish evaluations of prison programs have been published in recent years. In this article, we start by reviewing the key results of these studies and consider the possibilities and limitations of quasi-experimental program evaluation in the light of available register data and the scale of program uptake. So far, none of the Finnish evaluations have shown evidence of programs reducing recidivism. Even though the datasets needed for evaluation have clearly improved in the 21st century, credible evaluation of programs' causal effects is still hampered by the lack of strong research designs and low participation rates in most programs. The methodological quality of evaluations should be improved. Given that the available evidence suggests that true effect sizes of programs are likely to be relatively small, large samples are needed to attain enough statistical power to detect effects or their absence. At the same time, even if true effect sizes are small, the programs can still be cost-effective. The low numbers of participants in most prison programs present a challenge to evidence-based program development.

Keywords: prison rehabilitation – recidivism – program evaluation – statistical power – causal inference
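The abstract's point about statistical power and modest effect sizes can be made concrete with a standard sample-size calculation. The sketch below is an illustration, not the authors' analysis; the baseline recidivism rate (60%), the assumed program effect (a 5-percentage-point reduction), the alpha level, and the target power are all assumptions chosen for the example.

```python
# Minimal power-calculation sketch: how many released prisoners per group would
# be needed to detect a modest drop in recidivism, say from 60% to 55%, with a
# two-sided two-proportion z-test. All parameter values are illustrative assumptions.
from scipy.stats import norm

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate sample size per arm for comparing two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

print(round(n_per_group(0.60, 0.55)))   # roughly 1,500 participants per group
```

Under these assumptions, roughly 1,500 participants per group would be needed, which underlines why programs with a few dozen participants are hard to evaluate credibly one by one.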

Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
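The two replication figures quoted in this abstract (one in six at 40% power, one third conflicting at 80% power) follow from elementary probability, assuming a true effect and two independent studies whose results depend only on their shared statistical power. A minimal numerical check, not the authors' code:

```python
# Probability that two independent studies agree or conflict in terms of
# significance, given a true effect and a common statistical power.
def p_both_significant(power: float) -> float:
    """Probability that both of two independent studies are significant."""
    return power * power

def p_conflicting(power: float) -> float:
    """Probability that exactly one of two independent studies is significant."""
    return 2 * power * (1 - power)

print(p_both_significant(0.40))  # 0.16 -> about one study in six replicates a significant result
print(p_conflicting(0.80))       # 0.32 -> conflicting results in about one third of cases
```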


2020 ◽  
Vol 31 (2) ◽  
pp. 330-341
Author(s):  
Charlotte A. Zeamer

This article describes the unique benefits of discourse analysis, a qualitative sociolinguistic research methodology, for evaluating financial literacy counseling. The methodology is especially promising for organizations that may lack the resources to implement “gold standard” large scale, randomized, experimental, or quasi-experimental longitudinal designs. We begin with an overview of problems with program evaluation research on financial literacy interventions, particularly for smaller community service agencies. We lay out the advantages of discourse analysis as an alternative method of assessing program quality. We include a pilot study demonstrating the use of the research approach, and we conclude the description of this study with specific guidelines as to “best practices” indicated by the results. We believe discourse analysis has the potential to make data collection and analysis easier and more effective for counselors and agency staff at community service organizations, especially when the work of program evaluation is being done by the service providers themselves and the client needs may be atypical, complex, or very specific.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3544 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
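The claim that significant effect sizes are biased upwards at less-than-ideal power is easy to verify by simulation. The following sketch is an illustration, not code from the paper; the true effect (d = 0.2), the group size, and the number of simulated studies are arbitrary assumptions chosen so that power is low.

```python
# Simulate many underpowered two-group studies with a small true effect and
# compare the average estimated effect size across all studies with the average
# among studies that happen to pass the p <= 0.05 filter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n, n_studies = 0.2, 50, 10_000    # Cohen's d = 0.2, n = 50 per arm

estimates, significant = [], []
for _ in range(n_studies):
    treatment = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    d = (treatment.mean() - control.mean()) / np.sqrt(
        (treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    estimates.append(d)
    significant.append(stats.ttest_ind(treatment, control).pvalue <= 0.05)

estimates, significant = np.array(estimates), np.array(significant)
print(f"mean estimate, all studies:         {estimates.mean():.2f}")               # close to 0.20
print(f"mean estimate, significant studies: {estimates[significant].mean():.2f}")  # clearly above 0.20
```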


2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
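The abstract's recommendation to read the point estimate together with an interval estimate, rather than a significance verdict, can be illustrated with a minimal sketch; the two simulated groups, their sizes, and the true difference below are arbitrary assumptions, not data from the paper.

```python
# Report a mean difference with its 95% confidence interval alongside the p-value,
# using a pooled-variance two-sample t procedure on simulated placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.3, 1.0, 30)    # hypothetical treatment scores
b = rng.normal(0.0, 1.0, 30)    # hypothetical control scores

n1, n2 = len(a), len(b)
diff = a.mean() - b.mean()                                                   # point estimate
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)  # pooled variance
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)
low, high = diff - t_crit * se, diff + t_crit * se                           # interval estimate
p = stats.ttest_ind(a, b).pvalue                                             # pooled-variance t-test

print(f"difference = {diff:.2f}, 95% CI = [{low:.2f}, {high:.2f}], p = {p:.2f}")
# Whatever the p-value, the interval shows which true differences remain compatible
# with the data; a wide interval around zero is not evidence of 'no effect'.
```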


2011 ◽  
Vol 32 (4) ◽  
pp. 480-493 ◽  
Author(s):  
Omolola A. Adedokun ◽  
Amy L. Childress ◽  
Wilella D. Burgess

A theory-driven approach to evaluation (TDE) emphasizes the development and empirical testing of conceptual models to understand the processes and mechanisms through which programs achieve their intended goals. However, most reported applications of TDE are limited to large-scale experimental or quasi-experimental program evaluation designs, and very few examples of its relevance to nonexperimental designs exist in the literature. Using structural equation modeling to analyze data from the Interns for Indiana (IfI) program, this study demonstrates how evaluation practitioners can test logical and sequential relationships among tiers of outcomes of nonexperimental programs, especially programs with limited datasets. The study also describes how the empirical feedback can be used to understand program dynamics and improve program implementation and evaluation.
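As one possible illustration of testing a tiered outcome model of this kind, the sketch below fits a simple path model. The article does not name its software; here the Python package semopy is assumed, and the variable names (participation, proximal, intermediate, distal) and the simulated data are invented purely for the example.

```python
# Fit a hypothetical tiered path model: participation -> proximal outcomes ->
# intermediate outcomes -> distal outcomes, using lavaan-style model syntax.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(42)
n = 300
participation = rng.binomial(1, 0.5, n).astype(float)
proximal = 0.4 * participation + rng.normal(0, 1, n)      # e.g., skills, attitudes
intermediate = 0.5 * proximal + rng.normal(0, 1, n)       # e.g., program engagement
distal = 0.3 * intermediate + rng.normal(0, 1, n)         # e.g., intended end outcome
df = pd.DataFrame(dict(participation=participation, proximal=proximal,
                       intermediate=intermediate, distal=distal))

desc = """
proximal ~ participation
intermediate ~ proximal
distal ~ intermediate
"""

model = semopy.Model(desc)
model.fit(df)
print(model.inspect())    # path coefficients with standard errors and p-values
```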


2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
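A stripped-down version of this kind of simulation, far simpler than the authors' design, shows how publication bias inflates a naive meta-analytic average; all parameter values below are arbitrary assumptions chosen for illustration.

```python
# Simulate one literature in which nonsignificant studies are published only with
# small probability, then compare the mean published effect with the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n_per_arm, n_studies, p_publish_nonsig = 0.2, 40, 500, 0.1

published = []
for _ in range(n_studies):
    t_group = rng.normal(true_d, 1.0, n_per_arm)
    c_group = rng.normal(0.0, 1.0, n_per_arm)
    d = (t_group.mean() - c_group.mean()) / np.sqrt(
        (t_group.var(ddof=1) + c_group.var(ddof=1)) / 2)
    p = stats.ttest_ind(t_group, c_group).pvalue
    if p <= 0.05 or rng.random() < p_publish_nonsig:   # selective publication
        published.append(d)

print(f"true effect: {true_d}")
print(f"mean published effect: {np.mean(published):.2f}")   # noticeably larger than 0.2
```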


2019 ◽  
Vol 50 (5-6) ◽  
pp. 292-304 ◽  
Author(s):  
Mario Wenzel ◽  
Marina Lind ◽  
Zarah Rowland ◽  
Daniela Zahn ◽  
Thomas Kubiak

Abstract. Evidence on the existence of the ego depletion phenomenon, as well as on the size of the effects and potential moderators and mediators, is ambiguous. Building on a crossover design that enables superior statistical power within a single study, we investigated the robustness of the ego depletion effect between and within subjects, as well as moderating and mediating influences of the ego depletion manipulation checks. Our results, based on a sample of 187 participants, demonstrated that (a) the between- and within-subject ego depletion effects had only negligible effect sizes, (b) there was large interindividual variability, and (c) this variability could not be explained by differences in ego depletion manipulation checks. We discuss the implications of these results and outline a future research agenda.
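For readers unfamiliar with the two effect sizes a crossover design yields, the sketch below (an illustration, not the study's analysis code) computes a between-subject Cohen's d and a within-subject d_z from paired difference scores, using simulated placeholder data.

```python
# Between-subject d versus within-subject d_z, on simulated placeholder data.
import numpy as np

def cohens_d_between(x, y):
    """Between-subject d with a pooled standard deviation (independent groups)."""
    pooled_sd = np.sqrt((np.var(x, ddof=1) + np.var(y, ddof=1)) / 2)
    return (np.mean(x) - np.mean(y)) / pooled_sd

def cohens_dz_within(condition_a, condition_b):
    """Within-subject d_z based on each participant's paired difference score."""
    diff = np.asarray(condition_a) - np.asarray(condition_b)
    return diff.mean() / diff.std(ddof=1)

# Placeholder data: 187 hypothetical participants measured in both conditions.
rng = np.random.default_rng(3)
control = rng.normal(0.0, 1.0, 187)
depletion = control + rng.normal(0.05, 0.5, 187)   # small, noisy within-person shift

print(round(cohens_dz_within(depletion, control), 2))
# In a real crossover, cohens_d_between would be applied to first-period data only,
# where the two conditions come from independent participants.
```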


2021 ◽  
Vol 25 (1) ◽  
pp. 101538
Author(s):  
Diego Feriani ◽  
Ercilia Evangelista Souza ◽  
Larissa Gordilho Mutti Carvalho ◽  
Aline Santos Ibanes ◽  
Eliana Vasconcelos ◽  
...  

2021 ◽  
Vol 36 (1) ◽  
Author(s):  
Kathryn J. DeShaw ◽  
Laura D. Ellingson ◽  
Laura Liechty ◽  
Gabriella M. McLoughlin ◽  
Gregory J. Welk

This study assessed a brief 6-week motivational interviewing (MI) training program for extension field specialists (EFS) involved in supporting a statewide school wellness initiative called SWITCH. A total of 16 EFS were instructed in MI principles to support the programming, and half (n = 8) volunteered to participate in the hybrid (online and in-person) MI training program. Phone calls between EFS and school staff involved in SWITCH were recorded and coded using the Motivational Interviewing Treatment Integrity (MITI) system to capture data on utilization of MI principles. Differences in MI utilization between the trained (n = 8) and untrained (n = 8) EFS were evaluated using Cohen’s d effect sizes. Results revealed large differences for technical global scores (d = 1.5) and moderate effect sizes for relational global components (d = 0.76) between the two groups. This naturalistic, quasi-experimental study indicates that a brief MI training protocol is effective for teaching the spirit and relational components of MI to EFS.
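A hedged illustration of the effect-size comparison described above: Cohen's d between two small groups, plus the Hedges' g small-sample correction that is often advisable with n = 8 per group. The score vectors below are invented placeholders, not MITI data.

```python
# Cohen's d and Hedges' g for two small independent groups of scores.
import numpy as np

def cohens_d(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1))
                        / (len(x) + len(y) - 2))
    return (x.mean() - y.mean()) / pooled_sd

def hedges_g(x, y):
    df = len(x) + len(y) - 2
    correction = 1 - 3 / (4 * df - 1)       # small-sample bias correction
    return correction * cohens_d(x, y)

trained = [3.8, 4.0, 3.5, 4.2, 3.9, 3.6, 4.1, 3.7]     # placeholder global scores
untrained = [3.1, 3.4, 2.9, 3.3, 3.0, 3.5, 3.2, 2.8]
print(round(cohens_d(trained, untrained), 2), round(hedges_g(trained, untrained), 2))
```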

