Since the start of the War on Poverty in the 1960s, social scientists have developed and refined experimental and quasi-experimental methods for evaluating and understanding the ways in which public policies, programs, and interventions affect people’s lives. The overarching mission of many social scientists is to understand “what works” in education and social policy. These are causal questions about whether an intervention, practice, program, or policy affects some outcome of interest. Although causal questions are not the only relevant questions in program evaluation, they are assumed by many in the fields of public health, economics, social policy, and now education to be the scientific foundation for evidence-based decision making. Fortunately, over the last half-century, two methodological advances have improved the rigor of social science approaches for making causal inferences. The first was acknowledging the primacy of research designs over statistical adjustment procedures. Donald Campbell and colleagues showed how research designs could be used to address many plausible threats to validity. The second was the use of potential outcomes to specify exact causal quantities of interest. This allowed researchers to think systematically about research design assumptions and to develop diagnostic measures for assessing when these assumptions are met. This article reviews important statistical methods for estimating the impact of interventions on outcomes in education settings, particularly programs that are implemented in field, rather than laboratory, settings. It begins by describing the causal inference challenge for evaluating program effects. It then discusses four research designs that may be used for estimating program impacts. The article highlights what the Campbell tradition identifies as the strongest causal research designs: the randomized experiment and the regression-discontinuity design.
These approaches have the advantage of transparent assumptions for yielding causal effects. The article then discusses weaker but more commonly used approaches for estimating effects, including the interrupted time series and the non-equivalent comparison group designs. For the interrupted time series design, difference-in-differences is discussed as a more general approach to time series methods; for non-equivalent comparison group designs, the article highlights propensity score matching as a method for creating statistically equivalent groups on the basis of observed covariates. For each research design, references are included that discuss the underlying theory and logic of the method, exemplars of the approach in field settings, and recent methodological extensions to the design. The article concludes with a discussion of practical considerations for evaluating interventions in field settings, including the external validity of estimated effects from impact studies.