Statistical Thinking from Scratch
Latest Publications

Total documents: 13 (five years: 0)
H-index: 0 (five years: 0)
Published by Oxford University Press
ISBN: 9780198827627, 9780191866463

Author(s):  
M. D. Edge

Nonparametric and semiparametric statistical methods assume models whose properties cannot be described by a finite number of parameters. For example, a linear regression model that assumes that the disturbances are independent draws from an unknown distribution is semiparametric—it includes the intercept and slope as regression parameters but has a nonparametric part, the unknown distribution of the disturbances. Nonparametric and semiparametric methods focus on the empirical distribution function, which, assuming that the data are really independent observations from the same distribution, is a consistent estimator of the true cumulative distribution function. In this chapter, with plug-in estimation and the method of moments, functionals or parameters are estimated by treating the empirical distribution function as if it were the true cumulative distribution function. Such estimators are consistent. To understand the variation of point estimates, bootstrapping is used to resample from the empirical distribution function. For hypothesis testing, one can either use a bootstrap-based confidence interval or conduct a permutation test, which can be designed to test null hypotheses of independence or exchangeability. Resampling methods—including bootstrapping and permutation testing—are flexible and easy to implement with a little programming expertise.
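The plug-in and bootstrap ideas described above can be sketched in a few lines of code. The book itself works in R; the following is a minimal Python version using hypothetical data, with the sample mean as the plug-in estimator and a percentile interval from the bootstrap distribution:

```python
import random

random.seed(1)

# Hypothetical sample: independent draws from some unknown distribution.
x = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]

def mean(v):
    return sum(v) / len(v)

# Plug-in estimate: the sample mean, obtained by treating the empirical
# distribution function as if it were the true CDF.
theta_hat = mean(x)

# Bootstrap: resample with replacement from the empirical distribution
# to approximate the sampling variation of the point estimate.
boot_means = []
for _ in range(2000):
    resample = [random.choice(x) for _ in x]
    boot_means.append(mean(resample))

# A simple 95% percentile interval from the sorted bootstrap means.
boot_means.sort()
lo, hi = boot_means[49], boot_means[1949]
```

Permutation tests follow the same resampling logic, except that labels are shuffled rather than observations redrawn, matching a null hypothesis of exchangeability.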



Author(s):  
M. D. Edge

One way to visualize a set of data on two variables is to plot them on a pair of axes. A line that “best fits” the data can then be drawn as a summary. This chapter considers how to define a line of “best” fit—there is no sole best choice. The most commonly chosen line to summarize the data is the “least-squares” line—the line that minimizes the sum of the squared vertical distances between the points and the line. One reason for the least-squares line’s popularity is convenience, but, as will be seen later, it is also related to some key ideas in statistical estimation. The derivations of expressions for the intercept and slope of the least-squares line are discussed.
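The least-squares intercept and slope have closed-form expressions, which can be computed directly. The book works in R; here is a minimal Python sketch with hypothetical paired data:

```python
# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope: the ratio of the sum of cross-deviations to the
# sum of squared deviations of x. This is the line minimizing the sum
# of squared vertical distances between the points and the line.
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))

# The least-squares line passes through (xbar, ybar).
intercept = ybar - slope * xbar
```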



Author(s):  
M. D. Edge

Statistics is concerned with using data to learn about the world. In this book, concepts for reasoning from data are developed using a combination of math and simulation. Using a running example, we will consider probability theory, statistical estimation, and statistical inference. Estimation and inference will be considered from three different perspectives.



Author(s):  
M. D. Edge

Interval estimation is the attempt to define intervals that quantify the degree of uncertainty in an estimate. The standard deviation of an estimate is called a standard error. Confidence intervals are designed to cover the true value of an estimand with a specified probability. Hypothesis testing is the attempt to assess the degree of evidence for or against a specific hypothesis. One tool for frequentist hypothesis testing is the p value, the probability that, if the null hypothesis is in fact true, the data would depart at least as extremely from expectations under the null hypothesis as they were observed to do. In Neyman–Pearson hypothesis testing, the null hypothesis is rejected if p is less than a pre-specified value, often chosen to be 0.05. A test’s power function gives the probability that the null hypothesis is rejected given the significance level γ, a sample size n, and a specified alternative hypothesis. This chapter discusses some limitations of hypothesis testing as commonly practiced in the research literature.
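A power function can be approximated by simulation. The book works in R; the following Python sketch, with hypothetical parameter values, estimates the power of a two-sided test of H0: μ = 0 at significance level 0.05 for normal data with known standard deviation 1:

```python
import random

random.seed(2)

# Hypothetical setup: true mean 0.5, sd 1, sample size 25.
n, mu_true, z_crit = 25, 0.5, 1.96  # 1.96: two-sided critical value at 0.05

rejections = 0
trials = 4000
for _ in range(trials):
    sample = [random.gauss(mu_true, 1.0) for _ in range(n)]
    sample_mean = sum(sample) / n
    # With known sd 1, the standard error of the mean is 1/sqrt(n).
    z = sample_mean / (1.0 / n ** 0.5)
    if abs(z) > z_crit:
        rejections += 1

# Proportion of simulated datasets in which H0 was rejected.
power_hat = rejections / trials
```

For this alternative the true power is about 0.70; repeating the simulation across a grid of alternatives traces out the full power function.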



Author(s):  
M. D. Edge

This chapter considers the rules of probability. Probabilities are non-negative and sum to one, and the probability that either of two mutually exclusive events occurs is the sum of the probabilities of the two events. Two events are said to be independent if the probability that they both occur is the product of the probabilities that each event occurs. Bayes’ theorem is used to update probabilities on the basis of new information, and it is shown that the conditional probabilities P(A|B) and P(B|A) are not, in general, the same. Finally, the chapter discusses ways in which distributions of random variables can be described, using probability mass functions for discrete random variables and probability density functions for continuous random variables.
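The gap between P(A|B) and P(B|A) is easy to demonstrate numerically with Bayes’ theorem. The numbers below are hypothetical (a diagnostic test with 1% prevalence, 95% sensitivity, and a 5% false-positive rate), and the sketch is in Python rather than the book’s R:

```python
# Hypothetical quantities.
p_cond = 0.01            # P(condition)
p_pos_given_cond = 0.95  # P(positive | condition)
p_pos_given_none = 0.05  # P(positive | no condition)

# Law of total probability: P(positive).
p_pos = p_pos_given_cond * p_cond + p_pos_given_none * (1 - p_cond)

# Bayes' theorem: P(condition | positive).
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos
```

Here P(positive | condition) is 0.95, but P(condition | positive) is only about 0.16, because the condition is rare.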



Author(s):  
M. D. Edge

There are two traditional ways to learn statistics. One way is to pass over the mathematical underpinnings and focus on developing relatively shallow knowledge about a wide variety of statistical procedures. Another is to spend years learning the mathematics necessary for traditional mathematical approaches to statistics. For many people who need to analyze data, neither of these paths is sufficient. The shallow-but-wide approach fails to provide students with the foundation that allows for confidence and creativity in analyzing modern datasets, and many researchers—though possibly motivated to learn math—do not have the background to start immediately on a traditional mathematical approach. This book exists to help researchers jump between tracks, providing motivated students whose knowledge of mathematics may be incomplete or rusty with a serious introduction to statistics that allows further study from more mathematical sources. This is done by focusing on a single statistical technique that is fundamental to statistical practice—simple linear regression—and supplementing the exposition with ample simulations conducted in the statistical programming language R. The first half of the book focuses on preliminaries, including the use of R and probability theory, whereas the second half covers statistical estimation and inference from semiparametric, parametric, and Bayesian perspectives.



Author(s):  
M. D. Edge

If it is reasonable to assume that the data are generated by a fully parametric model, then maximum-likelihood approaches to estimation and inference have many appealing properties. Maximum-likelihood estimators are obtained by identifying parameters that maximize the likelihood function, which can be done using calculus or using numerical approaches. Such estimators are consistent, and if the costs of errors in estimation are described by a squared-error loss function, then they are also efficient compared with their consistent competitors. The sampling variance of a maximum-likelihood estimate can be estimated in various ways. As always, one possibility is the bootstrap. In many models, the variance of the maximum-likelihood estimator can be derived directly once its form is known. A third approach is to rely on general properties of maximum-likelihood estimators and use the Fisher information. Similarly, there are many ways to test hypotheses about parameters estimated by maximum likelihood. This chapter discusses the Wald test, which relies on the fact that the sampling distribution of maximum-likelihood estimators is normal in large samples, and the likelihood-ratio test, which is a general approach for testing hypotheses relating nested pairs of models.
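Maximizing a likelihood numerically, as the abstract mentions, can be sketched briefly. The book works in R; the following Python example uses hypothetical data assumed to come from an exponential distribution with unknown rate λ, maximizing the log-likelihood by a crude grid search (calculus gives the closed form λ̂ = 1/mean(x) for this model, so the two can be compared):

```python
import math

# Hypothetical data, assumed exponential with unknown rate lam.
x = [0.8, 1.5, 0.3, 2.2, 1.1, 0.6]

def log_likelihood(lam):
    # Exponential log-likelihood: n*log(lam) - lam * sum(x).
    return len(x) * math.log(lam) - lam * sum(x)

# Numerical maximization over a grid of candidate rates.
grid = [i / 1000 for i in range(1, 5000)]
lam_hat = max(grid, key=log_likelihood)
```

In practice an optimizer (such as R’s optim) replaces the grid search, but the principle is the same: the estimate is the parameter value at which the likelihood function is largest.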



Author(s):  
M. D. Edge

This chapter marks a turning point. The preceding two chapters considered probability theory, which describes the kinds of data that result from specified processes. The remainder of the book considers statistical estimation and inference, which starts with data and attempts to draw conclusions about the process that produced them. First, general concepts in statistical estimation and inference are discussed, and then simple linear regression is examined from nonparametric/semiparametric, parametric frequentist, and Bayesian perspectives.



Author(s):  
M. D. Edge

In this chapter, the behavior of random variables is summarized using the concepts of expectation, variance, and covariance. The expectation is a measurement of the location of a random variable’s distribution. The variance and its square root, the standard deviation, are measurements of the spread of a random variable’s distribution. Covariance and correlation are measurements of the extent of linear relationship between two random variables. The chapter also describes two important theorems about the distribution of means of samples from a distribution. As the sample size becomes larger, the distribution of the sample mean becomes bunched more tightly around the expectation—this is the law of large numbers—and the distribution of the sample mean approaches the shape of a normal distribution—this is the central limit theorem. Finally, a model describing a linear relationship between two random variables is considered, and the properties of those two random variables are analyzed.
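The law of large numbers lends itself to a quick simulation. The book works in R; this Python sketch draws repeated samples from a uniform(0, 1) distribution (expectation 0.5) and shows that means of larger samples cluster more tightly around the expectation:

```python
import random

random.seed(3)

def sample_means(n, reps=2000):
    # Means of `reps` samples, each of size n, from uniform(0, 1).
    return [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

def spread(means):
    # Standard deviation of the simulated sample means.
    m = sum(means) / len(means)
    return (sum((v - m) ** 2 for v in means) / len(means)) ** 0.5

small_spread = spread(sample_means(5))    # means of samples of size 5
large_spread = spread(sample_means(100))  # means of samples of size 100
```

The spread shrinks roughly by a factor of sqrt(100/5) here; a histogram of the same simulated means would also look increasingly normal as n grows, illustrating the central limit theorem.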



Author(s):  
M. D. Edge

Becoming a well-rounded data analyst requires more than the skills covered in this book. This postlude sketches some ways in which the types of thinking covered here can be extended to real problems in data analysis. Different ways of evaluating the assumptions of linear regression are considered, including plotting, hypothesis tests, and out-of-sample prediction. If the assumptions are not met, simple linear regression can be extended in various ways, including multiple regression, generalized linear models, and mixed models (among many other possibilities). This postlude concludes with a short discussion of the themes of the book: probabilistic models, methodological pluralism, and the value of elementary statistical thinking.


