$$ \newcommand{\bs}{\boldsymbol} \newcommand{\mb}{\mathbf} \newcommand{\E}{\mathbb{E}} \newcommand{\V}{\mathbb{V}} \newcommand{\var}{\text{var}} \newcommand{\cov}{\text{cov}} \newcommand{\N}{\mathcal{N}} \newcommand{\Bern}{\text{Bern}} \newcommand{\Bin}{\text{Bin}} \newcommand{\Pois}{\text{Pois}} \newcommand{\Unif}{\text{Unif}} \newcommand{\se}{\textsf{se}} \newcommand{\au}{\underline{a}} \newcommand{\du}{\underline{d}} \newcommand{\Au}{\underline{A}} \newcommand{\Du}{\underline{D}} \newcommand{\xu}{\underline{x}} \newcommand{\Xu}{\underline{X}} \newcommand{\Yu}{\underline{Y}} \renewcommand{\P}{\mathbb{P}} \newcommand{\U}{\mb{U}} \newcommand{\Xbar}{\overline{X}} \newcommand{\Ybar}{\overline{Y}} \newcommand{\real}{\mathbb{R}} \newcommand{\bbL}{\mathbb{L}} \renewcommand{\u}{\mb{u}} \renewcommand{\v}{\mb{v}} \newcommand{\M}{\mb{M}} \newcommand{\X}{\mb{X}} \newcommand{\Xmat}{\mathbb{X}} \newcommand{\bfx}{\mb{x}} \newcommand{\y}{\mb{y}} \renewcommand{\bfbeta}{\bs{\beta}} \newcommand{\e}{\bs{\epsilon}} \newcommand{\bhat}{\widehat{\bs{\beta}}} \newcommand{\XX}{\Xmat'\Xmat} \newcommand{\XXinv}{\left(\XX\right)^{-1}} \newcommand{\hatsig}{\hat{\sigma}^2} \newcommand{\red}[1]{\textcolor{red!60}{#1}} \newcommand{\indianred}[1]{\textcolor{indianred}{#1}} \newcommand{\blue}[1]{\textcolor{blue!60}{#1}} \newcommand{\dblue}[1]{\textcolor{dodgerblue}{#1}} \newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\inprob}{\overset{p}{\to}} \newcommand{\indist}{\overset{d}{\to}} \newcommand{\eframe}{\end{frame}} \newcommand{\bframe}{\begin{frame}} \newcommand{\R}{\textsf{\textbf{R}}} \newcommand{\Rst}{\textsf{\textbf{RStudio}}} \newcommand{\rfun}[1]{\texttt{\color{magenta}{#1}}} \newcommand{\rpack}[1]{\textbf{#1}} \newcommand{\rexpr}[1]{\texttt{\color{magenta}{#1}}} \newcommand{\filename}[1]{\texttt{\color{blue}{#1}}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

4  Hypothesis tests

Up to now, we have discussed the properties of estimators that allow us to characterize their distributions in finite and large samples. These properties might let us say that, for example, our estimated difference in means is equal to a true average treatment effect on average across repeated samples or that it will converge to the true value in large samples. These properties, however, are properties of repeated samples. As researchers, we will only have access to a single sample. Statistical inference is the process of using our single sample to learn about population parameters. Several ways to conduct inference are connected, but one of the most ubiquitous in the sciences is the hypothesis test, which is a kind of statistical thought experiment.

4.1 The lady tasting tea

The lady tasting tea exemplifies the core ideas behind hypothesis testing due to R.A. Fisher.1 Fisher had prepared tea for his colleague, the algologist Muriel Bristol. Knowing that she preferred milk in her tea, he poured milk into a tea cup and then poured the hot tea into the milk. Bristol rejected the cup, stating that she preferred pouring the tea first, then milk. Fisher was skeptical of the idea that anyone could tell the difference between a cup poured milk-first and one poured tea-first. So he and another colleague, William Roach, devised a test to see if Bristol could distinguish the two preparation methods.

Fisher and Roach prepared 8 cups of tea, four milk-first and four tea-first. They then presented the cups to Bristol in a random order (though she knew there were 4 of each type), and she proceeded to identify all of the cups correctly. At first glance, this seems like good evidence that she can tell the difference between the two types, but a skeptic like Fisher raised the question: “could she have just been randomly guessing and got lucky?” This led Fisher to a statistical thought experiment: what would the probability of guessing the correct cups be if she were guessing randomly?

To calculate the probability of Bristol’s achievement, we can note that “randomly guessing” here would mean that she was selecting a group of 4 cups to be labeled milk-first from the 8 cups available. Using basic combinatorics, we can calculate there are 70 ways to choose 4 cups among 8, but only 1 of those arrangements would be correct. Thus, if randomly guessing means choosing among those 70 options with equal chance, then the probability of guessing the right set of cups is 1/70 or \(\approx 0.014\). The low probability implies that the hypothesis of random guessing may be implausible.
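The counting argument above can be checked directly in R. This is a minimal sketch; the variable names are ours. The last line relies on a standard fact not stated in the text: under random guessing, the number of milk-first cups correctly identified follows a hypergeometric distribution.

```r
# Number of ways to label 4 of the 8 cups as milk-first
n_arrangements <- choose(8, 4)
n_arrangements

# Probability of picking the single correct arrangement by random guessing
p_all_correct <- 1 / n_arrangements
round(p_all_correct, 3)

# Null distribution of the number of milk-first cups identified correctly
# (0 through 4) when 4 of the 8 cups are chosen at random
null_dist <- dhyper(0:4, m = 4, n = 4, k = 4)
round(null_dist, 3)
```

The last entry of `null_dist` is the probability of identifying all four cups, which matches the 1/70 computed by counting.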

The story of the lady tasting tea encapsulates many of the core elements of hypothesis testing. Hypothesis testing is about taking our observed estimate (Bristol guessing all the cups correctly) and seeing how likely that observed estimate would be under some assumption or hypothesis about the data-generating process (Bristol was randomly guessing). When the observed estimate is unlikely under the maintained hypothesis, we might view this as evidence against that hypothesis. Thus, hypothesis tests help us assess evidence for particular guesses about the DGP.

Notation alert

For the rest of this chapter, we’ll introduce the concepts following the notation in the past chapters. We’ll usually assume that we have a random (iid) sample of random variables \(X_1, \ldots, X_n\) from a distribution, \(F\). We’ll focus on estimating some parameter, \(\theta\), of this distribution (like the mean, median, variance, etc.). We’ll refer to \(\Theta\) as the set of possible values of \(\theta\), or the parameter space.

4.2 Hypotheses

In the context of hypothesis testing, hypotheses are just statements about the population distribution. In particular, we will make statements that \(\theta = \theta_0\) where \(\theta_0 \in \Theta\) is the hypothesized value of \(\theta\). Hypotheses are ubiquitous in empirical work, but here are some examples to give you a flavor:

  • The population proportion of US citizens that identify as Democrats is 0.33.
  • The population difference in average voter turnout between households who received get-out-the-vote mailers vs. those who did not is 0.
  • The difference in the average incidence of human rights abuse in countries that signed a human rights treaty vs. those countries that did not sign is 0.

Each of these is a statement about the true DGP. The latter two are very common: when \(\theta\) represents the difference in means between two groups, then \(\theta = 0\) is the hypothesis of no actual difference in population means or no treatment effect (if the causal effect is identified).

The goal of hypothesis testing is to adjudicate between two complementary hypotheses.

Definition 4.1 The two hypotheses in a hypothesis test are called the null hypothesis and the alternative hypothesis, denoted as \(H_0\) and \(H_1\), respectively.

These hypotheses are complementary, so if the null hypothesis \(H_0: \theta \in \Theta_0\), then the alternative hypothesis is \(H_1: \theta \in \Theta_0^c\). The “null” in null hypothesis might seem odd until you realize that most null hypotheses are that there is no effect of some treatment or no difference in means. For example, suppose \(\theta\) is the difference in mean support for expanding legal immigration between a treatment group that received a pro-immigrant message and some facts about immigration and a control group that just received the factual information. Then, the typical null hypothesis would be no difference in means or \(H_0: \theta = 0\), and the alternative would be \(H_1: \theta \neq 0\).

There are two types of tests that differ in the form of their null and alternative hypotheses. A two-sided test is of the form \[ H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \neq \theta_0, \] where the “two-sided” part refers to how the alternative contains values of \(\theta\) above and below the null value \(\theta_0\). A one-sided test has the form \[ H_0: \theta \leq \theta_0 \quad\text{versus}\quad H_1: \theta > \theta_0, \] or \[ H_0: \theta \geq \theta_0 \quad\text{versus}\quad H_1: \theta < \theta_0. \] Two-sided tests are much more common in the social sciences, where we want to know if there is any evidence, positive or negative, against the presumption of no treatment effect or no relationship between two variables. One-sided tests are for situations where we only want evidence in one direction, which is rarely relevant to social science research. One-sided tests also have the downside of being misused to inflate the strength of evidence against the null and should be avoided. Unfortunately, the math of two-sided tests is also more complicated.

4.3 The procedure of hypothesis testing

At the most basic level, a hypothesis test is a rule that specifies values of the sample data for which we will decide to reject the null hypothesis. Let \(\mathcal{X}_n\) be the range of the sample—that is, all possible vectors \((x_1, \ldots, x_n)\) that have positive probability of occurring. Then, a hypothesis test describes a region of this space, \(R \subset \mathcal{X}_n\), called the rejection region where when \((X_1, \ldots, X_n) \in R\) we will reject \(H_0\) and when the data is outside this region, \((X_1, \ldots, X_n) \notin R\) we retain, accept, or fail to reject the null hypothesis.2

How do we decide what the rejection region should be? Even though we define the rejection region in terms of the sample space, \(\mathcal{X}_n\), it’s unwieldy to work with the entire vector of data. Instead, we often formulate the rejection region in terms of a test statistic, \(T = T(X_1, \ldots, X_n)\), where the rejection region becomes \[ R = \left\{(x_1, \ldots, x_n) : T(x_1, \ldots, x_n) > c\right\}, \] where \(c\) is called the critical value. This expression says that the rejection region is the part of the sample space that makes the test statistic sufficiently large. We reject null hypotheses when the observed data is incompatible with those hypotheses, where the test statistic should be a measure of this incompatibility. Note that the test statistic is a random variable and has a distribution—we will exploit this to understand the different properties of a hypothesis test.

Example 4.1 Suppose that \((X_1, \ldots, X_n)\) represents a sample of US citizens where \(X_i = 1\) indicates support for the current US president and \(X_i = 0\) means no support. We might be interested in the test of the null hypothesis that the president does not have the support of a majority of American citizens. Let \(\mu = \E[X_i] = \P(X_i = 1)\). Then, a one-sided test would compare the two hypotheses: \[ H_0: \mu \leq 0.5 \quad\text{versus}\quad H_1: \mu > 0.5. \] In this case, we might use the sample mean as the test statistic, so that \(T(X_1, \ldots, X_n) = \Xbar_n\) and we have to find some threshold above 0.5 such that we would reject the null, \[ R = \left\{(x_1, \ldots, x_n): \Xbar_n > c\right\}. \] In words, how much support should we see for the current president before we reject the notion that they lack majority support? Below we will select the critical value, \(c\), to have beneficial statistical properties.

The structure of a rejection region will depend on whether a test is one- or two-sided. One-sided tests will take the form \(T > c\), whereas two-sided tests will take the form \(|T| > c\) since we want to count deviations from either side of the null hypothesis as evidence against that null.

4.4 Testing errors

Hypothesis tests end with a decision to reject the null hypothesis or not, but this might be an incorrect decision. In particular, there are two ways to make errors and two ways to be correct in this setting, as shown in Table 4.1. The labels are confusing, but it’s helpful to remember that type I errors (said “type one”) are labeled so because they are the worse of the two types of errors. These errors occur when we reject a null (say there is a true treatment effect or relationship) when the null is true (there is no true treatment effect or relationship). Type I errors are what we see in the replication crisis: lots of “significant” effects that turn out later to be null. Type II errors (said “type two”) are considered less problematic: there is a true relationship, but we cannot detect it with our test (we cannot reject the null).

Table 4.1: Typology of testing errors

| Decision | \(H_0\) True | \(H_0\) False |
|---|---|---|
| Retain \(H_0\) | Awesome | Type II error |
| Reject \(H_0\) | Type I error | Great |

Ideally, we would minimize the chances of making either a type I or type II error. Unfortunately, because the test statistic is a random variable, we cannot remove the probability of an error altogether. Instead, we will derive tests with some guaranteed performance to minimize the probability of type I error. To derive this, we can define the power function of a test, \[ \pi(\theta) = \P\left( \text{Reject } H_0 \mid \theta \right) = \P\left( T \in R \mid \theta \right), \] which is the probability of rejection as a function of the parameter of interest, \(\theta\). The power function tells us, for example, how likely we are to reject the null of no treatment effect as we vary the actual size of the treatment effect.

We can define the probability of type I error from the power function.

Definition 4.2 The size of a hypothesis test with the null hypothesis \(H_0: \theta = \theta_0\) is \[ \pi(\theta_0) = \P\left( \text{Reject } H_0 \mid \theta_0 \right). \]

You can think of the size of a test as the rate of false positives (or false discoveries) produced by the test. Figure 4.1 shows an example of rejection regions, size, and power for a one-sided test. In the left panel, we have the distribution of the test statistic under the null, with \(H_0: \theta = \theta_0\), and the rejection region is defined by values \(T > c\). The shaded grey region is the probability of rejection under this null hypothesis or the size of the test. Sometimes, we will get extreme samples by random chance, even under the null, leading to false discoveries.3

In the right panel, we overlay the distribution of the test statistic under one particular alternative, \(\theta = \theta_1 > \theta_0\). The red-shaded region is the probability of rejecting the null when this alternative is true or the power—it’s the probability of correctly rejecting the null when it is false. Intuitively, we can see that alternatives that produce test statistics closer to the rejection region will have higher power. This makes sense: detecting big deviations from the null should be easier than detecting minor ones.

Figure 4.1: Size of a test and power against an alternative.

Figure 4.1 also hints at a tradeoff between size and power. Notice that we could make the size smaller (lower the false positive rate) by increasing the critical value to \(c' > c\). This would make the probability of being in the rejection region smaller, \(\P(T > c' \mid \theta_0) < \P(T > c \mid \theta_0)\), leading to a lower-sized test. Unfortunately, it would also reduce power in the right panel since the probability of being in the rejection region will be lower under any alternative, \(\P(T > c' \mid \theta_1) < \P(T > c \mid \theta_1)\). This means we usually cannot simultaneously reduce both types of errors.
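To make the tradeoff concrete, here is a small numerical sketch. It assumes, purely for illustration, that the test statistic is \(\N(0, 1)\) under the null and \(\N(2, 1)\) under one particular alternative. Both the alternative and the two critical values are hypothetical choices of ours.

```r
# Illustrative assumption: T ~ N(0, 1) under the null and T ~ N(2, 1)
# under one particular alternative (hypothetical numbers)
crit <- c(c = 1.645, c_prime = 2.326)

# Size: probability that T exceeds the critical value under the null
size <- pnorm(crit, mean = 0, sd = 1, lower.tail = FALSE)

# Power: probability that T exceeds the critical value under the alternative
power <- pnorm(crit, mean = 2, sd = 1, lower.tail = FALSE)

round(rbind(size, power), 3)
```

Moving from `c` to the larger `c_prime` lowers both rows of the output: the test makes fewer false positives but also detects the alternative less often.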

4.5 Determining the rejection region

If we cannot simultaneously optimize a test’s size and power, how should we determine where the rejection region is? That is, how should we decide what empirical evidence will be strong enough for us to reject the null? The standard approach to this problem in hypothesis testing is to control the size of a test (that is, control the rate of false positives) and try to maximize the power of the test subject to that constraint. So we say, “I’m willing to accept that at most x% of findings will be false positives,” and do whatever we can to maximize power subject to that constraint.

Definition 4.3 A test has significance level \(\alpha\) if its size is less than or equal to \(\alpha\), or \(\pi(\theta_0) \leq \alpha\).

A test with a significance level of \(\alpha = 0.05\) will have a false positive/type I error rate no larger than 0.05. This level is widespread in the social sciences, though you will also see \(\alpha = 0.01\) or \(\alpha = 0.1\). Frequentists justify this by saying that with \(\alpha = 0.05\), at most 5% of studies in which the null is true will produce false discoveries.

Our task is to construct the rejection region so that the null distribution of the test statistic \(G_0(t) = \P(T \leq t \mid \theta_0)\) has no more than \(\alpha\) probability in that region. One-sided tests like in Figure 4.1 are the easiest to show, even though we warned you not to use them. We want to choose \(c\) that puts no more than \(\alpha\) probability in the tail, or \[ \P(T > c \mid \theta_0) = 1 - G_0(c) \leq \alpha. \] Remember that smaller values of \(c\) give higher power, which implies that the critical value with the maximum power that still maintains the significance level satisfies \(1 - G_0(c) = \alpha\). We can use the quantile function of the null distribution to find the exact value of \(c\) we need, \[ c = G^{-1}_0(1 - \alpha), \] which is just fancy math to say, “the value below which \(1-\alpha\) of the null distribution lies.”
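As a quick sketch of the quantile calculation, suppose (as an assumption for this example only) that the null distribution \(G_0\) is standard normal, so that `qnorm()` plays the role of \(G_0^{-1}\).

```r
alpha <- 0.05

# One-sided critical value, c = G_0^{-1}(1 - alpha), assuming a
# standard normal null distribution
c_one <- qnorm(1 - alpha)
c_one

# Check: the probability of T > c under the null equals alpha
size_one <- pnorm(c_one, lower.tail = FALSE)
size_one
```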

The determination of the rejection region follows the same principles for two-sided tests, but it is slightly more complicated because we reject when the magnitude of the test statistic is large, \(|T| > c\). Figure 4.2 shows that basic setup. Notice that because there are two (disjoint) regions, we can write the size (false positive rate) as \[ \pi(\theta_0) = G_0(-c) + 1 - G_0(c) \] In most cases that we will see, the null distribution for such a test will be symmetric around 0 (usually asymptotically standard normal, actually), which means that \(G_0(-c) = 1 - G_0(c)\), which implies that the size is \[ \pi(\theta_0) = 2(1 - G_0(c)). \] Solving for the critical value that would make this \(\alpha\) gives \[ c = G^{-1}_0(1 - \alpha/2). \] Again, this formula can seem dense, but remember what you are doing: finding the value that puts \(\alpha/2\) of the probability of the null distribution in each tail.

Figure 4.2: Rejection regions for a two-sided test.
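The two-tail decomposition of the size can be verified numerically. The sketch below again assumes a standard normal null distribution and computes each tail separately.

```r
alpha <- 0.05

# Two-sided critical value, c = G_0^{-1}(1 - alpha / 2), assuming a
# standard normal null distribution
c_two <- qnorm(1 - alpha / 2)

# Probability in each tail of the null distribution
lower_tail <- pnorm(-c_two)     # G_0(-c)
upper_tail <- 1 - pnorm(c_two)  # 1 - G_0(c)

# The two tails are equal by symmetry and sum to the size of the test
c(lower_tail = lower_tail, upper_tail = upper_tail,
  size = lower_tail + upper_tail)
```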

4.6 Hypothesis tests of the sample mean

Let’s go through an extended example about hypothesis testing of a sample mean, sometimes called a one-sample test. Let’s say \(X_i\) are feeling thermometer scores about “liberals” as a group on a scale of 0 to 100, with values closer to 0 indicating cooler feelings about liberals and values closer to 100 indicating warmer feelings about liberals. We want to know if the population average differs from a neutral value of 50. We can write this two-sided test as \[ H_0: \mu = 50 \quad\text{versus}\quad H_1: \mu \neq 50, \] where \(\mu = \E[X_i]\). The standard test statistic for this type of test is the so-called t-statistic, \[ T = \frac{\left( \Xbar_n - \mu_0 \right)}{\sqrt{s^2 / n}} =\frac{\left( \Xbar_n - 50 \right)}{\sqrt{s^2 / n}}, \] where \(\mu_0\) is the null value of interest and \(s^2\) is the sample variance. If the null hypothesis is true, then by the CLT, we know that the t-statistic is asymptotically normal, \(T \indist \N(0, 1)\). Thus, we can approximate the null distribution with the standard normal!

Let’s create a test with level \(\alpha = 0.05\). Then we need to find the rejection region that puts \(0.05\) probability in the tails of the null distribution, which we just saw was \(\N(0,1)\). Let \(\Phi()\) be the CDF for the standard normal and let \(\Phi^{-1}()\) be the quantile function for the standard normal. Drawing on what we developed above, you can find the value \(c\) so that \(\P(|T| > c \mid \mu_0)\) is 0.05 with \[ c = \Phi^{-1}(1 - 0.05/2) \approx 1.96, \] which means that a test where we reject when \(|T| > 1.96\) would have a level of 0.05 asymptotically.
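Here is a sketch of this test in R. The thermometer scores are simulated, not taken from any survey, and the simulation settings (sample size, mean, and spread) are arbitrary choices of ours.

```r
set.seed(1234)

# Simulated feeling thermometer scores on a 0-100 scale (hypothetical data)
n <- 500
x <- pmin(pmax(round(rnorm(n, mean = 47, sd = 25)), 0), 100)

# t-statistic for H_0: mu = 50
mu_0 <- 50
t_stat <- (mean(x) - mu_0) / sqrt(var(x) / n)

# Two-sided critical value for a level 0.05 test
crit <- qnorm(1 - 0.05 / 2)

t_stat
abs(t_stat) > crit  # TRUE means we reject H_0 at the 0.05 level
```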

4.7 The Wald test

We can generalize the hypothesis test for the sample mean to estimators more broadly. Let \(\widehat{\theta}_n\) be an estimator for some parameter \(\theta\) and let \(\widehat{\textsf{se}}[\widehat{\theta}_n]\) be a consistent estimate of the standard error of the estimator, \(\textsf{se}[\widehat{\theta}_n] = \sqrt{\V[\widehat{\theta}_n]}\). We consider the two-sided test \[ H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \neq \theta_0. \]

In many cases, our estimators will be asymptotically normal by a version of the CLT so that under the null hypothesis, we have \[ T = \frac{\widehat{\theta}_n - \theta_0}{\widehat{\textsf{se}}[\widehat{\theta}_n]} \indist \N(0, 1). \] The Wald test rejects \(H_0\) when \(|T| > z_{\alpha/2}\), where \(z_{\alpha/2}\) is the value that puts \(\alpha/2\) in the upper tail of the standard normal. That is, if \(Z \sim \N(0, 1)\), then \(z_{\alpha/2}\) satisfies \(\P(Z \geq z_{\alpha/2}) = \alpha/2\).


In R, you can find the \(z_{\alpha/2}\) values easily with the qnorm() function:

qnorm(0.05 / 2, lower.tail = FALSE)
[1] 1.959964

Theorem 4.1 Asymptotically, the Wald test has size \(\alpha\) such that \[ \P(|T| > z_{\alpha/2} \mid \theta_0) \to \alpha. \]

This result is very general, and it means that many, many hypothesis tests based on estimators will have the same form. The main difference across estimators will be how we calculate the estimated standard error.
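A small simulation can illustrate the asymptotic size result. This is a sketch under assumptions we chose for the example: the data are exponential with true mean 1, the null \(H_0: \mu = 1\) is true, and the estimator is the sample mean.

```r
set.seed(2024)

# Hypothetical setup: exponential data with true mean 1, so H_0: mu = 1 holds
n_sims <- 5000
n <- 200
mu_0 <- 1

reject <- replicate(n_sims, {
  x <- rexp(n, rate = 1)
  t_stat <- (mean(x) - mu_0) / sqrt(var(x) / n)
  abs(t_stat) > qnorm(0.975)  # two-sided Wald test at alpha = 0.05
})

# Share of simulated samples in which we (falsely) reject the null
size_hat <- mean(reject)
size_hat
```

Even though the data are far from normal, the rejection rate should land near 0.05 at this sample size.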

Example 4.2 (Difference in proportions) In get-out-the-vote (GOTV) experiments, we might randomly assign a group of citizens to receive mailers encouraging them to vote, whereas a control group receives no message. We’ll define the turnout variables in the treatment group \(Y_{1}, Y_{2}, \ldots, Y_{n_t}\) as iid draws from a Bernoulli distribution with success probability \(p_t\), which represents the population turnout rate among treated citizens. The outcomes in the control group \(X_{1}, X_{2}, \ldots, X_{n_c}\) are iid draws from another Bernoulli distribution with success probability \(p_c\), which represents the population turnout rate among citizens not receiving a mailer.

Our goal is to learn about the effect of this treatment on whether or not the citizen votes, \(\tau = p_t - p_c\), and we will use the sample difference in means/proportions as our estimator, \(\widehat{\tau} = \Ybar - \Xbar\). To perform a Wald test, we need to know/estimate the standard error of this estimator. Notice that because these are independent samples, the variance is \[ \V[\widehat{\tau}] = \V[\Ybar - \Xbar] = \V[\Ybar] + \V[\Xbar] = \frac{p_t(1-p_t)}{n_t} + \frac{p_c(1-p_c)}{n_c}, \] where the third equality comes from the fact that the underlying outcome variables \(Y_i\) and \(X_j\) are binary. Obviously, we do not know the true population proportions \(p_t\) and \(p_c\) (that’s why we’re doing the test!), but we can estimate the standard error by replacing them with their estimates \[ \widehat{\textsf{se}}[\widehat{\tau}] = \sqrt{\frac{\Ybar(1 -\Ybar)}{n_t} + \frac{\Xbar(1-\Xbar)}{n_c}}. \]

The typical null hypothesis test, in this case, is “no treatment effect” vs. “some treatment effect” or \[ H_0: \tau = p_t - p_c = 0 \quad\text{versus}\quad H_1: \tau \neq 0, \] which gives the following test statistic for the Wald test \[ T = \frac{\Ybar - \Xbar}{\sqrt{\frac{\Ybar(1 -\Ybar)}{n_t} + \frac{\Xbar(1-\Xbar)}{n_c}}}. \] If we wanted a test with level \(\alpha = 0.01\), we would reject the null when \(|T| > 2.58\) since

qnorm(0.01/2, lower.tail = FALSE)
[1] 2.575829
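The full calculation for this example might look like the following sketch. The turnout data are simulated, and the group sizes and turnout rates are hypothetical values we picked for illustration.

```r
set.seed(42)

# Hypothetical GOTV experiment with simulated (not real) turnout data
n_t <- 1000
n_c <- 1000
y <- rbinom(n_t, size = 1, prob = 0.38)  # turnout in the treatment group
x <- rbinom(n_c, size = 1, prob = 0.33)  # turnout in the control group

# Difference in proportions and its estimated standard error
tau_hat <- mean(y) - mean(x)
se_hat <- sqrt(mean(y) * (1 - mean(y)) / n_t + mean(x) * (1 - mean(x)) / n_c)

# Wald statistic for H_0: tau = 0 and the level 0.01 decision
t_stat <- tau_hat / se_hat
reject_01 <- abs(t_stat) > qnorm(0.01 / 2, lower.tail = FALSE)

c(tau_hat = tau_hat, se_hat = se_hat, t_stat = t_stat)
reject_01
```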

Example 4.3 (Difference in means) Let’s take a similar setting to the last example with randomly assigned treatment and control groups, but now the treatment is an appeal for donations, and the outcomes are continuous measures of how much a person donated to the political campaign. Now the treatment data \(Y_1, \ldots, Y_{n_t}\) are iid draws from a population with mean \(\mu_t = \E[Y_i]\) and population variance \(\sigma^2_t = \V[Y_i]\). The control data \(X_1, \ldots, X_{n_c}\) are iid draws (independent of the \(Y_i\)) from a population with mean \(\mu_c = \E[X_i]\) and population variance \(\sigma^2_c = \V[X_i]\). The parameter of interest is similar to before: the population difference in means, \(\tau = \mu_t - \mu_c\), and we’ll form the usual hypothesis test of \[ H_0: \tau = \mu_t - \mu_c = 0 \quad\text{versus}\quad H_1: \tau \neq 0. \]

The only difference between this setting and the difference in proportions is that the standard error will be different because we cannot rely on the Bernoulli. Instead, we’ll use our knowledge of the sampling variance of the sample means and independence between the samples to derive \[ \V[\widehat{\tau}] = \V[\Ybar] + \V[\Xbar] = \frac{\sigma^2_t}{n_t} + \frac{\sigma^2_c}{n_c}, \] where we can estimate the unknown population variances with the sample variances, \[ \widehat{\se}[\widehat{\tau}] = \sqrt{\frac{s^2_t}{n_t} + \frac{s^2_c}{n_c}}. \] We can use this estimator to derive the Wald test statistic of \[ T = \frac{\widehat{\tau} - 0}{\widehat{\se}[\widehat{\tau}]} = \frac{\Ybar - \Xbar}{\sqrt{\frac{s^2_t}{n_t} + \frac{s^2_c}{n_c}}}, \] and if we want an asymptotic level of 0.05, we can reject when \(|T| > 1.96\).
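A sketch of this test with simulated donation amounts follows. The gamma distributions and group sizes are hypothetical choices of ours, not part of the example.

```r
set.seed(7)

# Hypothetical donation amounts in dollars (simulated, right-skewed)
n_t <- 400
n_c <- 450
y <- rgamma(n_t, shape = 2, scale = 15)  # treatment group
x <- rgamma(n_c, shape = 2, scale = 12)  # control group

# Difference in means and its estimated standard error
tau_hat <- mean(y) - mean(x)
se_hat <- sqrt(var(y) / n_t + var(x) / n_c)

# Wald statistic for H_0: tau = 0
t_stat <- tau_hat / se_hat
t_stat
abs(t_stat) > 1.96  # TRUE means we reject at the (asymptotic) 0.05 level
```

For comparison, base R's `t.test(y, x)` computes the same statistic by default (the unequal-variance, or Welch, form), though it compares it to a \(t\) distribution rather than the standard normal.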

4.8 p-values

The hypothesis testing framework focuses on actually making a decision in the face of uncertainty. You choose a level of wrongness you are comfortable with (rate of false positives) and then decide null vs. alternative based firmly on the rejection region. When we’re not making a decision, we are somewhat artificially discarding information about the strength of evidence. We “accept” the null if \(T = 1.95\) in the last example but reject it if \(T = 1.97\) even though these two situations are actually very similar. Just reporting the reject/retain decision also fails to give us a sense of at what other levels we might have rejected the null. Again, this makes sense if we need to make a single decision: other tests don’t matter because we carefully considered our \(\alpha\) level test. But in the lower-stakes world of the academic social sciences, we can afford to be more informative.

One alternative to reporting the reject/retain decision is to report a p-value.

Definition 4.4 The p-value of a test is the probability, computed under the null hypothesis, of observing a test statistic at least as extreme as the observed test statistic in the direction of the alternative hypothesis.

The phrase “in the direction of the alternative hypothesis” deals with the unfortunate headache of one-sided versus two-sided tests. Write \(t_{obs}\) for the observed value of the test statistic. For a one-sided test where larger values of \(T\) correspond to more evidence for \(H_1\), the p-value is \[ \P(T(X_1,\ldots,X_n) > t_{obs} \mid \theta_0) = 1 - G_0(t_{obs}), \] whereas for a (symmetric) two-sided test, we have \[ \P(|T(X_1, \ldots, X_n)| > |t_{obs}| \mid \theta_0) = 2(1 - G_0(|t_{obs}|)). \]

In either case, the interpretation of the p-value is the same. It is the smallest size \(\alpha\) at which a test would reject the null. Presenting a p-value allows the reader to determine their own \(\alpha\) level and determine quickly if the evidence would warrant rejecting \(H_0\) in that case. Thus, the p-value is a more continuous measure of evidence against the null, where lower values are stronger evidence against the null because the observed result is less likely under the null.
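With a standard normal null distribution, both p-values are one line each in R. The sketch below uses the two hypothetical test statistics from the earlier discussion, 1.95 and 1.97, to show how similar the evidence is on either side of the 1.96 cutoff.

```r
# Two hypothetical observed test statistics on either side of 1.96
t_obs <- c(1.95, 1.97)

# One-sided p-value: probability of a larger statistic under the null
p_one_sided <- pnorm(t_obs, lower.tail = FALSE)

# Two-sided p-value: probability of a larger magnitude under the null
p_two_sided <- 2 * pnorm(abs(t_obs), lower.tail = FALSE)

round(cbind(t_obs, p_one_sided, p_two_sided), 4)
```

The two-sided p-values fall just above and just below 0.05, which conveys more than the bare retain/reject decisions would.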

There is a lot of controversy surrounding p-values, but most of it focuses on arbitrary p-value cutoffs for determining statistical significance and sometimes publication decisions. These problems are not the fault of p-values but rather the hyperfixation on the reject/retain decision for arbitrary test levels like \(\alpha = 0.05\). It might be best to view p-values as a transformation of the test statistic onto a common scale between 0 and 1.


People use many statistical shibboleths to purportedly identify those who don’t understand statistics. These usually hinge on seemingly subtle differences in interpretation that are easy to miss. If you know the core concepts, the statistical shibboleths tend to be overblown, but it would be malpractice not to flag them for you.

The shibboleth with p-values is that sometimes people interpret them as “the probability that the null hypothesis is true.” Of course, this doesn’t make sense from our definition because the p-value conditions on the null hypothesis, so it cannot tell us anything about the probability of that null hypothesis. Instead, the metaphor you should always carry is that hypothesis tests are statistical thought experiments and that p-values answer the question: how likely would a result at least as extreme as mine be if the null were true?

4.9 Power analysis

Imagine you have spent a large research budget on a big experiment to test your amazing theory, and the results come back and… you fail to reject the null of no treatment effect. When this happens, there are two possible states of the world: the null is true, and you correctly identified that, or the null is false but the test had low power to detect the true effect. Because of this uncertainty after the fact, it is common for researchers to conduct power analyses before running studies to forecast what sample size is necessary to make it likely that they can reject the null under a hypothesized effect size.

Generally, power analyses involve calculating the power function \(\pi(\theta) = \P(T(X_1, \ldots, X_n) \in R \mid \theta)\) for different values of \(\theta\). They might also involve sample size calculations for a particular alternative, \(\theta_1\). In that case, we try to find the sample size \(n\) to make the power \(\pi(\theta_1)\) as close to a particular value (often 0.8) as possible. It is possible to solve for this sample size in simple one-sided tests explicitly. Still, for more general situations or two-sided tests, we typically need numerical or simulation-based approaches to find the optimal sample size.

With Wald tests, we can characterize the power function quite easily, even if it does not allow us to back out sample size calculations easily.

Theorem 4.2 For a Wald test with an asymptotically normal estimator, the power function for a particular alternative \(\theta_1 \neq \theta_0\) is \[ \pi(\theta_1) = 1 - \Phi\left( \frac{\theta_0 - \theta_1}{\widehat{\se}[\widehat{\theta}_n]} + z_{\alpha/2} \right) + \Phi\left( \frac{\theta_0 - \theta_1}{\widehat{\se}[\widehat{\theta}_n]}-z_{\alpha/2} \right). \]
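The formula in Theorem 4.2 translates directly into a short R function. This is a sketch: the function name `wald_power()` and the standard error of 0.02 are hypothetical, and the standard error is treated as a known constant.

```r
# Power of a two-sided Wald test against the alternative theta_1,
# treating the standard error as known (a simplifying assumption)
wald_power <- function(theta_1, theta_0 = 0, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  shift <- (theta_0 - theta_1) / se
  1 - pnorm(shift + z) + pnorm(shift - z)
}

# Power for several hypothetical effect sizes with se = 0.02
effects <- c(0, 0.02, 0.04, 0.056, 0.08)
power <- wald_power(theta_1 = effects, theta_0 = 0, se = 0.02)
round(cbind(effects, power), 3)
```

At `theta_1 = theta_0` the function returns \(\alpha\), the size of the test, and power rises as the alternative moves away from the null. With this standard error, an effect of about 2.8 standard errors (0.056) gives power of roughly 0.8.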

4.10 Exact tests under normal data

The Wald test above relies on large sample approximations. In finite samples, these approximations may not be valid. Can we get exact inferences at any sample size? Yes, if we make stronger assumptions about the data. In particular, assume a parametric model for the data where \(X_1,\ldots,X_n\) are i.i.d. samples from \(N(\mu,\sigma^2)\). Under the null of \(H_0: \mu = \mu_0\), we can show that \[ T_n = \frac{\Xbar_n - \mu_0}{s_n/\sqrt{n}} \sim t_{n-1}, \] where \(t_{n-1}\) is the Student’s t-distribution with \(n-1\) degrees of freedom. This result implies the null distribution is \(t\), so we use quantiles of \(t\) for critical values. For a one-sided test, \(c = G^{-1}_0(1 - \alpha)\), but now \(G_0\) is \(t\) with \(n-1\) df, and so we use qt() instead of qnorm() to calculate these critical values.

The critical values for the \(t\) distribution are always larger than the normal because the t has fatter tails, as shown in Figure 4.3. As \(n\to\infty\), however, the \(t\) converges to the standard normal, and so it is asymptotically equivalent to the Wald test but slightly more conservative in finite samples. Oddly, most software packages calculate p-values and rejection regions based on the \(t\) to exploit this conservativeness.

Figure 4.3: Normal versus t distribution.
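The comparison between the two sets of critical values is easy to tabulate. The sample sizes below are arbitrary choices for illustration.

```r
alpha <- 0.05
n <- c(5, 10, 30, 100, 1000)

# Two-sided critical values from the t distribution with n - 1 df
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Two-sided critical value from the standard normal
z_crit <- qnorm(1 - alpha / 2)

round(cbind(n, t_crit, z_crit), 3)
```

The `t_crit` column is always above `z_crit` and shrinks toward it as `n` grows.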

4.11 Confidence intervals and hypothesis tests

At first glance, we may seem sloppy in using \(\alpha\) in deriving a \(1 - \alpha\) confidence interval in the last chapter and an \(\alpha\)-level test in this chapter. In reality, we were foreshadowing the deep connection between the two: every \(1-\alpha\) confidence interval contains all null hypotheses that we would not reject with an \(\alpha\)-level test.

This connection is easiest to see with an asymptotically normal estimator, \(\widehat{\theta}_n\). Consider the hypothesis test of \[ H_0: \theta = \theta_0 \quad \text{vs.}\quad H_1: \theta \neq \theta_0, \] using the test statistic, \[ T = \frac{\widehat{\theta}_{n} - \theta_{0}}{\widehat{\se}[\widehat{\theta}_{n}]}. \] As we discussed earlier, an \(\alpha = 0.05\) test would reject this null when \(|T| > 1.96\), or when \[ |\widehat{\theta}_{n} - \theta_{0}| > 1.96 \widehat{\se}[\widehat{\theta}_{n}]. \] Notice that this will be true when \[ \theta_{0} < \widehat{\theta}_{n} - 1.96\widehat{\se}[\widehat{\theta}_{n}]\quad \text{ or }\quad \widehat{\theta}_{n} + 1.96\widehat{\se}[\widehat{\theta}_{n}] < \theta_{0} \] or, equivalently, when the null value is outside of the 95% confidence interval, \[\theta_0 \notin \left[\widehat{\theta}_{n} - 1.96\widehat{\se}[\widehat{\theta}_{n}], \widehat{\theta}_{n} + 1.96\widehat{\se}[\widehat{\theta}_{n}]\right].\] Of course, our choice of the null hypothesis was arbitrary, which means that any null hypothesis outside the 95% confidence interval would be rejected by an \(\alpha = 0.05\) level test of that null. And any null hypothesis inside the confidence interval is a null hypothesis that we would not reject.
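We can check this equivalence numerically by testing a grid of null values against a single estimate. The estimate, standard error, and grid below are hypothetical numbers chosen for illustration.

```r
# Hypothetical estimate and estimated standard error
theta_hat <- 0.12
se_hat <- 0.05
z <- qnorm(0.975)

# 95% confidence interval
ci <- theta_hat + c(-1, 1) * z * se_hat

# Test each candidate null value on a grid at alpha = 0.05
theta_0 <- seq(-0.1, 0.4, by = 0.01)
reject <- abs((theta_hat - theta_0) / se_hat) > z

# The rejected nulls are exactly those outside the confidence interval
outside_ci <- theta_0 < ci[1] | theta_0 > ci[2]
all(reject == outside_ci)

# Range of null values that are not rejected
range(theta_0[!reject])
```

The range of retained null values approximates the confidence interval up to the resolution of the grid, which is the idea behind building an interval by testing many candidate nulls.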

This relationship holds more broadly. Any \(1-\alpha\) confidence interval contains all possible parameter values that would not be rejected as the null hypothesis of an \(\alpha\)-level hypothesis test. This connection can be handy for two reasons:

  1. We can quickly determine if we would reject a null hypothesis at some level by inspecting if it falls in a confidence interval.
  2. In some situations, determining a confidence interval might be difficult, but performing a hypothesis test is straightforward. Then, we can find the rejection region for the test and determine what null hypotheses would not be rejected at level \(\alpha\) to formulate the \(1-\alpha\) confidence interval. We call this process inverting a test. A critical application of this method is for formulating confidence intervals for treatment effects based on randomization inference in the finite population analysis of experiments.

  1. The analysis here largely comes from Senn (2012).↩︎

  2. Different people and different textbooks describe what to do when we do not reject the null hypothesis in different ways. The terminology is not so important so long as you understand that rejecting the null does not mean the null is logically false, and “accepting” the null does not mean the null is logically true.↩︎

  3. Eagle-eyed readers will notice that the null tested here is a point, while we previously defined the null in a one-sided test as a region \(H_0: \theta \leq \theta_0\). Technically, the size of the test will vary based on which of these nulls we pick. In this example, notice that any null to the left of \(\theta_0\) will result in a lower size. And so, the null at the boundary, \(\theta_0\), will maximize the size of the test, making it the most “conservative” null to investigate. Technically, we should define the size of a test as \(\alpha = \sup_{\theta \in \Theta_0} \pi(\theta)\).↩︎