3 Asymptotics
3.1 Introduction
Suppose we are still interested in estimating the proportion of citizens who prefer increasing legal immigration. Based on the last chapter, a good strategy would be to use the sample proportion of immigration supporters in a random sample of citizens. You would have good reason to be confident in this estimator given its finite-sample properties, such as unbiasedness and a known sampling variance. We call these "finite-sample" properties since they hold at any sample size: they are as true for small random samples as for enormous ones.
Finite-sample results, though, are of limited value because they only tell us about the center and spread of the sampling distribution of an estimator, not its overall shape.
In this chapter, we take a different approach by asking what happens to the sampling distribution of estimators as the sample size gets very large, which we refer to as asymptotic theory. While asymptotics will often simplify derivations, an essential point is that everything we do with asymptotics will be an approximation. No one ever has infinite data, but we hope that the approximations will be closer to the truth as our samples get larger.
Asymptotic results are key to modern statistical methods because many methods of quantifying uncertainty about estimates rely on asymptotic approximations. We will rely on the asymptotic results we derive in this chapter to estimate standard errors, construct confidence intervals, and perform hypothesis tests, all without assuming a fully parametric model.
3.2 Convergence of deterministic sequences
A helpful place to begin is by reviewing the basic idea of convergence in deterministic sequences from calculus:
Definition 3.1 A sequence of numbers, $a_1, a_2, \ldots$, converges to a limit $a$ if, for every $\varepsilon > 0$, there is some $n_{\varepsilon}$ such that $|a_n - a| < \varepsilon$ for all $n > n_{\varepsilon}$.

We say that $a$ is the limit of the sequence and write $a_n \to a$ or $\lim_{n \to \infty} a_n = a$.
Example 3.1 One important sequence that arises often in statistics is $a_n = 1/n$, which converges to 0.

Let us pick a specific value of $\varepsilon$, say $\varepsilon = 0.1$. Then $|1/n - 0| < 0.1$ whenever $n > 10$.

More generally, for any $\varepsilon > 0$, we have $|1/n - 0| < \varepsilon$ for all $n > 1/\varepsilon$, which establishes that $1/n \to 0$.
We will mostly not use such formal definitions to establish a limit but, rather, rely on the properties of limits. For example, convergence and limits follow basic arithmetic operations. Suppose that we have two sequences with limits $a_n \to a$ and $b_n \to b$. Then:

- $a_n + b_n \to a + b$
- $a_n b_n \to ab$
- $a_n / b_n \to a / b$ if $b \neq 0$

These rules plus the result in Example 3.1 allow us to prove other useful facts, such as $c/n \to 0$ for any constant $c$.
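To see the limit rules at work numerically, here is a quick check (the particular sequences are illustrative, not from the text): with $a_n = 2 + 1/n$ and $b_n = 3 - 1/n$, the product rule says $a_n b_n \to 2 \cdot 3 = 6$.

```r
# Numerical check of the limit arithmetic rules (illustrative sequences):
# a_n = 2 + 1/n converges to 2 and b_n = 3 - 1/n converges to 3,
# so the product rule says a_n * b_n should approach 6.
a_n <- function(n) 2 + 1 / n
b_n <- function(n) 3 - 1 / n
n_vals <- c(10, 100, 10000)
prod_n <- a_n(n_vals) * b_n(n_vals)
prod_n  # values creep toward 6 as n grows
```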
Can we apply a similar definition of convergence to sequences of random variables (like estimators)? Possibly. Some examples clarify why this might be difficult.1 Suppose we have a sequence of iid random variables: every element of the sequence is random, so the sequence as a whole never settles down to a single number the way a deterministic sequence can.
Another example highlights subtle problems with a sequence of random variables converging to a single value. Suppose we have a sequence of random variables whose distributions concentrate ever more tightly around some value: even then, each element typically retains some small probability of being far from that value, so the deterministic definition of convergence never applies exactly.
3.3 Convergence in probability and consistency
A sequence of random variables can converge in several different ways. The first type of convergence deals with sequences converging to a single value.2
Definition 3.2 A sequence of random variables, $Z_1, Z_2, \ldots$, converges in probability to a value $b$ if, for every $\varepsilon > 0$,
$$\mathbb{P}\left(|Z_n - b| > \varepsilon\right) \to 0$$
as $n \to \infty$.
What's happening in this definition? The event $|Z_n - b| > \varepsilon$ occurs when $Z_n$ is more than $\varepsilon$ away from the limit $b$. Convergence in probability says that, for any $\varepsilon$ we choose, no matter how small, the probability of this event goes to 0 as $n$ grows.
Example 3.2 Let's illustrate the definition of convergence in probability by constructing a sequence of random variables, $Z_1, Z_2, \ldots$, each with mean 0 and with a variance that shrinks as $n$ grows (say, $\mathbb{V}[Z_n] = 1/n$).
We can see intuitively that this sequence will be centered at zero with a shrinking variance. Below, we will see that this is enough to establish convergence in probability of $Z_n$ to 0.
Let $\varepsilon$ be any small positive tolerance. As the variance shrinks, less and less of the distribution of $Z_n$ can lie outside the window $[-\varepsilon, \varepsilon]$, so $\mathbb{P}(|Z_n| > \varepsilon)$ heads to 0.
Sometimes convergence in probability is written as $Z_n \overset{p}{\to} b$ or as $\operatorname{plim}(Z_n) = b$.
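A small simulation can make convergence in probability concrete. As a sketch (assuming, in the spirit of Example 3.2, mean-zero normal draws with variance $1/n$; the normality is an illustrative choice), we can estimate $\mathbb{P}(|Z_n| > \varepsilon)$ at several values of $n$:

```r
# Simulating convergence in probability: estimate P(|Z_n| > eps) for a
# sequence with mean 0 and variance 1/n (normal draws are an assumption
# made for illustration).
set.seed(42)
eps <- 0.1
n_vals <- c(10, 100, 1000)
prob_far <- sapply(n_vals, function(n) {
  z_n <- rnorm(5000, mean = 0, sd = sqrt(1 / n))
  mean(abs(z_n) > eps)  # fraction of draws farther than eps from 0
})
prob_far  # should shrink toward 0 as n grows
```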
Convergence in probability is crucial for evaluating estimators. While we said that unbiasedness was not the be-all and end-all of properties of estimators, the following property is an essential and fundamental property of good estimators.
Definition 3.3 An estimator $\widehat{\theta}_n$ of a quantity $\theta$ is consistent if $\widehat{\theta}_n \overset{p}{\to} \theta$.
Consistency of an estimator implies that the sampling distribution of the estimator “collapses” on the true value as the sample size gets large. An estimator is inconsistent if it converges in probability to any other value. As the sample size gets large, the probability that an inconsistent estimator will be close to the truth will approach 0. Generally speaking, consistency is a very desirable property of an estimator.
Estimators can be inconsistent yet still converge in probability to an understandable quantity. For example, we will discuss in later chapters that regression coefficients estimated by ordinary least squares (OLS) are consistent for the conditional expectation if the conditional expectation is linear. If that function is non-linear, however, then OLS will be consistent for the best linear approximation to that function. While not ideal, it does mean that this estimator is at least consistent for an interpretable quantity.
We can also define convergence in probability for a sequence of random vectors, writing $\mathbf{Z}_n \overset{p}{\to} \mathbf{b}$, which holds when every entry of $\mathbf{Z}_n$ converges in probability to the corresponding entry of $\mathbf{b}$.
3.4 Useful inequalities
At first glance, establishing an estimator’s consistency will be difficult. How can we know if a distribution will collapse to a specific value without knowing the shape or family of the distribution? It turns out that there are certain relationships between the mean and variance of a random variable and certain probability statements that hold for all distributions (that have finite variance, at least). These relationships are key to establishing results that do not depend on a specific distribution.
Theorem 3.1 (Markov Inequality) For any r.v. $X$ and any $\delta > 0$,
$$\mathbb{P}\left(|X| \geq \delta\right) \leq \frac{\mathbb{E}[|X|]}{\delta}.$$
Proof. Note that we can write $|X| \geq \delta\, \mathbb{1}(|X| \geq \delta)$, since when the indicator equals 1 we have $|X| \geq \delta$, and when it equals 0 the right-hand side is 0. Taking expectations of both sides gives $\mathbb{E}[|X|] \geq \delta\, \mathbb{P}(|X| \geq \delta)$, and dividing both sides by $\delta$ completes the proof.
In words, Markov's inequality says that the probability of a random variable being large in magnitude cannot be high if the average is not large in magnitude. Blitzstein and Hwang (2019) provide an excellent intuition behind this result using income as an example. Let $X$ be the income of a randomly selected individual, which must be non-negative. Markov's inequality implies that no more than half of all individuals can earn at least twice the average income; otherwise, those individuals alone would push the average above its actual value.
It's pretty astounding how general this result is since it holds for all random variables. Of course, its generality comes at the expense of not being very informative. If the threshold $\delta$ is not much larger than the mean of $|X|$, the bound is close to or above 1 and so tells us almost nothing.
Theorem 3.2 (Chebyshev Inequality) Suppose that $X$ has finite mean $\mu$ and finite variance $\sigma^2$. Then for any $\delta > 0$,
$$\mathbb{P}\left(|X - \mu| \geq \delta\right) \leq \frac{\sigma^2}{\delta^2}.$$
Proof. To prove this, we only need to square both sides of the inequality inside the probability statement and apply Markov's inequality:
$$\mathbb{P}\left(|X - \mu| \geq \delta\right) = \mathbb{P}\left((X - \mu)^2 \geq \delta^2\right) \leq \frac{\mathbb{E}\left[(X - \mu)^2\right]}{\delta^2} = \frac{\sigma^2}{\delta^2}.$$
Chebyshev's inequality is a straightforward extension of the Markov result: the probability of a random variable being far from its mean (that is, of $|X - \mu|$ being large) is limited by its variance, so the smaller the variance, the less probability mass can lie far from $\mu$.
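Because Chebyshev's inequality holds for any distribution with finite variance, we can check it by simulation. A sketch using an exponential distribution (an illustrative choice of test case):

```r
# Checking Chebyshev's inequality by simulation: for any k > 0,
# P(|X - mu| >= k * sigma) <= 1 / k^2 for any finite-variance distribution.
# Test case: Exponential(rate = 0.5), which has mean 2 and sd 2.
set.seed(1)
x <- rexp(1e5, rate = 0.5)
mu <- 2
sigma <- 2
k <- 2
empirical <- mean(abs(x - mu) >= k * sigma)  # simulated tail probability
bound <- 1 / k^2                             # Chebyshev's upper bound
c(empirical = empirical, chebyshev_bound = bound)
```

As expected, the empirical tail probability sits well below the distribution-free bound, illustrating that the bound is valid but conservative.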
3.5 The law of large numbers
We can now use these inequalities to show how estimators can be consistent for their target quantities of interest without making parametric assumptions. Why are these inequalities helpful? Remember that convergence in probability was about the probability of an estimator being far away from a value going to zero. Chebyshev’s inequality shows that we can bound these exact probabilities.
The most famous consistency result has a special name.
Theorem 3.3 (Weak Law of Large Numbers) Let $X_1, \ldots, X_n$ be iid random variables with finite mean $\mu = \mathbb{E}[X_i]$ and finite variance $\sigma^2 = \mathbb{V}[X_i]$. Then $\overline{X}_n \overset{p}{\to} \mu$.
Proof. Recall that the sample mean is unbiased, so $\mathbb{E}[\overline{X}_n] = \mu$, and that its sampling variance is $\sigma^2 / n$. Applying Chebyshev's inequality gives, for any $\varepsilon > 0$,
$$\mathbb{P}\left(|\overline{X}_n - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2},$$
and the right-hand side goes to 0 as $n \to \infty$.
The weak law of large numbers (WLLN) shows that, under general conditions, the sample mean gets closer to the population mean as the sample size grows.
The naming of the “weak” law of large numbers seems to imply the existence of a “strong” law of large numbers (SLLN), which is true. The SLLN states that the sample mean converges to the population mean with probability 1. This type of convergence, called almost sure convergence, is stronger than convergence in probability, which only says that the probability of the sample mean being close to the population mean converges to 1. While it is nice to know that this stronger form of convergence holds for the sample mean under the same assumptions, it is rare for researchers outside of theoretical probability and statistics to rely on almost sure convergence.
Example 3.3 Seeing how the distribution of the sample mean changes as a function of the sample size allows us to appreciate the WLLN. We can see this by taking repeated iid samples of different sizes from an exponential random variable with rate parameter 0.5, so that $\mathbb{E}[X_i] = 1/0.5 = 2$.
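A sketch of this simulation (the sample sizes and number of repetitions are illustrative choices): repeated iid Exponential(rate = 0.5) samples, whose population mean is 2, with the sampling distribution of the mean tightening as $n$ grows.

```r
# WLLN in action: the sampling distribution of the mean of iid
# Exponential(rate = 0.5) draws (population mean 1 / 0.5 = 2)
# concentrates around 2 as the sample size increases.
set.seed(123)
mean_dist <- function(n, reps = 2000) {
  xbars <- replicate(reps, mean(rexp(n, rate = 0.5)))
  c(center = mean(xbars), spread = sd(xbars))
}
spreads <- sapply(c(10, 100, 1000), mean_dist)
colnames(spreads) <- c("n=10", "n=100", "n=1000")
spreads  # centers stay near 2 while the spread shrinks toward 0
```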
The WLLN also holds for random vectors in addition to random variables. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be iid random vectors with mean vector $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}_i]$, and let $\overline{\mathbf{X}}_n$ be the vector of sample means of the entries.
Theorem 3.4 If $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are iid random vectors and the mean of each entry is finite, then $\overline{\mathbf{X}}_n \overset{p}{\to} \boldsymbol{\mu}$.
Note that many of the formal results presented so far have "moment conditions" requiring that certain moments are finite. For the vector WLLN, we saw that this applied to the mean of each variable in the vector. Some books use a shorthand for this: $\mathbb{E}\|\mathbf{X}_i\| < \infty$, where $\|\cdot\|$ is the Euclidean norm, which holds exactly when each entry of the vector has a finite mean.
3.6 Consistency of estimators
The WLLN shows that the sample mean of iid draws is consistent for the population mean, which is a massive result given that so many estimators are sample means of potentially complicated functions of the data. What about other estimators? The proof of the WLLN points to one way to determine that an estimator is consistent: if it is unbiased and the sampling variance shrinks as the sample size grows.
Theorem 3.5 For any estimator $\widehat{\theta}_n$, if $\mathbb{E}[\widehat{\theta}_n] = \theta$ (unbiasedness) and $\mathbb{V}[\widehat{\theta}_n] \to 0$ as $n \to \infty$, then $\widehat{\theta}_n \overset{p}{\to} \theta$.
Thus, for unbiased estimators, if we can characterize their sampling variances, we should be able to tell whether they are consistent. This result is handy since working directly with the probability statements used in the WLLN proof can sometimes be confusing.
What about biased estimators? Consider a situation where we calculate average household income, $\overline{X}_n$, and then want to transform it, perhaps by taking its logarithm to put it on a more interpretable scale. The following result shows that convergence in probability survives these kinds of transformations, among other useful operations.
Theorem 3.6 (Properties of convergence in probability) Let $X_n$ and $Z_n$ be sequences of random variables such that $X_n \overset{p}{\to} a$ and $Z_n \overset{p}{\to} b$. Then:

- $X_n + Z_n \overset{p}{\to} a + b$
- $X_n Z_n \overset{p}{\to} ab$
- $X_n / Z_n \overset{p}{\to} a / b$ if $b \neq 0$
- (continuous mapping theorem) $h(X_n) \overset{p}{\to} h(a)$ if $h$ is continuous at $a$.
We can now see that many of the nasty problems with expectations and nonlinear functions are made considerably easier with convergence in probability in the asymptotic setting. So while we know that, in general, $\mathbb{E}[h(\overline{X}_n)] \neq h(\mathbb{E}[\overline{X}_n])$ for nonlinear $h$, the continuous mapping theorem tells us that $h(\overline{X}_n) \overset{p}{\to} h(\mu)$ whenever $h$ is continuous.
Example 3.4 Suppose we implemented a survey by randomly selecting a sample of size $n$ from the population, but some sampled citizens do not respond to our question of interest, and our target is the mean outcome among the types of people who respond.
The relevant estimator for this quantity is the mean of the outcome among those who responded, which is slightly more complicated than a typical sample mean because the denominator is a random variable:
$$\widehat{\mu}_R = \frac{\sum_{i=1}^{n} R_i Y_i}{\sum_{i=1}^{n} R_i},$$
where $R_i$ indicates whether unit $i$ responded and $Y_i$ is the outcome.
We can establish consistency of our estimator, though, by noting that we can rewrite the estimator as a ratio of sample means (with $R_i$ indicating response and $Y_i$ the outcome):
$$\widehat{\mu}_R = \frac{\frac{1}{n}\sum_{i=1}^{n} R_i Y_i}{\frac{1}{n}\sum_{i=1}^{n} R_i}.$$
By the WLLN, the numerator converges in probability to $\mathbb{E}[R_i Y_i]$ and the denominator to $\mathbb{E}[R_i]$, so by the properties of convergence in probability the ratio converges to $\mathbb{E}[R_i Y_i] / \mathbb{E}[R_i]$, provided $\mathbb{E}[R_i] > 0$.
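The ratio-of-sample-means rewriting is an algebraic identity, which a quick simulated check makes concrete (the data and the names `r` and `y` here are illustrative; `r` plays the role of the response indicator and `y` the outcome):

```r
# The respondent mean equals a ratio of two sample means, exactly:
# mean(y[r == 1]) == mean(r * y) / mean(r), since the 1/n factors cancel.
set.seed(11)
n <- 1000
r <- rbinom(n, size = 1, prob = 0.6)  # who responds (illustrative)
y <- rnorm(n, mean = 5)               # outcome (illustrative)
direct <- mean(y[r == 1])             # mean among respondents
ratio <- mean(r * y) / mean(r)        # ratio of two sample means
c(direct = direct, ratio = ratio)     # identical up to floating point
```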
Keeping the difference between unbiased and consistent clear in your mind is essential. You can easily create ridiculous unbiased estimators that are inconsistent. Let's return to our iid sample, $X_1, \ldots, X_n$, and consider the "estimator" that simply reports the first observation, $\widehat{\mu} = X_1$. It is unbiased for the population mean since $\mathbb{E}[X_1] = \mu$, but its sampling distribution is the same at every sample size, so it never collapses onto $\mu$ and is therefore inconsistent.
Estimators that are biased but consistent are often much more interesting. We already saw one such estimator in Example 3.4, but there are many more. Maximum likelihood estimators, for example, are (under some regularity conditions) consistent for the parameters of a parametric model but are often biased.
To study these estimators, we can broaden Theorem 3.5 to the class of asymptotically unbiased estimators, whose bias vanishes as the sample size grows.
Theorem 3.7 For any estimator $\widehat{\theta}_n$, if $\mathbb{E}[\widehat{\theta}_n] - \theta \to 0$ (asymptotic unbiasedness) and $\mathbb{V}[\widehat{\theta}_n] \to 0$ as $n \to \infty$, then $\widehat{\theta}_n \overset{p}{\to} \theta$.
Proof. Using Markov's inequality, we have, for any $\varepsilon > 0$,
$$\mathbb{P}\left(|\widehat{\theta}_n - \theta| \geq \varepsilon\right) = \mathbb{P}\left((\widehat{\theta}_n - \theta)^2 \geq \varepsilon^2\right) \leq \frac{\mathbb{E}\left[(\widehat{\theta}_n - \theta)^2\right]}{\varepsilon^2} = \frac{\left(\mathbb{E}[\widehat{\theta}_n] - \theta\right)^2 + \mathbb{V}[\widehat{\theta}_n]}{\varepsilon^2},$$
and both terms in the numerator go to 0 by assumption.
We can use this result to show consistency for a large range of estimators.
Example 3.5 (Plug-in variance estimator) In the last chapter, we introduced the plug-in estimator for the population variance, $\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$. This estimator is biased in finite samples, since $\mathbb{E}[\widehat{\sigma}^2] = \frac{n-1}{n}\sigma^2$, but the bias of $-\sigma^2/n$ vanishes as $n$ grows, so it is asymptotically unbiased and consistent for $\sigma^2$.
3.7 Convergence in distribution and the central limit theorem
Convergence in probability and the law of large numbers are beneficial for understanding how our estimators will (or will not) collapse to their estimand as the sample size increases. But what about the shape of the sampling distribution of our estimators? For statistical inference, we would like to be able to make probability statements such as $\mathbb{P}(a \leq \widehat{\theta}_n \leq b)$, and such statements require knowing the shape of the sampling distribution, not just its center and spread.
We need first to describe a weaker form of convergence to see how we will develop these approximations.
Definition 3.4 Let $Z_1, Z_2, \ldots$ be a sequence of random variables with cdfs $F_1, F_2, \ldots$, and let $Z$ be a random variable with cdf $F$. The sequence converges in distribution to $Z$, written $Z_n \overset{d}{\to} Z$, if $F_n(z) \to F(z)$ as $n \to \infty$ at every point $z$ where $F$ is continuous.
Essentially, convergence in distribution means that as $n$ grows, the distribution of $Z_n$ becomes ever closer to the distribution of $Z$, so we can use the distribution of $Z$, called the asymptotic (or limiting) distribution, as an approximation in large samples.
Example 3.6 A simple example of convergence in distribution would be the sequence $Z_n = X + 1/n$ for some random variable $X$: as $n$ grows, the distribution of $Z_n$ slides toward the distribution of $X$, so $Z_n \overset{d}{\to} X$.
One of the most remarkable results in probability and statistics is that a large class of estimators will converge in distribution to one particular family of distributions: the normal. This result is one reason we study the normal so much and why investing in building intuition about it will pay off across many domains of applied work. We call this broad class of results the “central limit theorem” (CLT), but it would probably be more accurate to refer to them as “central limit theorems” since much of statistics is devoted to showing the result in different settings. We now present the simplest CLT for the sample mean.
Theorem 3.8 (Central Limit Theorem) Let $X_1, \ldots, X_n$ be iid random variables with mean $\mu$ and finite variance $\sigma^2 > 0$. Then
$$\sqrt{n}\left(\frac{\overline{X}_n - \mu}{\sigma}\right) \overset{d}{\to} N(0, 1).$$
In words: the sample mean of a random sample from a population with finite mean and variance will be approximately normally distributed in large samples. Notice how we have not made any assumptions about the distribution of the underlying random variables beyond requiring a finite mean and variance.
Why do we state the CLT in terms of the sample mean after centering and scaling by its standard error? Suppose we don't normalize the sample mean in this way. In that case, it isn't easy to talk about convergence in distribution because we know from the WLLN that $\overline{X}_n$ converges in probability to the constant $\mu$, so its limiting "distribution" is just a spike at a single point. Centering by $\mu$ and scaling by $\sqrt{n}$ prevents this collapse, leaving a non-degenerate distribution to approximate.
We can use this result to state approximations for estimators, such as $\overline{X}_n \mathrel{\dot\sim} N(\mu, \sigma^2/n)$, where $\mathrel{\dot\sim}$ means "approximately distributed as" in large samples.
Definition 3.5 An estimator $\widehat{\theta}_n$ is asymptotically normal if $\sqrt{n}(\widehat{\theta}_n - \theta) \overset{d}{\to} N(0, v)$ for some positive asymptotic variance $v$.
Example 3.7 To illustrate how the CLT works, we can simulate the sampling distribution of the (normalized) sample mean at different sample sizes. Let us repeatedly draw iid samples at each sample size, compute the normalized sample mean for each draw, and compare the resulting distributions to the standard normal.
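A sketch of such a simulation (the underlying distribution here is an illustrative choice, not necessarily the one used in the text): skewed Exponential(rate = 0.5) draws, with $\mu = 2$ and $\sigma = 2$, whose normalized sample mean should nevertheless look close to $N(0, 1)$.

```r
# CLT sketch: even with skewed Exponential(rate = 0.5) draws
# (mu = 2, sigma = 2), the normalized sample mean
# sqrt(n) * (xbar - mu) / sigma is approximately N(0, 1) for large n.
set.seed(456)
mu <- 2
sigma <- 2
n <- 500
z <- replicate(5000, {
  x <- rexp(n, rate = 0.5)
  sqrt(n) * (mean(x) - mu) / sigma
})
c(mean = mean(z), sd = sd(z))  # should be near 0 and 1
```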
There are several properties of convergence in distribution that are helpful to us.
Theorem 3.9 (Properties of convergence in distribution) Let $X_n$ and $Z_n$ be sequences of random variables such that $X_n \overset{d}{\to} X$ and $Z_n \overset{p}{\to} c$ for a constant $c$. Then:

- $h(X_n) \overset{d}{\to} h(X)$ for all continuous functions $h$.
- $X_n + Z_n$ converges in distribution to $X + c$.
- $Z_n X_n$ converges in distribution to $cX$.
- $X_n / Z_n$ converges in distribution to $X / c$ if $c \neq 0$.
We refer to the last three results as Slutsky’s theorem. These results are often crucial for determining an estimator’s asymptotic distribution.
A critical application of Slutsky's theorem is when we replace the (unknown) population variance in the CLT with an estimate. Recall the definition of the sample variance as $s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$. Because $s_n^2$ is consistent for $\sigma^2$, Slutsky's theorem implies that $\sqrt{n}(\overline{X}_n - \mu)/s_n$ converges in distribution to the same standard normal limit as when we use the true $\sigma$.
Like with the WLLN, the CLT holds for random vectors of sample means, where their centered and scaled versions converge to a multivariate normal distribution with a covariance matrix equal to the covariance matrix of the underlying random vectors of data, $\boldsymbol{\Sigma} = \mathbb{V}[\mathbf{X}_i]$.
Theorem 3.10 If $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are iid random vectors with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ with finite entries, then
$$\sqrt{n}\left(\overline{\mathbf{X}}_n - \boldsymbol{\mu}\right) \overset{d}{\to} N(\mathbf{0}, \boldsymbol{\Sigma}).$$
Notice that the limiting distribution is multivariate normal, so it describes not only the behavior of each sample mean on its own but also how the sample means covary with one another.
As with the notation alert for the WLLN, we are using shorthand here: the moment condition for the vector case means that every entry of $\boldsymbol{\Sigma}$, and thus every variance and covariance, is finite.
3.8 Confidence intervals
We now turn to an essential application of the central limit theorem: confidence intervals.
Suppose we have run an experiment with a treatment and control group and have presented readers with our single best guess about the treatment effect using the difference in sample means. We have also presented the estimated standard error of this estimate to give readers a sense of how variable it is. But none of these approaches answer a fairly compelling question: what range of values of the treatment effect is plausible given the data we observe?
A point estimate of the difference in sample means typically has 0 probability of being the exact true value, but intuitively we hope that the true treatment effect is close to our estimate. Confidence intervals make this kind of intuition more formal by instead estimating ranges of values with a fixed percentage of these ranges containing the actual unknown parameter value.
We begin with the basic definition of a confidence interval.
Definition 3.6 A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = [L_n, U_n]$, where $L_n$ and $U_n$ are statistics computed from the data, such that $\mathbb{P}(\theta \in C_n) \geq 1 - \alpha$ for all values of $\theta$.
We say that a confidence interval with this property has coverage (or coverage probability) of $1 - \alpha$; for example, a 95% confidence interval corresponds to $\alpha = 0.05$.
So a confidence interval is a random interval with a particular guarantee about how often it will contain the true value of the unknown population parameter (in our example, the true treatment effect). Remember what is random and what is fixed in this setup. The interval varies from sample to sample, but the true value of the parameter stays fixed even if it is unknown, and the coverage is how often we should expect the interval to contain that true value. The “repeating my sample over and over again” analogy can break down very quickly, so it is sometimes helpful to interpret it as giving guarantees across confidence intervals across different experiments. In particular, suppose that a journal publishes 100 quantitative articles annually, each producing a single 95% confidence interval for their quantity of interest. Then, if the confidence intervals are valid and each is constructed in the exact same way, we should expect 95 of those confidence intervals to contain the true value.
Suppose we have a 95% confidence interval computed from our actual data. It is tempting to say that the true parameter has a 95% probability of falling inside this realized interval, but in this framework the parameter is fixed: once the interval is computed, it either contains the truth or it does not. The 95% describes the procedure across repeated samples, not any single realized interval.
In most cases, we will not be able to derive exact confidence intervals but rather confidence intervals that are asymptotically valid, which means that if we write the interval as a function of the sample size, $C_n$, then its coverage approaches the nominal level as the sample size grows: $\mathbb{P}(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.
We can show asymptotic coverage for most confidence intervals since we usually rely on large-sample approximations based on the central limit theorem.
3.8.1 Deriving confidence intervals
To derive confidence intervals, consider the standard formula for the 95% confidence interval of the sample mean, $\overline{X}_n \pm 1.96\, \widehat{\mathrm{se}}[\overline{X}_n]$, where $\widehat{\mathrm{se}}[\overline{X}_n]$ is the estimated standard error of the sample mean. Where do the 1.96 and the 95% come from?
Suppose we have an estimator, $\widehat{\theta}_n$, that is asymptotically normal, so that $(\widehat{\theta}_n - \theta)/\widehat{\mathrm{se}}[\widehat{\theta}_n]$ is approximately standard normal in large samples. Because 95% of the standard normal's probability mass lies between $-1.96$ and $1.96$, the interval $\widehat{\theta}_n \pm 1.96\, \widehat{\mathrm{se}}[\widehat{\theta}_n]$ will contain $\theta$ in roughly 95% of samples.
How can we generalize this to an arbitrary confidence level, $1 - \alpha$? We replace 1.96 with the critical value $z_{\alpha/2}$, the value that puts probability $\alpha/2$ in the upper tail of the standard normal distribution. In R, we can compute these critical values with qnorm():
```r
## alpha = 0.1 for 90% CI
qnorm(0.1 / 2, lower.tail = FALSE)
#> [1] 1.644854
```
As a concrete example, then, we could derive a 90% asymptotic confidence interval for the sample mean as $\overline{X}_n \pm 1.64\, \widehat{\mathrm{se}}[\overline{X}_n]$.
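Putting the pieces together, here is a sketch of computing such a 90% interval end to end (the simulated data, with true mean 2, are an illustrative assumption):

```r
# A 90% asymptotic confidence interval for a sample mean, built from the
# qnorm() critical value (data simulated for illustration; true mean is 2).
set.seed(7)
x <- rexp(500, rate = 0.5)                 # illustrative sample
xbar <- mean(x)
se_hat <- sd(x) / sqrt(length(x))          # estimated standard error
z90 <- qnorm(0.1 / 2, lower.tail = FALSE)  # roughly 1.64
ci_90 <- c(lower = xbar - z90 * se_hat, upper = xbar + z90 * se_hat)
ci_90
```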
3.8.2 Interpreting confidence intervals
A very important point is that the interpretation of confidence is how the random interval performs over repeated samples. A valid 95% confidence interval is a random interval that contains the true population value in 95% of samples. Simulating repeated samples helps clarify this.
Example 3.8 Suppose we are repeatedly taking iid samples of a fixed size from a population with true mean 1 and, for each sample, performing the following steps:

- Draw a sample of the given size from the population.
- Calculate the 95% confidence interval for the mean, $\overline{X}_n \pm 1.96\, \widehat{\mathrm{se}}[\overline{X}_n]$.
- Plot the intervals along the x-axis and color them blue if they contain the truth (1) and red if not.
Figure 3.4 shows 100 iterations of these steps. We see that, as expected, most calculated CIs do contain the true value. Five random samples produce intervals that fail to include 1, an exact coverage rate of 95%. Of course, this is just one simulation, and a different set of 100 random samples might have produced a slightly different coverage rate. The guarantee of the 95% confidence intervals is that if we were to continue to take these repeated samples, the long-run frequency of intervals covering the truth would approach 0.95.
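We can push the simulation further to see the long-run guarantee emerge. A sketch (the sample size, distribution, and number of repetitions here are illustrative assumptions, not the ones behind Figure 3.4):

```r
# Coverage simulation: repeated samples with true mean 1; record whether
# each 95% CI for the mean covers the truth. The long-run fraction of
# covering intervals should settle near 0.95.
set.seed(99)
covers <- replicate(1000, {
  x <- rnorm(500, mean = 1, sd = 2)   # illustrative sampling distribution
  se_hat <- sd(x) / sqrt(length(x))
  lower <- mean(x) - 1.96 * se_hat
  upper <- mean(x) + 1.96 * se_hat
  lower <= 1 && 1 <= upper            # does this interval cover the truth?
})
mean(covers)  # empirical coverage rate across 1000 samples
```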
3.9 Delta method
Suppose that we know that an estimator follows the CLT, and so we have $\sqrt{n}(\widehat{\theta}_n - \theta) \overset{d}{\to} N(0, v)$. What can we say about the asymptotic distribution of a smooth function of that estimator, $h(\widehat{\theta}_n)$? The delta method provides the answer.
Theorem 3.11 If $\sqrt{n}(\widehat{\theta}_n - \theta) \overset{d}{\to} N(0, v)$ and $h(\cdot)$ is continuously differentiable at $\theta$ with $h'(\theta) \neq 0$, then
$$\sqrt{n}\left(h(\widehat{\theta}_n) - h(\theta)\right) \overset{d}{\to} N\left(0,\; h'(\theta)^2\, v\right).$$
Understanding what is happening here provides intuition as to when this might go wrong. Why do we focus on continuously differentiable functions, $h(\cdot)$? Because near $\theta$ such a function is approximately linear, $h(\widehat{\theta}_n) \approx h(\theta) + h'(\theta)(\widehat{\theta}_n - \theta)$, and linear functions of asymptotically normal quantities are themselves asymptotically normal, with the slope $h'(\theta)$ scaling the variance.
Example 3.9 Let's return to the iid sample, $X_1, \ldots, X_n$, and apply the delta method to a smooth function of the sample mean.
Example 3.10 What about estimating the standard deviation rather than the variance? Since the square root is continuously differentiable away from zero, the delta method also delivers the asymptotic distribution of $s_n = \sqrt{s_n^2}$.
Like all of the results in this chapter, there is a multivariate version of the delta method that is incredibly useful in practical applications. For example, suppose we want to combine two different estimators (or two different estimated parameters) to estimate another quantity. We now let $\widehat{\boldsymbol{\theta}}_n$ be a vector of estimators that is jointly asymptotically normal around the parameter vector $\boldsymbol{\theta}$.
Theorem 3.12 Suppose that $\sqrt{n}(\widehat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}) \overset{d}{\to} N(\mathbf{0}, \boldsymbol{\Sigma})$ and that $h(\cdot)$ is continuously differentiable at $\boldsymbol{\theta}$ with gradient $\nabla h(\boldsymbol{\theta})$. Then
$$\sqrt{n}\left(h(\widehat{\boldsymbol{\theta}}_n) - h(\boldsymbol{\theta})\right) \overset{d}{\to} N\left(0,\; \nabla h(\boldsymbol{\theta})'\, \boldsymbol{\Sigma}\, \nabla h(\boldsymbol{\theta})\right).$$
This result follows from the approximation above plus rules about variances of random vectors. Recall that for any compatible matrix of constants, $\mathbf{A}$, we have $\mathbb{V}[\mathbf{A}\mathbf{X}] = \mathbf{A}\, \mathbb{V}[\mathbf{X}]\, \mathbf{A}'$; applying this with the gradient playing the role of $\mathbf{A}$ yields the asymptotic variance in the theorem.
The delta method is handy for generating closed-form approximations for asymptotic standard errors, but the math is often quite complex for even simple estimators. It is usually more straightforward for applied researchers to use computational tools such as the bootstrap to approximate the needed standard errors. The bootstrap has the trade-off of taking more computational time to implement compared to the delta method, but it is more easily adaptable across different estimators and domains.
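To see the two approaches side by side, here is a sketch comparing a nonparametric bootstrap standard error with the delta-method formula for one nonlinear function of a sample mean (the estimator, $\log(\overline{X}_n)$, and the simulated data are illustrative choices, not from the text):

```r
# Bootstrap vs. delta method for the standard error of log(xbar).
# Delta method: se(log(xbar)) is approximately sd(x) / (mean(x) * sqrt(n)),
# since the derivative of log at mu is 1 / mu.
set.seed(2024)
x <- rexp(500, rate = 0.5)  # illustrative data; population mean is 2
boot_reps <- replicate(2000, {
  resample <- sample(x, replace = TRUE)  # resample the data with replacement
  log(mean(resample))                    # recompute the estimator
})
boot_se <- sd(boot_reps)
delta_se <- sd(x) / (mean(x) * sqrt(length(x)))
c(bootstrap = boot_se, delta_method = delta_se)  # should be close
```

The two standard errors agree closely here; the bootstrap's advantage is that the same resampling recipe works even when the derivative calculation is tedious or intractable.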
3.10 Summary
In this chapter, we covered asymptotic analysis, which considers how estimators behave as we feed them larger and larger samples. While we never actually have infinite data, asymptotic results provide approximations that work quite well in practice. A consistent estimator converges in probability to a desired quantity of interest. We saw several ways of establishing consistency, including the Law of Large Numbers for the sample mean, which converges in probability to the population mean. The Central Limit Theorem tells us that the sample mean will be approximately normally distributed when we have large, iid samples. We also saw how the continuous mapping theorem and Slutsky’s theorem allow us to determine asymptotic results for a broad class of estimators. Knowing the asymptotic normality of an estimator allows us to derive confidence intervals that are valid in large samples. Finally, the delta method is a general tool for finding the asymptotic distribution of an estimator that is a function of another estimator with a known asymptotic distribution.
In the next chapter, we will leverage these asymptotic results to introduce another important tool for statistical inference: the hypothesis test.