A User’s Guide to Statistical Inference and Regression
Preface
This book, like many before it, will try to teach you statistics. The field of statistics describes how we learn about the world using quantitative data. In the social sciences, an increasing share of empirical studies use statistical methods to provide evidence for or against conceptual arguments. And, while it is possible to conduct quantitative research without understanding statistics at an intuitive level, it is not a good idea. Quantitative research involves a host of choices about the model to use, variables to include, tuning parameters to set, assumptions to make, and so on. Without a deep understanding of statistics, you may find these choices bewildering, and you may simply (and possibly erroneously) yield to the default settings of your statistical software.
The goal of this book is to give you the foundation to make methodological choices for your specific application with knowledge and with confidence. The material is intended for first-year PhD students in political science, but it may be of interest more broadly.
We will focus on two key goals:
Understand the basic ways to assess estimators. With quantitative data, we often want to make statistical inferences about some unknown feature of the world. We use estimators (which are just ways of summarizing our data) to estimate these features. This book will introduce the basics of this task at a general enough level to be applicable to almost any estimator that you are likely to encounter in empirical research in the social sciences. We will also cover major concepts such as bias, sampling variance, consistency, and asymptotic normality, which are so common across such a large swath of (frequentist) inference that understanding them at a deep level will yield an enormous return on your time investment (a brief sketch of the first two follows these two goals). Once you understand these core ideas, you will have a language to analyze any fancy new estimator that pops up in the next few decades.
Apply these ideas to the estimation of regression models. This book will apply these ideas to one particular social science workhorse: regression. Many methods either use regression estimators like ordinary least squares or extend them in some way. Understanding how these estimators work is vital for conducting research, for reading and reviewing contemporary scholarship, and, frankly, for being a good and valuable colleague in seminars and workshops. Regression and regression estimators also provide an entry point for discussing parametric models as approximations, rather than as rigid assumptions about the truth of a given specification.
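As a rough preview of the first goal (using generic notation here; the formal definitions appear in Part I), suppose \(\hat{\theta}\) is an estimator of some unknown quantity \(\theta\). Two of the properties listed above can be written as

$$
\text{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta, \qquad V[\hat{\theta}] = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big],
$$

where the bias measures whether the estimator is right on average across repeated samples and the sampling variance \(V[\hat{\theta}]\) measures how much it varies from sample to sample.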
Why write a book on statistics and regression when so many already exist? While some texts at this level exist in statistics and economics, they tend to focus on applications and models less relevant to other social sciences. This book attempts to fill that gap. It also introduces a fairly high level of mathematical sophistication that will challenge you and push you to develop stronger foundations in the material.
Roadmap
This book has two major parts. Part I introduces the basics of statistical inference.
We start in 1 Design-based Inference by demonstrating basic concepts of estimation and inference from the design-based perspective in which we sample from a fixed, finite population, and all uncertainty comes from randomness over who is and is not included in the sample. This framework for inference has deep roots in the statistical literature and provides a great deal of intuition for how estimation and uncertainty work in simple settings. We will discuss how to use design-based inference to estimate features of the population from samples when the analyst knows the exact sampling design. Unfortunately, researchers often lack this knowledge about how their data came to be, limiting the usefulness of this approach.
2 Model-based inference introduces a more flexible approach to estimation in which the researcher posits a probability model for how the data came to be. This book focuses on models that posit “independent and identically distributed” data. The chapter describes how estimation and inference proceed under these models and also introduces a broad class of estimators based on the plug-in principle.
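As a rough sketch of the plug-in principle (generic notation; the chapter develops the idea properly): we write the quantity of interest as a feature of the population distribution \(F\) and then plug in the empirical distribution of the sample, \(\hat{F}_n\). For the population mean, for example,

$$
\theta = E[X] = \int x \, dF(x) \quad\Longrightarrow\quad \hat{\theta} = \int x \, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} X_i,
$$

so the sample mean is the plug-in estimator of the population mean, and the same logic extends to variances, quantiles, and many other quantities.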
These two chapters focus on the finite-sample properties of different estimation techniques, but we can say more about an estimator if we consider how it behaves as the sample size grows. 3 Asymptotics introduces this kind of asymptotic analysis. It covers the core results of asymptotic theory, such as the law of large numbers, the central limit theorem, and the delta method, and also shows why these results are important for statistical inference. In particular, the chapter shows how these results enable the creation of asymptotically valid confidence intervals.
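As a preview of why these results matter (a standard construction, not the only one in the chapter): when an estimator \(\hat{\theta}\) is asymptotically normal with estimated standard error \(\widehat{\text{se}}(\hat{\theta})\), the central limit theorem justifies confidence intervals of the familiar form

$$
\hat{\theta} \pm z_{1-\alpha/2}\, \widehat{\text{se}}(\hat{\theta}), \qquad \text{for example,} \quad \hat{\theta} \pm 1.96\, \widehat{\text{se}}(\hat{\theta})
$$

for a 95% interval, whose coverage approaches the nominal level as the sample size grows.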
4 Hypothesis tests wraps up Part I of the book by introducing statistical inference with hypothesis testing. This chapter shows how to build hypothesis tests and provides intuition for all their aspects. We also cover power analyses for planning studies and the connection between confidence intervals and hypothesis tests.
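A minimal sketch of that recipe (generic notation; the chapter gives the full treatment): to test a null hypothesis \(H_0: \theta = \theta_0\), we compute a standardized test statistic and compare it to a reference distribution,

$$
t = \frac{\hat{\theta} - \theta_0}{\widehat{\text{se}}(\hat{\theta})}, \qquad \text{rejecting } H_0 \text{ at the 5\% level when } |t| > 1.96,
$$

and the connection to confidence intervals comes from the fact that this test rejects exactly when \(\theta_0\) falls outside the corresponding 95% interval.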
Part II of the book focuses on one particular estimator of great importance to quantitative social sciences: the least squares estimator.
5 Linear regression begins by describing exactly what quantity of interest we are targeting when we discuss “linear models.” In particular, we discuss how a population best linear predictor exists even if the relationship between two variables is nonlinear. This provides a coherent basis for linear regression estimation as a linear approximation to a potentially nonlinear function. The chapter also shows how to interpret the coefficients in these linear regression models.
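For readers who want a preview of this target (the chapter derives it carefully; the notation here is generic): with outcome \(Y\) and covariate vector \(X\), the population best linear predictor is defined by the coefficients that minimize mean squared prediction error,

$$
\beta = \underset{b}{\arg\min}\; E\big[(Y - X^{\top}b)^2\big] = E[XX^{\top}]^{-1} E[XY],
$$

which is well defined (provided \(E[XX^{\top}]\) is invertible) whether or not the conditional expectation \(E[Y \mid X]\) is actually linear.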
6 The mechanics of least squares introduces the more mechanical properties of the least squares estimator: how the estimator is constructed, its geometrical interpretation, and how influential observations may affect the estimates it returns. This chapter introduces the least squares estimator in matrix form and provides key intuition for understanding this compact notation.
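A brief preview of that compact notation (standard, and covered in detail in the chapter): stacking the \(n\) observations into an \(n \times k\) matrix \(\mathbf{X}\) and an \(n\)-vector \(\mathbf{y}\), the least squares estimator minimizes the sum of squared residuals and has the closed-form solution

$$
\hat{\beta} = \underset{b}{\arg\min}\; (\mathbf{y} - \mathbf{X}b)^{\top}(\mathbf{y} - \mathbf{X}b) = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y},
$$

assuming \(\mathbf{X}\) has full column rank so that \(\mathbf{X}^{\top}\mathbf{X}\) is invertible.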
Finally, 7 The statistics of least squares describes the statistical properties of the least squares estimator. The chapter shows how modeling assumptions affect the kinds of properties we can obtain. The weakest modeling assumptions allow us to derive the surprisingly strong asymptotic properties of least squares that we depend on in most settings. The chapter then shows how stronger assumptions such as linearity and normally distributed errors can provide even stronger results but that they do so at the expense of potential model misspecification.
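To preview that contrast in symbols (standard results, stated with generic notation): under relatively weak conditions, least squares is consistent and asymptotically normal,

$$
\sqrt{n}\,\big(\hat{\beta} - \beta\big) \xrightarrow{d} N(0, \mathbf{V}),
$$

which justifies the usual large-sample confidence intervals and tests, while the stronger model with linearity and normal, homoskedastic errors delivers an exact finite-sample distribution, \(\hat{\beta} \mid \mathbf{X} \sim N\big(\beta, \sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}\big)\), at the cost of assumptions that may be misspecified.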
Acknowledgements
Much of how I approach this material comes from Adam Glynn, for whom I was a teaching fellow during graduate school. Thanks to the students of Gov 2000 and Gov 2002 over the years for helping me refine the material in this book. Very special thanks also to those who have provided valuable feedback, including Zeki Akyol, Noah Dasanaike, Maya Sen, and Jarell Cheong Tze Wen.
Colophon
You can find the source for this book at https://github.com/mattblackwell/gov2002-book. Any typos or errors can be reported at https://github.com/mattblackwell/gov2002-book/issues. Thanks for reading.
This is a Quarto book. To learn more about Quarto books visit https://quarto.org/docs/books.