Chris Blattman has a fine rant about matching as a statistical method for program evaluation. That rant in turn engendered a response from Andrew Gelman and a reply from Blattman. This post is my response to both of them and to the broader questions their posts raise.
Perhaps the nice thing about all this is that the three of us generally agree on the main point: matching is not a magic bullet. Just because you estimate a propensity score and then run psmatch2 in Stata does not make the selection on observed variables assumption any more true than it was when you were running a linear regression of the outcome variable on the conditioning variables and a treatment indicator.
In my graduate applied econometrics class, we are just finishing the discussion of matching and weighting methods. In my lectures, I make the point that parametric linear regression and matching methods differ in four main ways:
1. Matching relaxes the functional form restrictions inherent in parametric linear regression as it is normally used in applied work, which is to say with each conditioning variable entered linearly and few, if any, higher-order terms.
2. Matching focuses attention on the so-called overlap or "common support" condition, which considers whether there are untreated units that "look like" each treated unit in terms of their observed characteristics. With a parametric model, it is easy to rely on the functional form to fill in where the data are absent without knowing that you are doing so. Matching makes that much harder.
3. In one of the "usual" notations, parametric linear regression requires E(U | X, D) = 0 while matching requires only E(U | X, D = 1) = E(U | X, D = 0), where U is the "error" term, X is the vector of conditioning variables and D is the treatment indicator. In words, matching needs the conditional mean of U given X to be the same for treated and untreated units, but does not need it to equal zero. This difference in conditions may affect the set of reasonable X. For example, a lagged Y might satisfy the matching condition but not the parametric linear regression condition.
4. As noted by one of the commenters at Gelman's blog, in a heterogeneous effects world, parametric linear regression and matching have different estimands. Matching estimates the impact of treatment on the treated (in the usual case) while parametric linear regression estimates a different weighted average of treatment effects. Angrist has been making this point for a while - see his 1998 Econometrica paper and his new Mostly Harmless Econometrics book with Steve Pischke - but it remains under-appreciated within economics. Perhaps oddly, it is widely understood by sociologists.
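To make point 4 concrete, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption of mine rather than anything taken from the posts under discussion: the single binary X, the cell counts, the stratum-specific effects of 1 and 3, and the function names. The outcome has no noise, so any gap between the two numbers comes purely from the different weights the two estimators put on the strata.

```python
import numpy as np


def build_sample():
    """Build a noise-free sample with two strata of a binary X (hypothetical numbers).

    Stratum X=0: 500 treated, 500 untreated, treatment effect 1.
    Stratum X=1: 900 treated, 100 untreated, treatment effect 3.
    """
    cells = [(0, 1, 500), (0, 0, 500), (1, 1, 900), (1, 0, 100)]  # (x, d, count)
    x = np.concatenate([np.full(n, xv) for xv, _, n in cells]).astype(float)
    d = np.concatenate([np.full(n, dv) for _, dv, n in cells]).astype(float)
    effect = np.where(x == 1, 3.0, 1.0)  # heterogeneous treatment effect
    y = 2.0 * x + effect * d             # untreated outcome depends on X only
    return x, d, y


def ols_treatment_coef(x, d, y):
    """Coefficient on D from a linear regression of Y on a constant, X and D."""
    design = np.column_stack([np.ones_like(x), x, d])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[2]


def exact_matching_att(x, d, y):
    """Exact matching on X: within-stratum mean gaps, weighted by treated shares."""
    treated = d == 1
    att = 0.0
    for xv in np.unique(x):
        cell = x == xv
        gap = y[cell & treated].mean() - y[cell & ~treated].mean()
        att += gap * (cell & treated).sum() / treated.sum()
    return att


if __name__ == "__main__":
    x, d, y = build_sample()
    print("OLS coefficient on D:", ols_treatment_coef(x, d, y))
    print("Exact-matching ATT:  ", exact_matching_att(x, d, y))
```

With these made-up counts, exact matching returns the treated-weighted average (500 × 1 + 900 × 3) / 1400, about 2.29, while the regression coefficient is the average weighted by n × p(1 − p) within each stratum, (250 × 1 + 90 × 3) / 340, about 1.53. Neither number is wrong; they answer different questions.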
A few other points:
1. Thinking about matching as a way of selecting comparison observations is really just a special case of thinking about matching as a weighting estimator. It is a special case because all the weights are integers (or, in the case of single nearest neighbor matching without replacement, they are all one of just two integers: 0 and 1). See equation (10) of Smith and Todd (2005) Journal of Econometrics.
2. One reason to prefer thinking about matching as a version of weighting is that it pushes you away from doing nearest neighbor matching, which the literature pretty clearly shows to have inferior performance relative to its alternatives in terms of mean squared error. For the latest on that literature, see the papers by Busso, DiNardo (get well soon!) and McCrary on McCrary's web page at Berkeley law school.
3. One reason to prefer thinking about matching as an application of non-parametric regression, which is how I teach it in my class, rather than in terms of comparison group selection, is that it makes clear that matching fits much more neatly into our existing stock of econometric and statistical knowledge than it might at first seem.
4. I don't think we fully understand the statistical properties of matching treated as a "pre-processor" in the sense of this paper by Ho, Imai, King and Stuart, which Gelman seems to have in mind in part of his discussion. We do know that doing some statistical procedure on a sample obtained by some sort of matching and not taking note of the pre-processing in the construction of the standard errors will make for misleading inferences.
5. Sometimes you can learn about what conditioning variables are required to make unconfoundedness hold in particular substantive contexts by running experiments. Indeed, to me this is one of the major values of experiments. For this reason, I argue that experiments should often be accompanied by parallel collection of the data required for a non-experimental evaluation designed to shed light on the variables that are, and are not, required for "selection on observed variables" to hold in a given context. For instance, we have learned a great deal about the variables required for selection on observed variables to hold in the context of evaluating job training programs in precisely this way. See, e.g., Heckman, Ichimura, Smith and Todd (1998) Econometrica (gated).
6. Contra Gelman, what you want is not all the variables that determine participation, but rather all the variables that determine both (not either but both) participation and outcomes. A variable that affects participation and not outcomes (other than through an effect on participation) is an instrument. If you have one, you should be using it to do an instrumental variables analysis. You do not want to be in the business of matching on instruments. Also, if you literally had all of the variables that determine participation, you could not do matching, because there would be no common support. Put more prosaically, in such a case, all of the treated observations would have estimated propensity scores of one and all the untreated units would have estimated propensity scores of zero.
7. I really like Gelman's point about the two tribes: those who think unobserved variables are always important, so that selection on observed variables is always wrong enough to lead to substantively important bias, and those who think that selection on observed variables can be true enough in particular, well-motivated contexts to yield reasonable results. I count myself a member of the second tribe, but have many (economist) friends in the first tribe. There is also a third tribe, which I think of as the "benevolent deity" tribe. They believe that whatever variables happen to be in the data set they are using suffice to make "selection on observed variables" hold. This tribe has a lot of members, particularly outside of economics. Indeed, it is probably the largest of the three tribes in the academy as a whole. If you do not believe this, read the chapters in Linda Waite's The Case for Marriage book that survey literatures untouched by economists.
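Returning to the first point in this list, the equivalence between "selecting comparison observations" and weighting can be sketched in a few lines of Python with NumPy. The scores, outcomes and function name below are hypothetical values chosen for illustration, and the matching is single nearest neighbor with replacement on a scalar score; this is a sketch of the idea, not the estimator from any particular paper.

```python
import numpy as np


def nn_match_weights(score_treated, score_untreated):
    """Single nearest neighbor matching with replacement on a scalar score.

    Returns the index of the matched untreated unit for each treated unit and
    the implied integer weight (number of times used) for each untreated unit.
    """
    dist = np.abs(score_treated[:, None] - score_untreated[None, :])
    nearest = dist.argmin(axis=1)
    weights = np.bincount(nearest, minlength=len(score_untreated))
    return nearest, weights


# Hypothetical scores and outcomes, chosen so that there are no ties.
score_t = np.array([0.22, 0.50, 0.52, 0.90])
score_u = np.array([0.10, 0.50, 0.60, 0.30])
y_t = np.array([3.0, 4.0, 5.0, 6.0])
y_u = np.array([1.0, 2.0, 3.0, 4.0])

nearest, weights = nn_match_weights(score_t, score_u)

# "Comparison group selection" view: average the matched-pair differences.
att_pairs = (y_t - y_u[nearest]).mean()

# "Weighting" view: treated mean minus an integer-weighted untreated mean.
att_weighted = y_t.mean() - (weights * y_u).sum() / len(y_t)

if __name__ == "__main__":
    print("weights on untreated units:", weights)
    print("ATT from matched pairs:    ", att_pairs)
    print("ATT from weighting:        ", att_weighted)
```

The two calculations give the same number by construction, and the weights are non-negative integers that sum to the number of treated units. Other matching estimators, such as kernel matching, differ only in replacing those integers with smoother weights.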
Hat tip: Jess Goldberg