Average Match Function paper

Mar 11, 2016

Guido Imbens, Geert Ridder and I just released the working paper “Identification and efficiency bounds for the average match function under conditionally exogenous matching”. A pdf copy can be found here.

This is one of those papers that took many years just to get into draft form. Thankfully I remained excited about working on it for the duration. That said, it does feel good to get a draft done!

The paper builds on our earlier work on the econometrics of assignment problems (a survey of this work can be found here.

To understand the types of empirical questions motivating the new paper it is probably easiest to discuss few examples. Consider an educational setting, where teachers must be matched with classrooms of students. Teachers vary in ways observable to the policy-maker, as do students. Is there some way a social planner could optimally assign teachers to classrooms (e.g., to maximize average measured student achievement or some other outcome)? Once one articulates the problem it is easy to think of other examples. Companies assign managers to teams, the Marines, lieutenants to platoons, even the labor market is increasingly modelled in terms of matches between workers and firms. Examples of joint production, where two or more agents come together to produce an “outcome”, are ubiquitous in economics. Surprisingly there is not a lot of methodological work on problems of this type. Instead empirical researchers often try to shoehorn methods developed for single-agent models to such settings.

At first modelling the effects of reallocations appears to be relatively straightforward, but an important complication arises relative to single agent models. In a standard single agent model we posit heterogenous agents (“firms”) choosing the level of a homogenous input (“capital”) which determines the outcome (“output”). In the standard program evaluation problem, homogeneity of the treatment, sometimes called SUTVA, is an important assumption.

In a two-agent problem, where interest centers on the average outcome associated with different types of matches, (e.g., when inexperienced teachers are assigned to small classes), one needs to confront the fact that the subpopulation of inexperienced teachers is likely rather heterogenous in terms of other atttributes. Some inexperienced teachers may have lots of prior experience in other types of work, others very little. Teachers adopt differing modes of instruction, have different educational backgrounds and so on. Not all of these differences will be observed by the econometrician.

Likewise the subset of, say, small classrooms may also exhibit lots of diversity in terms of student composition. These classrooms vary in their racial, socioeconomic and gender mix. They also will vary in prior measured achievement and in many other ways not easily observed.

Both agents in this setting, teachers and classrooms, are, heterogenous. No two inexperienced teachers are alike and all small classrooms are different! In turns out that this rather obvious point makes the extension of methods of covariate adjustment familiar from the single agent case to the two agent case non-trivial. Even worse, my experience suggests that intuitions from the single agent case sometimes just lead to dead ends.

One of the main contributions of the paper (at least in my view) is to establish a notation and language to help one think clearly about how to measure the expected effect, for example, of different types of classroom-to-teacher matches. Once we had that infrastructure in place most of the formal results in the paper, at least in skeletal form, followed pretty quickly.

The paper is not long (about 20 pages of main text), but like many modern econometrics papers it is a bit dense. In this post may aim is to convey a sense of the mechanics of the covariate adjustment process introduced in the paper. I won’t bother outlining all the assumptions needed for this process to result in a “structural” or “causal” object and instead will just proceed heuristically. My hope is that a few simple worked examples will complement what you can find in the paper.

It’s likely that what follows won’t make perfect sense after an initial read. If this is the case I suggest quickly reading the paper before coming back to this post. Hopefully after a couple of iterations back-and-forth the key ideas will become clear. I believe the underlying ideas introduced in our paper are quite natural, but they took some time to register in my own mind, in part, because I found no good analogs in prior work.

Covariate adjustment in single agent models

In order to explain what we do in our paper it is helpful to begin by reviewing a familiar application of covariate adjustment: the adjustment for differences in pre-treatment characteristics across treatment $T=1$ and control $T=0$ units.

Let $ T\in\left\{ 0,1\right\} $ be a binary treatment indicator. Treatment could indicate participation in some sort of training program, or receipt of a particular medical treatment. These details are not important for what follows. Assume that the population is evenly divided between, say, men and women. Average outcomes under treatment and control for men and women, as well as the gender-specific average treatment effect, are given in the following table:

	T=0	T=1	Treatment Effect
Men	0	0	0
Women	0	2	2

The table indicates that, while men do not benefit from the treatment, on average, women do, averaging an outcome of 2 under treatment, but zero under control.

The Average Treatment Effect (ATE) in this population is

$$\mathrm{ATE}=\frac{1}{2}\times0+\frac{1}{2}\times2=1$$

reflecting the weighted average of the strong benefit experienced by half the population (women) and the zero benefit experienced by the balance of the population (men).

Now consider the structure of actual outcomes. If we assume, plausibly enough, that women gravitate towards getting the treatment due to its high average benefit to them, while men do not (reflecting the absence of a benefit, on average, for them), then it is possible that the observed average difference in the outcome across the treatment and control subpopulations will be very different from the ATE.

To be specific assume that the joint distribution of gender and treatment status is given by:

	T=0	T=1
Men	0.4	0.1
Women	0.1	0.4

That is, while only 20 percent of men (0.1/0.5 = 0.2) receive treatment, 80 percent of women do. Under this treatment assignment pattern the observed average difference between outcomes in treatment and control will be 1.6; which is much more than the ATE of 1. This reflects what is generally called selection bias or confounding. This bias arises because those individuals (i.e., women) who are more favorably affected by the treatment disproportionately seek it out.

Covariate adjustment in multi agent models

The approach to covariate adjustment outlined above has a long history in statistics (see this paper for a historical perspective). Sometimes the motivation for covariate adjustment is to conduct causal inference, but often the motivation is simply to “adjust for” observed differences. As an example demographers routinely age-adjust mortality rates. The view being that it is more meaningful to compare mortality rates in Palm Beach, FL and San Francisco, CA after first adjusting for differences in the age distribution in these two cities. As a side note, I think covariate adjustment is often an interesting and important step in the social science research process, even if such adjustments do not ultimately justify causal statements.

Our paper introduces a method of covariate adjustment suitable for what we call multi-agent models. We specifically consider two-agent models that arise in, for example, one-to-one matching problems. In the paper we develop connections between our approach to covariate adjustment and recent structural models of matching (see this paper for an important example). Indeed, I view making these connections another main contribution of our paper. However here I’ll maintain the narrow focus on our approach to covariate adjustment.

In the two-agent setting there are, naturally, two populations. Call the first population teachers. Teachers have – to keep things simple – just two observed binary attributes. For concreteness we will let $W\in\left\{ 0,1\right\}$ indicate whether a teacher has two or more years of teaching experience (i.e., is “experienced”), $W=1$ , or less, $W=0$ (i.e., is “inexperienced”). The additional covariate $R$ equals one for teachers who, say, scored in the top half of their cohort on their licensure exam and zero otherwise. There are four different configurations of covariate values for teachers. The frequency with which we observe each of these four configurations is given in the following table:

	R=0	R=1
W = 0	0.25	0.25
W = 1	0.25	0.25

To simplify the calculations I just assumed that all four combinations are equally likely. In practice there will often be dependence between $W$ and $R$ . Such dependence is interesting, allowed for in our paper, but makes following the train of logic below much more difficult. Indeed, perhaps after reading this post, you will appreciate that it is somewhat surprising that covariate adjustment is needed at all when $W$ and $R$ are independent (as assumed in this example, but not in general).

The second population consists of classrooms. Classrooms may have a small number of students (say less than 20), $X = 1$ , or a “large” number of students, $X = 0$ . Classrooms may also be in the top half of the distribution in terms of measured prior achievement, $S=1$ , or the bottom half, $S=0$ . Again, to keep thing simple, I just assume that all four combinations are equally likely:

	S=0	S=1
X = 0	0.25	0.25
X = 1	0.25	0.25

What does the econometrician observe? She observes a specific matching of teachers to classrooms and an associated outcome for each match (e.g., the classroom average end-of-year achievement test score).

Now say a superintendent wanted to compare average student achievement when inexperienced $W=0$ teachers are assigned to small $X=1$ versus regular $X=0$ classrooms. Perhaps she is interested in a change in the teacher assignment rule that would disproportionely assign inexperienced teachers to small classrooms.

For any given $W = w$ to $X = x$ type of teacher-to-classroom assignment there are four types of “sub-matches”. For example a low licensure score $R=0$ teacher can be assigned to a class of low prior achievement $S=0$ or high prior achievement $S=1$ . Consider the set of inexperienced $(X=0)$ teachers in small $(W=1)$ classrooms. Average end-of-year test scores for each of the four possible combinations of $R$ and $S$ are given in the following table.

	S=0	S=1
R = 0	0	2
R = 1	2	8

The table suggests the presence of complementarity between a teacher’s licensure score and high prior student achievement. The students of teachers with higher licensure scores tend to perform better on end-of-year tests (compare rows 2 and 1 of the table above), but this effect is especially marked when assigned to a classroom consisting of students with high prior achievement. Likewise students with high levels of prior achievement tend to do better on end-of-year tests than students with low levels of prior achievement (compare columns 2 and 1, respectively of the table), but again this effect is especially marked when matched to a teacher with a high licensure score.

Lets assume the superintendent is intested end-of-year test scores when a “typical” (i.e., randomly sampled) inexperienced teacher is assigned to a “typical” (i.e., randomly sampled) small classroom. To be more precise we are interested in the expected outcome when a randomly drawn inexperienced teacher is matched with an independently randomly drawn small classroom. In the paper we call this object, which is a function of $W=w$ and $X=x$ the Average Match Function (AMF).

At this stage you might ask why the superintendent doesn’t also condition her assignment on a teacher’s licensure school and prior classroom achievement? In practice she may want to do this, however in the paper we focus on policies which involve the characteristics $W$ and $X$ alone. The role of $R$ and $S$ is to facilitate bias reduction. This is analogous to the program evaluation problem, where the researcher’s interests is in the ATE, not the treatment effect conditional on covariates.

Let $q(w,x,r,s)$ be the expected end-of-year average test score when a teacher with attributes $W=w$ and $R=r$ is paired with a classroom with attributes $X=x$ and $S=s$. In mathematics the superindendent would like to know

$$\beta\left(w,x\right)=\sum_{r=0,1}\sum_{s=0,1}q(w,x,r,s)\times\Pr\left(\left.R=r\right|W=w\right)\times\Pr\left(\left.S=s\right|X=x\right)$$

Note the average in this equation is over the product of two distributions. In our simple example there is no dependence between $W$ and $R$ and $X$ and $S$ , but there could be in general. This is why the average is over the product of two conditional distributions. In the paper we show that the estimand above, under a “conditional exogenous matching” assumption, has a structural, or causal, interpretation. I want to gloss over this aspect now, although I think it is quite important for evaluating what is, and what is not, learned from specific empirical work.

Using the numbers in our example we can evaluate the right-hand-side of $\beta\left(w,x\right)$ as

$$\beta\left(w,x\right)=0\frac{1}{2}\frac{1}{2}+2\frac{1}{2}\frac{1}{2}+2\frac{1}{2}\frac{1}{2}+4\frac{1}{2}\frac{1}{2}=3.$$

When the “typical” inexperienced teacher is matched with a “typical” small classroom we would expect end-of-year test scores to average 3.

In practice, however, it is unlikely that, among the subset of classrooms with inexperienced teachers and small numbers of students, teachers and classes are paired-up randomly. Given the pattern of complementarity between teacher licensure score and prior student achievement assumed above, it seems plausible to think that, among inexperienced teachers assigned to small classes, high licensure score teachers disproportionately match with classrooms with high prior achievement. We are worried about sorting.

Imagine that inexperienced $W=0$ teachers are equally divided between small $X=1$ and large classrooms and likewise for experienced $W=1$ teachers. Further, to make things a bit more dramatic, we assume that this equal division occurs randomly. This ensures, for example, that a teacher’s licensure score does not predict whether her class is small or large.

However, within a given $W=w$ to $X=x$ set of matches, we assume that teachers and classrooms pair up in some decentralized way. Within $W=0$ to $X=1$ matches we might observe that following frequency of sub-matches:

	S=0	S=1
R = 0	0.4	0.1
R = 1	0.1	0.4

The table above suggests that within $W=0$ to $X=1$ matches, eighty percent of high licensure score teachers match with classrooms with high prior achievement (0.4/0.5=0.8) Whereas eighty percent of teachers with low licensure score teachers match with classrooms with low prior achievement. Actual end-of-year achievement across the set of $W=0$ to $X=1$ classrooms (i.e., those with inexperienced teachers and small numbers of students), will be affected by this pattern of sorting.

Note that I have cooked up this example so that there is nothing special about the inexperienced teachers assigned to small classrooms. Their licensure score distribution is the same as that of teachers assigned to large classrooms. Likewise there is nothing special about the small classrooms assigned to inexperienced teachers. The distribution of prior student achievement in these classrooms is the same as that in small classrooms assigned to experienced teachers. This sounds like a setting where “doing something obvious” might work. Unfortunately this is not the case.

Average end-of-year achievement across $W=w$ to $X=x$ matches is given by

$$\mathbb{E}\left[\left.Y\right|W=w,X=x\right]=\sum_{r=0,1}\sum_{s=0,1}q(w,x,r,s)\times\Pr\left(\left.R=r,S=s\right|W=w,X=x\right)$$

If we evaluate this expression for $W=0$ to $X=1$ matches we get

$$\beta\left(w,x\right)=0\times0.4+2\times0.1+2\times0.1+8\times0.4=3.6$$

Even though there is no selection into the $W=0$ to $X=1$ matches, the sorting of teachers and classrooms within such matches biases our estimate of their effect. Remember our target estimand is the expected end-of-year achievement in a classroom where a typical inexperienced teacher is assigned to a typical small classroom. Above we calculated this effect to be 3, which is below the 3.6 achievement level actually observed in $W=0$ to $X=1$ matches.

Admittedly, even in this cooked and very simple example, there is a lot going on. My recap of covariate adjustment in single agent models took a few paragraphs and was, I think, relatively straightforward. Even after the much longer exposition given for the two-agent case you could be forgiven for still being confused. Furthermore, there are many special features of the example that are neither required, nor likely to occur in practice. In thinking about the example a helpful exercise might be to recall a setting where you were concerned about “sorting bias”, but perhaps didn’t have the language or notation to formalize your concerns. Maybe the set-up in our paper could help?

In the current paper I think Geert, Guido and I present a useful way for thinking about how to measure the average outcome across different types of matches. Nominally, the current paper dates to an invited presentation I gave at the 2012 European summer meetings of the Econometric Society in Malaga, Spain. In reality we first started thinking about these problems over ten years ago. I think the most important parts of the paper are conceptual, I also think they are the hardest to understand and appreciate. The formal results in the paper are not trivial, the efficiency bound calculation, for example, required some insight, but neither are they exceptionally difficult or non-standard. What was hard was stating the question, finding a notation conducive to getting some results, and making the first few conceptual steps.

I am excited about the paper. I think there is scope for some very interesting empirical applications and also further methodological work. While the paper has been presented at a number of conferences and workshops, it took a long time to complete an initial “working paper” that we could freely circulate. I am hopeful that with a draft-in-hand, we’ll get a new wave of interesting comments, feedback and criticism. I’d be especially interested to hear about empirical projects where are methods might be useful.

A copy of the paper can be found here. It should also appear as a CEMMAP, NBER and USC-INET working paper shortly.