The blog of Michael Wiebe

A YIMBY FAQ

2024-09-12T17:00:00+00:00

What caused the housing crisis?

Skyrocketing housing costs in big cities are caused by demand for housing increasing faster than supply. The key mechanism underlying the housing crisis is the demand cascade: richer people move in and bid up the price of new homes, which pushes less-rich locals to compete for old homes, which forces the poor to take on roommates, move away, or become homeless.

Isn’t the problem a lack of affordable housing in particular?

The housing market is interconnected. When we don’t build enough new market-rate housing, we get demand cascades that bid up the price of old housing. Housing that was affordable becomes expensive. Solving the housing crisis will require building more market-rate housing and more subsidized housing; subsidized housing by itself is not enough.

Moreover, the issue isn’t only that prices are too high for current residents. We also have to think of the people who want to move into the city, but are priced out.

But new homes are not affordable, so building more will not help the people with the greatest need.

First, new homes function as a yuppie fishtank: by containing rich newcomers in glass towers, we absorb their demand and prevent a demand cascade. Hence, even if new homes are more expensive, they still help protect the stock of affordable old homes.

Second, the price of a home is not an intrinsic feature of the home. Rather, it is determined by both supply and demand. The key idea here is a multi-unit auction. There are 5 units of a product for sale, and 100 bidders. The price is the lowest winning bid. With 5 units, the top 5 richest bidders win, and the price is the bid of the 5th richest bidder.

What if there are 50 units for sale? Then the top 50 richest bidders win, and the price is the bid of the 50th richest bidder. Since the 50th richest is willing to pay less than the 5th richest, the price falls. This shows how the price depends on the level of demand relative to supply.
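Here is a minimal Python sketch of the auction logic above. The bids (100 bidders willing to pay 1, 2, ..., 100) are made-up numbers for illustration, and `uniform_price` is a hypothetical helper name, not something from a library.

```python
def uniform_price(bids, units):
    """Multi-unit auction where the price is the lowest winning bid.

    Returns (winning_bids, price).
    """
    if units <= 0 or units > len(bids):
        raise ValueError("units must be between 1 and the number of bidders")
    ranked = sorted(bids, reverse=True)   # richest bidders first
    winners = ranked[:units]              # the top `units` bidders win
    return winners, winners[-1]           # price = bid of the last winner


# Hypothetical bids: 100 bidders willing to pay 1, 2, ..., 100.
bids = list(range(1, 101))

_, price_with_5_units = uniform_price(bids, 5)
_, price_with_50_units = uniform_price(bids, 50)

print("price with 5 units: ", price_with_5_units)
print("price with 50 units:", price_with_50_units)
```

The bids never change between the two runs; only the number of units does. That is the sense in which the price is not an intrinsic feature of the product.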

New homes do have higher costs than old homes, because they use new materials and modern construction practices. But when demand is high relative to supply, prices are driven far above construction costs, and lowering prices requires increasing supply.

You’re really saying that building luxury homes will solve the housing crisis?

Yes, building highrise condos is a step in the right direction. Apart from acting as yuppie fishtanks to absorb new demand, ‘luxury’ homes also free up affordable old housing through vacancy chains. When a local resident moves into a new apartment building, they free up their original unit; someone else moves into that unit, vacating their home; and so on, until homes in affordable neighborhoods are made available. Research shows that every new market-rate home added leads to 0.6 homes freed up in below-median-income neighborhoods.

The deeper issue is that highrise condos aren’t actually luxurious. True luxury is living in a big house on a quiet street with nice views, having a private yard and garage, and having easy access to big city jobs and amenities. This is what single-family zoning creates for the privileged few, at the expense of the many.

We’ve seen cities increasing housing supply, but prices keep going up. Doesn’t this disprove your supply theory?

On the contrary, this is exactly what a supply and demand model predicts would happen when demand is increasing and supply is (partially) restricted. Developers are able to build some new homes, but not enough to keep up with demand, so we still get a demand cascade, and prices of both new and old homes go up.
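To make this concrete, here is a small Python sketch with linear supply and demand curves. All of the numbers are invented for illustration (they are not estimates for any real city), and the function name `equilibrium` is my own. Demand is `demand_intercept - price`; supply is `supply_intercept + supply_slope * price`.

```python
def equilibrium(demand_intercept, supply_intercept, supply_slope):
    """Solve demand_intercept - p = supply_intercept + supply_slope * p."""
    price = (demand_intercept - supply_intercept) / (1 + supply_slope)
    quantity = demand_intercept - price
    return price, quantity


# Hypothetical starting point.
p0, q0 = equilibrium(demand_intercept=100, supply_intercept=20, supply_slope=0.5)

# Demand shifts out by 30, and the supply curve stays where it is:
# builders add some homes by moving along the curve, but not enough.
p1, q1 = equilibrium(demand_intercept=130, supply_intercept=20, supply_slope=0.5)

# Same demand shift, but the supply curve also shifts out by 30.
p2, q2 = equilibrium(demand_intercept=130, supply_intercept=50, supply_slope=0.5)

print(f"before:               price={p0:.1f} homes={q0:.1f}")
print(f"restricted supply:    price={p1:.1f} homes={q1:.1f}")
print(f"supply keeps up:      price={p2:.1f} homes={q2:.1f}")
```

In the restricted case the number of homes and the price both go up, which is the pattern people point to when they say supply "doesn't work". In the last case the same demand growth leaves the price unchanged.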

If more density was the solution, shouldn’t New York City be the most affordable city in the world?

Big cities are expensive because they offer amenities and jobs, so demand to live there is high. So big cities have high supply, but even higher demand. But it’s still true that increasing supply makes prices go down. If New York upzoned to allow more apartments, that would make it more affordable.

What about induced demand? Increasing supply just draws in more demand.

The idea of ‘induced demand’ is that we’re in a positive feedback loop where more people leads to more productivity, higher wages, and better amenities, which in turn leads to more people. NIMBYs use this to argue that increasing supply actually raises prices, so we need to block new apartments. But this argument is far more general than they realize.

The feedback loop is not driven by apartment buildings, but by more people. So if we build more suburbs and people move in and commute to their job downtown, they still contribute to higher productivity, driving the feedback loop. So to break out of the loop, we have to block all new housing, including in the suburbs.

And it goes even deeper. If college students study computer science and establish a tech sector, they are fuelling the feedback loop. When someone renovates an old building to open a high-end yoga studio or a fancy coffee shop, they are fuelling the feedback loop. If we’re really committed to avoiding induced demand, we need to clamp down on anything that could make the city nicer.

But this is absurd. Higher wages and nicer amenities are a good thing, and by assumption, these are net benefits, so the positives outweigh any negatives like increased congestion. Why should we turn down a net improvement?

And I suspect that NIMBYs will quickly abandon this argument once they realize that it commits them to opposing new single-family suburbs.

In any case, it’s not even clear whether the induced demand effect is big or small, or even negative. Maybe more people leads to more congestion, which induces people to move away. The effect likely varies by city, and by the size of the city.

For our purposes, induced demand is not an argument against legalizing apartment buildings.

Doesn’t upzoning increase land values? Higher land costs mean more expensive homes.

Upzoning a plot of land switches it from single-family to multi-family. From the landowner’s point of view, the price of that particular land will tend to rise, because apartments are worth more than houses. But from the developer’s point of view, the price of all multi-family land goes down, since there are now more options for land to build apartments on. The key is distinguishing between the price of the upzoned plot and the market price of all multi-family land.

And note that when the upzoning is broad enough, we increase the stock of multi-family land so much that a specific upzoned plot doesn’t increase in price. This happens when the marginal buyer is a single-family builder, who is only willing to pay single-family prices. Since the marginal buyer is what determines the market price, broad upzoning increases the option value of the land (because there are more legal uses) without increasing land prices.

Are there any upzoning success stories?

Not many. NIMBYs are politically powerful and have prevented reforms. It’s only with the housing crisis getting worse year after year that support for upzoning has started to win out.

In New Zealand, Auckland upzoned three-quarters of its residential land in 2016. The result: “six years after the Auckland Unitary Plan was enacted, rents for three-bedroom dwellings were 26–33 percent lower than they would have been, compared to rents in other urban areas in the country.”

What other factors have contributed to rising housing costs?

Some candidates for demand-side factors:

Demographics: with longer lifespans, Boomers are staying in their homes and not downsizing, making it harder for Millennials to find housing.
As divorce becomes more common, there are more separate households needing separate homes.
Cities have grown and run out of space to keep sprawling, or have reached the limit for how long people are willing to commute.
Jobs are increasingly moving to big cities.

People want to live in single-family homes, not apartments.

If so, then there’s no harm in zoning for apartments, because people will choose detached houses. Developers won’t build apartments, since they see that the market wants houses. The true test of what people want is to legalize apartments and see what happens on a level playing field.

The deeper issue here is that some people want single-family houses, and other people are willing to economize on housing and live in an apartment if it gets them a shorter commute. Why should public policy give special preference to people who want houses?

If we only have apartments, no one will have kids and the population will collapse.

Under the status quo where apartments are hardly allowed, developers economize by making units smaller. If we allowed tall apartment buildings everywhere and didn’t micromanage their dimensions, we would have large family-sized units where people would be happy to raise kids. The key: to make apartments larger on the inside, we need to make them larger on the outside.

This is an expensive city. If you can’t afford to live here, move somewhere else.

The price of housing is determined by supply and demand; it isn’t an intrinsic feature. We can choose to increase the supply of housing, making it more affordable. Whether or not a city is expensive is a policy choice, and we can choose differently.

Zoning policy should prioritize maintaining the neighborhood character for current residents.

Public policy should maximize overall social welfare, and not merely cater to one special interest group. There is a tradeoff between preserving neighborhood character for incumbent residents and providing housing so that people can live near their job and family. The YIMBY argument is that providing shelter is more important than neighborhood character, so we should sacrifice a lot of character to get more housing. NIMBYs value neighborhood character more, so they’re willing to sacrifice a lot of housing to preserve it.

If we are trying to do what’s best for society as a whole, we should legalize apartments, because (1) the number of people who would benefit from living near their job and family is much larger than the number of people who would lose their mountain view and easy street parking; and (2) the magnitude of the benefits gained by new residents from having housing is much larger than the magnitude of the losses in neighborhood character faced by incumbent residents.

Who are the winners and losers from upzoning?

A common argument is that homeowners oppose new housing to protect the financial value of their investment. This is not quite right; wouldn’t homeowners want to legalize apartments so they can sell to a developer for bags of cash? The actual pattern of financial winners and losers is more subtle.

Upzoning means replacing detached houses with apartments. Centrally-located houseowners benefit, because detached houses become scarcer and hence more valuable. In contrast, houseowners in the suburbs would lose, as new apartments in the city induce people to migrate from the suburbs to the city center, reducing demand for homes in the suburbs.

Apartment owners also lose, as the increased supply of apartments provides direct competition. Meanwhile, renters unambiguously win, with more rental options to pick from.

In the city center, housesellers gain from upzoning, because they now have the option of selling to an apartment-developer. Correspondingly, housebuyers lose, because houses are scarcer and more valuable. This is reversed in the suburbs, since detached houses are cheaper there.

So central houseowners are a clear winner from upzoning, in contrast to claims that houseowners oppose apartments in order to protect their property value. But this is for financial benefits. What about non-financial benefits?

Long-term houseowners (who aren’t looking to sell and plan on living there for decades) lose neighborhood character as apartments are built. They have less privacy, streets and sidewalks are busier, street parking is more difficult, there’s more noise, and new apartments block their views. (On the other hand, they gain from new cafes and stores, and new housing for their friends and children to live in.) So because central houseowners oppose upzoning, we infer that they care more about nonfinancial than financial benefits. (And note that suburban houseowners generally can’t vote in the city, so the fact that they lose financially is politically irrelevant.)

To make housing more affordable, do we need to make it a bad investment for current homeowners?

From above, we can see that this question is poorly framed. If we’re talking about upzoning, housing becomes more affordable by building cheaper types of homes (townhomes, apartments). This makes owning a townhome or apartment a worse investment, but makes owning a house a better investment. Suburban houses also become a worse investment as they get cheaper.

The deeper problem here is treating housing as a commodity. But it isn’t. There are different types, you can rent or buy, and the value of a home changes based on its location. Since housing is not a commodity, it doesn’t make sense for “housing” to become more affordable or for “housing” to become a worse investment. As we’ve seen, some types become more affordable, and some types become more expensive. The original question just doesn’t make sense.

(If we’re talking about improving affordability by building new subdivisions of detached houses or entire new towns, then we’d expect this new supply to reduce property values for owners living in nearby suburbs. Curiously, this is never raised as an issue.)

So how do NIMBYs explain the housing crisis?

Some NIMBYs accept supply and demand, but blame rising demand as the problem. In particular, if we just blocked immigration (and for the more extreme NIMBYs, deported landed immigrants), then there would be enough housing for native-born citizens.

Other NIMBYs think that landlords have become greedier or that “financialization” of the housing market (whatever that means) is the problem.

Do YIMBYs favor cutting immigration to reduce housing costs?

The point of housing policy is to provide shelter for people; reducing the number of people in order to make shelter cheaper is just giving up on the goal. (Notice: we could just deport everyone and trivially end the housing crisis!)

Immigration policy should not depend on housing policy. First choose an optimal immigration level, and then set your housing policy to accommodate it. Don’t restrict housing supply and then use that as an excuse to limit immigration.

There is a limit here; we can’t build enough houses for a billion immigrants in one year. But the housing market can adapt to real-world immigration targets if they are announced in advance.

Do YIMBYs support developers?

It depends. Some developers fight through zoning and permits to provide us with a fundamental need: housing. Many developers shut down or do not enter in the first place, deterred by onerous regulations.

On the other hand, big developers have established close relationships with city planners, allowing them to navigate the regulatory thicket. Big Dev doesn’t want clear, simple rules that would open the industry to new competition. Instead, they support a discretionary system where you need to be friends with the director of planning to get anything built.

If YIMBY is correct, why haven’t you won yet?

The coalition of self-interested YIMBY supporters is dispersed. Young people aren’t thinking about buying a home, so they don’t follow debates about zoning. Young homebuyers do care about housing policy, but they’re looking to buy now, and any activism won’t have an effect on current prices. Recent buyers who moved to the suburbs and would have lived in an apartment in the city are now settled down and aren’t likely to uproot themselves.

We should expect the YIMBY coalition to be fractured. If it was easy to organize, housing wouldn’t be such an intractable issue.

In contrast, NIMBYs are easily organized. They are established homeowners who are already active in local politics, and it’s easy to walk down the street and find allies. New housing imposes a clear cost (noise, congestion, lost views, more difficult street parking). NIMBYs being naturally well-organized based on self-interest explains why they have been so successful.

This suggests that YIMBY success depends on forming an ideological coalition, rather than a self-interested one. To overcome the self-interest of NIMBYs, we need people to fight for justice.

Isn’t upzoning violating the property rights of homeowners?

On the contrary. Upzoning is giving homeowners a legal right they didn’t have before: the right to build apartments on their own land.

Some NIMBYs take an alternative view of property rights that includes preventing their neighbors from building apartments on their land; this is a non-standard view, to say the least.

When I moved into this neighborhood, it was zoned for single-family houses. Changing that is a violation of an implicit contract.

It would have to be an implicit contract, because it sure isn’t listed on the title deed. Yes, homeowners often expect their neighborhood to remain the same as when they bought it. But this doesn’t overrule the property right of the landowner to build an apartment.

Having this expectation is also unreasonable in many cases. If you bought a house in a growing city and expected growth to stop after you moved in, then you have poor judgment. That’s just not how cities work. If you were attracted by jobs and amenities, then you should have noticed that the economy was thriving, and other people would be similarly drawn in. If you wanted a neighborhood that wouldn’t change, you should have moved into a declining city.

But property rights aren’t absolute. The local government democratically chose single-family zoning, so it’s undemocratic for a state government to force upzoning.

The problem with this argument is that it gives no voice to citizens who currently live elsewhere but want to move in. What’s really undemocratic is how local control gives a voice only to incumbent residents. In fact, federal control over zoning is arguably a better option, because housing policy determines whether people can move across the country, and state governments are not accountable to out-of-state migrants.

Tokyo is governed by a national zoning law:

Instead of allowing the people who live in a neighborhood to prevent others from living there, Japan has shifted decision-making to the representatives of the entire population, allowing a better balance between the interests of current residents and of everyone who might live in that place.

Upzoning will destroy heritage neighborhoods

Old neighborhoods will change; whether they will be “destroyed” is up for interpretation. The key issue is the cost of preserving heritage: are we willing to accept high housing costs in order to turn some neighborhoods into a museum exhibit? This should be decided democratically by everyone.

If people want to preserve heritage houses or neighborhoods, they’re always welcome to pay the market price and keep it as is. And if they’re not willing to pay that much, it just shows that heritage is a suboptimal use of the land (since apartment-dwellers, via developers, are willing to pay more). We shouldn’t subsidize heritage at the expense of housing affordability.

A supply and demand model of housing, part 2: unit demand

2024-09-05T17:00:00+00:00

In the last post I used continuous quantities of housing: people could buy, say, 1.26 homes. Here I show how a supply and demand model works with discrete quantities, and with a unit demand constraint, where people buy at most one unit of housing. This involves a different setup than the perfect substitutes case, but we arrive at the same conclusions regarding vacancy chains, demand cascades, and yuppie fishtanks.

A supply and demand equilibrium

Consumers start with a willingness to pay for Old and New apartments. They buy the housing type with the largest difference between valuation and price: \(v-p\). To satisfy the unit demand constraint, their demand only counts in the market of the type they choose. Without this constraint, one rich person could buy up all of the homes; the unit demand constraint prevents this by requiring them to buy their highest-payoff type and removing their demand for the other types. Here, there are five consumers, and everyone values New apartments more than Old.

I’ll show an initial equilibrium where Person 1 buys a New apartment, and Persons 2 and 3 buy Old. Because Person 1 buys New, their demand is removed from the Old market. Hence, the aggregate demand curve for Old apartments looks like this:

Since each person demands one unit, aggregate demand tops out at four. Next, here’s the supply and demand equilibrium for Old apartments. The supply curve is vertical, representing a fixed stock of Old apartments. Since the supply curve overlaps the demand curve for some segment, technically the equilibrium is not well-defined. However, this is not a big deal, because we can pretend the supply curve is slightly sloping upward, so the intersection is unique (though this brings in continuous quantities). Alternatively, if we had a large number of consumers, the demand curve would be a straight line that intersects supply uniquely; working with discrete quantities is just a pain.

So Persons 2 and 3 each get an Old apartment, and the price is 2. Note that Persons 4 and 5 are initially priced-out, and don’t get a home. We can interpret this as having to move away, or living in their car (or being homeless).

Next we have the aggregate demand curve for New apartments. Here we show Persons 1, 4, and 5, because Persons 2 and 3 buy Old apartments, so their demand for New is removed (by the unit demand constraint).

In the initial equilibrium, only Person 1 gets a New apartment, and the price is 4. The supply curve is vertical, representing zoning constraints: developers cannot build any more, even if prices went up. Note that Persons 4 and 5 are priced out here as well.

Vacancy chains

To show a vacancy chain, we increase the supply of New apartments. This will allow Person 2 to upgrade, vacating their Old apartment, which in turn allows Person 4 to get a home (instead of being homeless).

The higher supply of New apartments reduces the price, as we move down the demand curve. This lower price induces Person 2 to re-evaluate their housing choice. Their willingness to pay is 3.7 for New apartments, and 2.2 for Old. The price for New apartments was 4, so they chose Old. But now the price is 2, so their payoff for a New apartment is 3.7 - 2 = 1.7, and the payoff for an Old apartment is 2.2 - 2 = 0.2. Hence, the New apartment is a better deal, so they switch markets from Old to New.

Person 2 upgrading means that we also change which market their demand contributes to. So we subtract their demand from the Old demand curve, and add it to the New demand curve, shifting it to the right. For New apartments, this pushes up the price. Note that we can end up out of equilibrium if this increase in price is not consistent with Person 2’s willingness to pay. Here, I’ll use the modelling flexibility we have from the supply and demand curve overlapping to choose a new equilibrium price of 3.

Since Person 2 upgrades, we remove their demand for Old apartments, shifting it left. Now Persons 3 and 4 are able to get an Old apartment, and the price falls to 1.6. (That is, supply intersects demand on the section of the demand curve where Persons 3 and 4 buy.) This is the vacancy chain in action: Person 4 was initially priced out and didn’t have a home, but by adding a New apartment, we have freed up an Old apartment for Person 4 to live in!

To make sure this is an equilibrium, let’s check the payoffs. Person 1 is happy to buy New, because they get 4.7 - 3 = 1.7 from New and 2.6 - 1.6 = 1 from Old. Person 2’s payoffs are 3.7 - 3 = 0.7 (New) and 2.2 - 1.6 = 0.6 (Old), so they buy New. Person 3 gets 2.2 - 3 < 0 from New and 2 - 1.6 = 0.4 from Old, so they buy Old. So each consumer’s choice is consistent; hence, these prices define an equilibrium.
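The same check can be sketched in a few lines of Python. The valuations for Persons 1 to 3 are the ones used in the arithmetic above; Persons 4 and 5 are left out because their exact willingness to pay isn't stated in the text. The helper name `best_choice` is my own.

```python
# Willingness to pay for Persons 1-3, taken from the payoff arithmetic above.
# Persons 4 and 5 are omitted: their valuations are not stated in the post.
VALUATIONS = {
    "Person 1": {"New": 4.7, "Old": 2.6},
    "Person 2": {"New": 3.7, "Old": 2.2},
    "Person 3": {"New": 2.2, "Old": 2.0},
}


def best_choice(valuation, prices):
    """Return the housing type with the largest payoff v - p.

    Returns None if every payoff is negative (the person is priced out).
    """
    payoff = {kind: valuation[kind] - prices[kind] for kind in prices}
    kind = max(payoff, key=payoff.get)
    return kind if payoff[kind] >= 0 else None


# Prices before the new supply, and the new equilibrium prices.
before = {"New": 4.0, "Old": 2.0}
after = {"New": 3.0, "Old": 1.6}

choices_after = {name: best_choice(v, after) for name, v in VALUATIONS.items()}
print(choices_after)

# Person 2 is the one who switches markets when New gets cheaper.
print("Person 2 before:", best_choice(VALUATIONS["Person 2"], before))
print("Person 2 after: ", best_choice(VALUATIONS["Person 2"], after))
```

This only verifies that each listed person's choice is a best response to the prices; it does not search for the equilibrium prices themselves.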

Note that with unit demand, we are able to have consumers being priced out (like Person 4 initially). With continuous quantities, this was not possible, since they could always buy some infinitesimal quantity, and so can’t be priced out completely. But with discrete units, Person 4 initially doesn’t get anything, then thanks to the vacancy chain, they get an Old apartment.

Demand cascade and yuppie fishtank

Now let’s see how demand cascades and yuppie fishtanks work in the unit demand case. Since people can be priced out entirely, demand cascades will push people out of old apartments, and yuppie fishtanks will allow them to avoid being pushed out.

We’ll use the same willingness to pay as before, but this time we’ll start with Person 1 living outside of the city. Person 2 starts in a New apartment, so they don’t contribute to demand for Old apartments. That leaves Persons 3, 4, and 5 making up the aggregate demand curve for Old apartments.

There are two Old units, so the equilibrium has Persons 3 and 4 each getting an Old unit at a price of 1.6.

For New apartments, Persons 2 and 5 contribute to the demand curve.

But there’s only one New unit supplied, so only Person 2 gets a New unit in the equilibrium. The price is 3, which ensures that everyone is choosing their best option. Person 5 is priced out of both housing types, and has to live in their car.

The demand cascade starts with Person 1 moving in and bidding for a New apartment. This shifts the demand curve to the right (from D1 to D2a), increasing the price from 3 to 4.

Since Person 1 is richer, they outbid Person 2 for the New apartment, so now Person 2 downgrades to an Old apartment. Hence, we remove their demand from the New demand curve; this does not affect the new equilibrium. Now only Persons 1 and 5 contribute to the aggregate demand curve.

Turning to the Old market, we add in Person 2’s demand, shifting the demand curve to the right. This pushes up the price, from 1.6 to 2. Now Persons 2 and 3 get an Old apartment, and Person 4 is priced out and doesn’t have a home. This is the demand cascade: when high-income Person 1 moves in, they bid up the price of new apartments, leading Person 2 to downgrade to an old apartment, which leaves Person 4 being priced out. In short, when supply is restricted, rich people moving in forces poor people to move out.
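As a rough check on the cascade, the Python sketch below counts how many Old units are left over for Persons 4 and 5 before and after Person 1 arrives. It uses the willingness-to-pay numbers for Persons 1 to 3 from the vacancy chain section; Persons 4 and 5 are not modelled because their valuations aren't stated, and `occupancy` is a hypothetical helper name.

```python
# Willingness to pay for Persons 1-3, as used in the vacancy chain section.
# Persons 4 and 5 are left out because the post does not state their values.
VALUATIONS = {
    1: {"New": 4.7, "Old": 2.6},
    2: {"New": 3.7, "Old": 2.2},
    3: {"New": 2.2, "Old": 2.0},
}
OLD_UNITS = 2  # fixed stock of Old apartments


def occupancy(residents, prices):
    """Map each housing type to the residents who choose it at these prices."""
    taken = {"New": [], "Old": []}
    for person in residents:
        payoff = {kind: VALUATIONS[person][kind] - prices[kind] for kind in taken}
        kind = max(payoff, key=payoff.get)
        if payoff[kind] >= 0:  # buy only if not priced out
            taken[kind].append(person)
    return taken


# Before Person 1 moves in: New costs 3, Old costs 1.6.
baseline = occupancy([2, 3], {"New": 3.0, "Old": 1.6})
# After Person 1 moves in with supply fixed: New costs 4, Old costs 2.
cascade = occupancy([1, 2, 3], {"New": 4.0, "Old": 2.0})

for label, taken in [("baseline", baseline), ("cascade", cascade)]:
    old_left = OLD_UNITS - len(taken["Old"])
    print(label, taken, "Old units left for Persons 4-5:", old_left)
```

In the baseline, Person 3 takes one Old unit and one is left for Person 4. After the cascade, Persons 2 and 3 fill both Old units, so nothing is left further down the line. Note that Person 3 is exactly indifferent at an Old price of 2; the sketch assumes an indifferent person still buys.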

But what if supply is not restricted, and grows to match the increase in demand? This is the yuppie fishtank: by building a new glass tower to contain Person 1, we absorb their demand and protect the old apartments from competition. So we shift supply from 1 to 2, reducing the price.

This lower price induces Person 2 to upgrade back to a New apartment. So the demand curve shifts right, and the price is 3. This is the same as the initial equilibrium: the yuppie fishtank has absorbed Person 1’s demand, keeping prices flat.

And since Person 2 is not competing for an Old apartment, the demand curve shifts left. Thanks to the yuppie fishtank, Person 4 gets to stay in their Old unit, and isn’t priced out.

Notice that a yuppie fishtank can cancel out a demand cascade caused by a rich newcomer, while a vacancy chain triggered by a rich local upgrading is an inverse demand cascade.

See here and here for code to produce the graphs.

A supply and demand model of housing, part 1: continuous quantities

2024-08-20T17:00:00+00:00

When we build new apartments, the people moving in vacate their old apartments; this reduces competition for old housing, making it more affordable. This is the vacancy chain effect: building expensive new housing helps improve affordability of cheaper old housing.

When rich people move into a city, they outcompete locals for new homes; those locals in turn compete for the stock of old homes, raising prices for poor people. This demand cascade is the mechanism driving the housing crisis. We can reverse demand cascades by building yuppie fishtanks to absorb the demand of rich newcomers.

In this post I’ll show how to use supply and demand to model vacancy chains, demand cascades, and yuppie fishtanks. I explain how to formalize the key mechanism in these scenarios, where changes to one market have effects on another market. This post treats quantity as continuous (so people can demand 1.26 homes, say) to show how standard microeconomics applies to housing; a future post will consider discrete quantity and unit demand.

A supply and demand equilibrium

Let’s consider an example of n=3 consumers with perfect substitutes preferences. People choose between Old and New apartments, and have a housing budget to spend. They differ in how much they prefer New vs Old apartments, and in the size of their housing budget. With perfect substitutes, you choose one good or the other, and spend your entire budget on the chosen good. For example, Person 1 will buy Old if price(Old) is less than nine-tenths of price(New). Conversely, they buy New if price(New) is less than 10/9 times price(Old). (We use two housing types as the simplest model that demonstrates the key mechanism; in reality, there are dozens of quality gradations.)

Here are the individual demand curves for Old apartments. Preference for Old (vs New) is decreasing from Person 1 to Person 3 (since Person 1 chooses an Old apartment at the highest price) while budget size is increasing (at price=2, Person 3 demands the largest quantity). Since New apartments are better, everyone gets more utility from a New apartment compared to an Old one, but they differ in how much more.

People choose between Old and New apartments based on a threshold determined by their preferences and the ratio of prices of Old and New apartments. Demand for Old apartments is equal to 0 above the threshold (here the demand curve overlaps the y-axis), is indeterminate at the threshold (the horizontal segment), and is equal to budget divided by price below the threshold (recall the shape of a 1/x function). Technically, the demand function has a discontinuity at the threshold, so you can imagine the horizontal part as being a dotted line. Also note that I’m referring to quantity demanded as a function of price, even though price is on the y-axis. Here I’m following the convention to plot price against quantity.

For example, Person 1 (blue) buys an Old apartment when the price is less than 9. Their quantity demanded is equal to their income (10) divided by the price; at price=2, this is 10/2=5. When the price is above 9, they demand 0 Old apartments and instead choose a New apartment; we’ll see that when we plot the demand curves for New apartments. When the price is equal to 9, their demand curve has a discontinuity and jumps across the horizontal segment.
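Here is a short Python sketch of a single consumer's demand under perfect substitutes. The weights (9 and 10) and the budget (10) for Person 1 come from the utility functions in the Math section; price(New) = 10 is my inference from the threshold of 9, not a number stated directly. The function name `demand` is hypothetical.

```python
def demand(p_old, p_new, a, b, budget):
    """Demand for (Old, New) under utility u = a * x_old + b * x_new.

    The consumer spends the whole budget on Old when
    p_old <= (a / b) * p_new, and on New otherwise. The comparison is
    cross-multiplied to avoid floating-point division at the threshold.
    """
    if b * p_old <= a * p_new:
        return budget / p_old, 0.0
    return 0.0, budget / p_new


# Person 1: u = 9 * x_old + 10 * x_new, budget 10.
# Assumption: price(New) = 10, which puts the threshold at price(Old) = 9.
for p_old in (2, 9, 9.5):
    x_old, x_new = demand(p_old, p_new=10, a=9, b=10, budget=10)
    print(f"price(Old)={p_old}: Old demanded={x_old:.2f}, New demanded={x_new:.2f}")
```

At the threshold itself the consumer is indifferent; the sketch assigns the tie to Old, following the convention in the demand functions.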

Note that demand for Old apartments is a function of the prices of both Old and New apartments, but I’ve plotted it using only price(Old). What’s going on? The short answer is that I’m showing you one slice of the entire demand function, evaluated at a particular value of price(New). I’ll return to this later on.

Math

Let \(x_{1}\) = Old apartments, \(x_{2}\) = New apartments. Let \(x_{i,j}\) be the quantity demanded by consumer \(i\) of good \(j\), where consumer \(i\) has income \(m_{i}\). Given perfect substitutes utility \(u(x_{1},x_{2}) = a x_{1} + b x_{2}\), we can derive the demand functions with the threshold defined by equating the marginal rate of substitution (\(a/b\)) with the price ratio (\(p_{1}/p_{2}\)). If \(p_{1}/a < p_{2}/b\), so that Old delivers more utility per dollar, the consumer chooses \(x_{1}\) and spends their entire budget on it; if \(p_{1}/a > p_{2}/b\), they choose \(x_{2}\). (At equality they are indifferent; the formulas below assign the tie to \(x_{1}\).) With perfect substitutes preferences, we get a corner solution: the optimal choice is one or the other, and not a mix of both. Note that the demand functions take both \(p_{1}\) and \(p_{2}\) as arguments, so they define a 3-dimensional surface. I use supply functions \(S_{1}(p_{1}) = 13\) (perfectly inelastic supply) and \(S_{2}(p_{2}) = \frac{3}{10} p_{2}\). \(S_{2}\) shifts to \(S_{2}^{new}(p_{2}) = \frac{50}{9} p_{2}\).

\[\begin{aligned}
&u_{1}(x_{1},x_{2}) = 9x_{1} + 10x_{2} \\
&u_{2}(x_{1},x_{2}) = x_{1} + 4x_{2} \\
&u_{3}(x_{1},x_{2}) = x_{1} + 5x_{2} \\
&m_{1} = 10 \\
&m_{2} = 20 \\
&m_{3} = 30 \\
&x_{1,1} = \begin{cases} 0, & p_{1} > \frac{9p_{2}}{10} \\ \frac{m_{1}}{p_{1}}, & p_{1} \leq \frac{9p_{2}}{10} \end{cases} \\
&x_{1,2} = \begin{cases} \frac{m_{1}}{p_{2}}, & p_{1} > \frac{9p_{2}}{10} \\ 0, & p_{1} \leq \frac{9p_{2}}{10} \end{cases} \\
&x_{2,1} = \begin{cases} 0, & p_{1} > \frac{p_{2}}{4} \\ \frac{m_{2}}{p_{1}}, & p_{1} \leq \frac{p_{2}}{4} \end{cases} \\
&x_{2,2} = \begin{cases} \frac{m_{2}}{p_{2}}, & p_{1} > \frac{p_{2}}{4} \\ 0, & p_{1} \leq \frac{p_{2}}{4} \end{cases} \\
&x_{3,1} = \begin{cases} 0, & p_{1} > \frac{p_{2}}{5} \\ \frac{m_{3}}{p_{1}}, & p_{1} \leq \frac{p_{2}}{5} \end{cases} \\
&x_{3,2} = \begin{cases} \frac{m_{3}}{p_{2}}, & p_{1} > \frac{p_{2}}{5} \\ 0, & p_{1} \leq \frac{p_{2}}{5} \end{cases}
\end{aligned}\]
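If it helps to see the piecewise formulas as code, here is a minimal Python sketch of one person's demand. It is not the code used for the graphs; the function name `demand` and the `PEOPLE` table are my own labels, and the only inputs are the utility weights and incomes listed above.

```python
def demand(a, b, m, p_old, p_new):
    """Return (quantity of Old, quantity of New) for utility a*x_old + b*x_new and income m."""
    # Buy Old when it gives at least as much utility per dollar:
    # a/p_old >= b/p_new, i.e. p_old <= (a/b) * p_new. Ties go to Old,
    # matching the <= in the piecewise formulas.
    if p_old * b <= a * p_new:
        return (m / p_old, 0)
    return (0, m / p_new)

# (a, b, income) for Persons 1-3, taken from the utility functions above
PEOPLE = {1: (9, 10, 10), 2: (1, 4, 20), 3: (1, 5, 30)}

# Person 1 at price(Old)=2, price(New)=10 buys 10/2 = 5 units of Old
print(demand(*PEOPLE[1], p_old=2, p_new=10))    # (5.0, 0)
# Above the threshold of 9 they switch to New
print(demand(*PEOPLE[1], p_old=9.5, p_new=10))  # (0, 1.0)
```

The whole budget always goes to a single good, which is the corner solution described above.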

What does it mean to demand multiple apartments? A more complicated model would have unit demand (i.e., people demand either one Old or one New apartment), but it’s simpler to start with continuous quantity. For now, you can interpret quantity as the square footage of the apartment, which is a continuous variable.

We get aggregate demand by summing up the individual demand curves: for each price, count the quantity demanded across consumers. This produces downward-sloping segments where we can identify which consumers contribute to aggregate demand. (Recall that the horizontal segments are discontinuities, so I have the arrows pointing at the sloped portions of the curve.) Since Person 1 has the strongest preference for Old apartments, they show up at the top of the demand curve, while Person 3 (with the weakest preference) shows up only at the bottom.
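As a rough illustration of the summation (a sketch, not the plotting code), the snippet below adds up individual demand for Old apartments at price(New) = 10 and reports which people are on each segment. The helper names `old_buyers` and `aggregate_old_demand` are hypothetical.

```python
# (a, b, income) for Persons 1-3
PEOPLE = {1: (9, 10, 10), 2: (1, 4, 20), 3: (1, 5, 30)}

def old_buyers(p_old, p_new):
    """People whose threshold is at or above p_old, so they choose Old."""
    return [i for i, (a, b, m) in PEOPLE.items() if p_old * b <= a * p_new]

def aggregate_old_demand(p_old, p_new):
    """Sum of income / price over everyone who chooses Old."""
    return sum(PEOPLE[i][2] for i in old_buyers(p_old, p_new)) / p_old

# One price from each downward-sloping segment, holding price(New) at 10
for p_old in (5, 2.25, 1):
    print(p_old, old_buyers(p_old, 10), aggregate_old_demand(p_old, 10))
```

At a price of 5 only Person 1 buys; at 2.25 Persons 1 and 2 buy; at 1 all three buy, which is why Person 3 appears only at the bottom of the curve.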

Next, let’s see the individual demand curves for New apartments. These are the reverse of demand for Old: because Person 1 had the strongest preference for Old (vs New), they have the weakest preference for New (vs Old). Again, budget size is increasing from Person 1 to 3. And keep in mind that these demand curves are evaluated at a specific value of price(Old).

Here’s aggregate demand for New apartments. Now Person 3 has the strongest preference, so they contribute to aggregate demand in all segments, while Person 1 contributes only in the bottom segment.

An equilibrium is prices \((P_{Old}, P_{New})\) that set quantity supplied equal to aggregate quantity demanded in both markets. This is where the demand curve being evaluated at the price of the other good is key: the equilibrium price in one market has to be consistent with the equilibrium price in the other. I’ll show an equilibrium where Person 2 initially buys an Old apartment. To capture the vacancy chain effect, we’ll increase supply of New apartments, lowering the price, which induces Person 2 to upgrade from Old to New and reduces demand for Old apartments, thereby reducing pressure in that market.

Here’s an equilibrium for Old apartments. To capture the stock of Old apartments being fixed, I use a vertical supply curve; no matter what the price is, developers can’t add any more old apartments (at least in the short run). I’ve chosen the supply curve so that Persons 1 and 2 buy Old, while Person 3 buys New (i.e., we’re on the second segment of the demand curve for Old apartments). The price is set by the intersection of the supply and demand curves; here, \(P_{Old} = 2.3\).

Here’s the corresponding equilibrium for New apartments. In this case, the supply curve is upward sloping, reflecting that developers are willing to build more units at higher prices (and that building them is permitted by zoning, which may be an unrealistic assumption). Supply intersects demand on the segment of the demand curve where only Person 3 buys a New apartment. Here the price is \(P_{New} = 10\).

You can confirm that these prices are the ones I used when plotting the demand curves for the other good. For example, the first graph shows the individual demand curves for Old apartments when price(New) = 10.
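You can also check the equilibrium numerically. The sketch below uses exact fractions and the supply functions from the Math section (a fixed stock of 13 Old units, and New supply of 3/10 times the price). Solving 30/p = 13 gives an exact Old price of 30/13, which rounds to the 2.3 shown in the graph. Function names are my own.

```python
from fractions import Fraction as F

PEOPLE = {1: (9, 10, 10), 2: (1, 4, 20), 3: (1, 5, 30)}  # (a, b, income)

def aggregate_demand(p_old, p_new):
    """Return (total Old demanded, total New demanded) at the given prices."""
    q_old = q_new = F(0)
    for a, b, m in PEOPLE.values():
        if p_old * b <= a * p_new:   # Old is at least as good per dollar
            q_old += F(m) / p_old
        else:
            q_new += F(m) / p_new
    return q_old, q_new

def supply(p_old, p_new):
    """Fixed stock of Old; New supply rises with its price."""
    return F(13), F(3, 10) * p_new

p_old, p_new = F(30, 13), F(10)
print(aggregate_demand(p_old, p_new) == supply(p_old, p_new))  # True
print(round(float(p_old), 2))                                  # 2.31
```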

When a consumer buys one good, what happens to their demand for the other one? With perfect substitutes preferences, the demand is still there, but it can never be realized, because perfect substitutes involves a knife-edge threshold. For Person 2, if price(Old) \(<\) price(New)/4, they buy Old; otherwise they buy New. Here, that is 2.3 \(<\) 10/4 = 2.5, hence they buy Old. If the prices were different in a way that flipped the inequality, they would buy New. (But to be an equilibrium, the supply curve would have to match the quantity demanded.) In the next post on unit demand, I’ll show how this logic changes.

Stepping back, note that these demand functions are actually 2D slices of a 3D demand surface. Demand for Old apartments depends on the prices of both Old and New apartments, so it makes sense that the demand curve will be plotted as a function of two variables (the two prices). Here’s a plot of an individual demand surface (for Old apartments). The demand curves plotted above are slices of this surface, taken at specific values of price(New). If you rotate your head to the left, you can see it: when price(Old) is high, demand is 0 (overlapping the y-axis); the original horizontal discontinuity becomes vertical here in 3D; and as price(Old) goes to 0, quantity becomes large, following the 1/x shape.

Changing the price of the other good moves you along the surface, but in 2D this will look like shifting the demand curve. This is important, because changing the cross-price does not shift demand in the sense of changing preferences. Instead, we’re evaluating the same demand function at a different price. For example, moving from price(New)=8 to price(New)=4 does not represent a change in preferences, but the demand curves (evaluated as slices at those prices) will look different, reflecting different tradeoffs (i.e., as New apartments get cheaper, my demand for Old apartments falls).
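To make the slicing idea concrete, here is a small sketch that evaluates Person 2's demand for Old apartments on the same grid of price(Old) values at two different values of price(New). The function is identical in both calls; only the slice changes. The grid values are arbitrary choices for illustration.

```python
def old_demand(a, b, m, p_old, p_new):
    """Demand for Old under perfect substitutes: all income or nothing."""
    return m / p_old if p_old * b <= a * p_new else 0.0

def old_demand_slice(a, b, m, p_new, p_old_grid):
    """A 2D slice of the demand surface, holding price(New) fixed."""
    return [old_demand(a, b, m, p, p_new) for p in p_old_grid]

grid = [0.5, 1.5, 2.5]                      # arbitrary price(Old) values
person_2 = (1, 4, 20)                       # (a, b, income)
print(old_demand_slice(*person_2, p_new=8, p_old_grid=grid))
print(old_demand_slice(*person_2, p_new=4, p_old_grid=grid))
```

With price(New) = 8 the threshold is 2, so Person 2 still buys Old at a price of 1.5. With price(New) = 4 the threshold drops to 1, and the same person demands zero Old at 1.5, even though their preferences are unchanged.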

Vacancy chains

Let’s see how a vacancy chain works. The idea is that building new apartments will induce people to upgrade from Old to New, thereby reducing demand for Old apartments and making them more affordable.

We start the chain by increasing the supply of new apartments. For example, upzoning reduces land costs, which makes developers willing to produce more at every price. Shifting from S1 to S2, we move down the demand curve, with a new price \(P_{New}=3\).

Note that now supply intersects demand in the segment where both Person 3 and 2 buy New. Induced by the lower price of New apartments (\(P_{New}\) falls from 10 to 3), Person 2 has upgraded from Old to New! So we should see a corresponding drop in demand for Old apartments.

Here’s the effect on Old apartments. The demand curve ‘shifts’ down from D1 to D2, and now supply intersects demand on the segment where only Person 1 buys Old. The price falls from \(P_{Old}=2.3\) to \(P_{Old}=0.8\). Following the intuition for substitute goods, when price(New) falls, Old apartments become relatively less attractive, so demand for Old apartments decreases.

As mentioned above, this is not really a shift in demand, but a movement along the 3D demand surface caused by changing price(New). Person 2’s preferences haven’t changed; they just make a different choice at a different price.

So we’ve shown the vacancy chain effect: increasing the supply of new apartments reduces their price (from 10 to 3), which induces people to upgrade from old to new apartments, thereby reducing demand for and the price of old apartments (from 2.3 to 0.8). Thanks to the new apartments being built, Person 1 now gets all 13 of the Old units to themself, and at a lower price.

But if price(Old) fell, shouldn’t that reduce demand for New apartments too? Yes. These are substitute goods, so a reduction in the other price makes a good less appealing. In this case, demand for New apartments ‘shifts’ down, but not enough to change the equilibrium. Hence, there are no more changes, and we’ve converged on a new equilibrium, with \((P_{Old}, P_{New})\) = (0.8, 3).
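Here is a sketch that checks both equilibria of the vacancy chain with exact fractions. It assumes the supply functions from the Math section (13 Old units; New supply of 3/10 times the price before the shift and 50/9 times the price after). The exact Old prices are 30/13 and 10/13, which the text rounds to 2.3 and 0.8. Function names are hypothetical.

```python
from fractions import Fraction as F

PEOPLE = {1: (9, 10, 10), 2: (1, 4, 20), 3: (1, 5, 30)}  # (a, b, income)

def new_buyers(p_old, p_new):
    """People who choose New at these prices (Old wins ties)."""
    return [i for i, (a, b, m) in PEOPLE.items() if p_old * b > a * p_new]

def clears(p_old, p_new, new_supply_slope, old_stock=F(13)):
    """True if both markets clear with S_old = old_stock and S_new = slope * p_new."""
    movers = new_buyers(p_old, p_new)
    q_new = sum(F(PEOPLE[i][2]) for i in movers) / p_new
    q_old = sum(F(m) for i, (a, b, m) in PEOPLE.items() if i not in movers) / p_old
    return q_old == old_stock and q_new == new_supply_slope * p_new

before = (F(30, 13), F(10))
after = (F(10, 13), F(3))
print(new_buyers(*before), clears(*before, F(3, 10)))  # [3] True
print(new_buyers(*after), clears(*after, F(50, 9)))    # [2, 3] True
print(clears(*before, F(50, 9)))                       # False
```

The last line shows that the old prices no longer clear the market once supply has shifted, so prices have to adjust.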

If we added more consumers, we could show how the vacancy chain allows a poorer person to afford an Old apartment at the new lower price, when they were originally priced out. And with more housing sub-markets (with multiple degrees of quality, instead of only Old vs New), we could trace out the vacancy chain itself, with someone upgrading in each submarket and reducing demand for their original housing type, thereby enabling someone in the submarket below to upgrade.

Demand cascades and yuppie fishtanks

Now let’s look at demand cascades and yuppie fishtanks. When rich people move into the city, they increase demand for expensive new apartments. This pushes locals to downgrade, increasing demand for old apartments. The end result is higher prices for both old and new housing; sound familiar?

But when we increase the supply of new apartments to match the rise in demand, we can prevent the demand cascade and stop the price of old apartments from going up. This is a yuppie fishtank, where we build shiny glass towers to contain the rich yuppies moving in and absorb their demand for housing.

To illustrate, I’ll show an example with two people forming an initial equilibrium, which is disrupted by a rich person moving in.

Click below to see the underlying demand functions.

Math

Let \(x_{1}\) = Old apartments, \(x_{2}\) = New apartments. Let \(x_{i,j}\) be quantity demanded for consumer \(i\) of good \(j\), with income \(m_{i}\). Given perfect substitutes utility \(u(x_{1},x_{2}) = a x_{1} + b x_{2}\), we can derive the demand functions with the threshold defined by equating the marginal rate of substitution (\(a/b\)) with the price ratio (\(p_{1}/p_{2}\)). If \(p_{1}/a \leq p_{2}/b\) (Old costs no more per unit of utility), the consumer chooses \(x_{1}\) and spends their entire budget on it; otherwise, they choose \(x_{2}\). With perfect substitutes preferences, we get a corner solution: the optimal choice is one or the other, and not a mix of both. Note that the demand functions take both \(p_{1}\) and \(p_{2}\) as arguments, so they define a 3-dimensional surface. I use supply functions \(S_{1}(p_{1}) = 13\) (perfectly inelastic supply) and \(S_{2}(p_{2}) = \frac{20}{21} + \frac{40}{21} p_{2}\), which passes through the two equilibria at \(P_{New} = 3\) and \(P_{New} = 10\). \(S_{2}\) shifts to \(S_{2}^{new}(p_{2}) = \frac{20}{21} + \frac{1520}{63} p_{2}\).

\[\begin{aligned}
&u_{1}(x_{1},x_{2}) = 9x_{1} + 10x_{2} \\
&u_{2}(x_{1},x_{2}) = x_{1} + 4x_{2} \\
&u_{3}(x_{1},x_{2}) = x_{1} + 15.6x_{2} \\
&m_{1} = 10 \\
&m_{2} = 20 \\
&m_{3} = 200 \\
&x_{1,1} = \begin{cases} 0, & p_{1} > \frac{9p_{2}}{10} \\ \frac{m_{1}}{p_{1}}, & p_{1} \leq \frac{9p_{2}}{10} \end{cases} \\
&x_{1,2} = \begin{cases} \frac{m_{1}}{p_{2}}, & p_{1} > \frac{9p_{2}}{10} \\ 0, & p_{1} \leq \frac{9p_{2}}{10} \end{cases} \\
&x_{2,1} = \begin{cases} 0, & p_{1} > \frac{p_{2}}{4} \\ \frac{m_{2}}{p_{1}}, & p_{1} \leq \frac{p_{2}}{4} \end{cases} \\
&x_{2,2} = \begin{cases} \frac{m_{2}}{p_{2}}, & p_{1} > \frac{p_{2}}{4} \\ 0, & p_{1} \leq \frac{p_{2}}{4} \end{cases} \\
&x_{3,1} = \begin{cases} 0, & p_{1} > \frac{p_{2}}{15.6} \\ \frac{m_{3}}{p_{1}}, & p_{1} \leq \frac{p_{2}}{15.6} \end{cases} \\
&x_{3,2} = \begin{cases} \frac{m_{3}}{p_{2}}, & p_{1} > \frac{p_{2}}{15.6} \\ 0, & p_{1} \leq \frac{p_{2}}{15.6} \end{cases}
\end{aligned}\]

Here are Person 1 and Person 2’s demand curves for Old apartments. As before, Person 1 has a stronger preference for Old apartments (relative to Person 2), and Person 2 has a larger budget.

We sum their demand curves to get aggregate demand.

Now let’s see the demand curves for New apartments. In this case, Person 3 is much richer and has a very strong preference for New apartments.

But they will move in later, so the aggregate demand curve does not include them (yet).

Here’s the initial equilibrium for Old apartments. Supply intersects demand on the segment where only Person 1 buys. The price is \(P_{Old} = 0.8\). As before, the supply curve is vertical, representing a fixed stock of Old apartments.

And here’s the initial equilibrium for New apartments. Supply intersects demand on the segment where only Person 2 buys, and the price is \(P_{New} = 3\). So we have Person 1 buying Old and Person 2 buying New.

Next, the rich Person 3 moves in. This shifts the demand curve for new apartments up (from D1 to D2a), resulting in a new equilibrium at \(P_{New} = 10\). But now supply intersects demand on the segment where only Person 3 buys, so Person 2 has been priced out and will downgrade to an Old apartment.

The increased price of New apartments makes Old apartments more attractive, so demand for Old apartments shifts up. Now supply intersects demand on the segment where both Person 1 and Person 2 buy, and the price rises from 0.8 to 2.3. The stock of 13 Old units is now shared by both Person 1 and 2. This is the demand cascade: rich people moving in and increasing competition in the market for new apartments results in higher prices of old apartments, making housing less affordable for the poor. (This is also known as up-filtering, where for a given home, poorer residents are replaced by richer residents.) Here I’ve shown how demand cascades over two steps; in reality, demand would flow down dozens of steps, representing different levels of housing quality.

To finish up the new equilibrium, we account for the effect of the higher price(Old) on the New market. Higher prices for Old apartments make New apartments more attractive, so demand shifts up (from D2a to D2b), but this doesn’t change the equilibrium.

Now let’s use a yuppie fishtank to reverse the effects of the demand cascade. We’ll increase the supply of new apartments enough to offset the increased demand from Person 3. Supply shifts from S1 to S2, and the price falls from 10 back to 3. Now supply intersects demand on the segment where both Person 2 and 3 buy, so Person 2 has upgraded from Old to New (just as in the vacancy chain).

Since Person 2 upgrades, demand for Old apartments goes back to the original level, and the price falls to 0.8. This is the yuppie fishtank: by building new apartments, we absorb the demand of rich newcomers and prevent prices of old housing from rising.

Tidying up, the fall in price(Old) shifts demand for New apartments down, but the equilibrium is unchanged.
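The three stages (initial equilibrium, demand cascade, yuppie fishtank) can be checked with a short sketch. One assumption to flag: I take the initial New supply curve to be 20/21 + (40/21)p, the line that passes exactly through the two stated equilibria at prices 3 and 10; the shifted curve 20/21 + (1520/63)p is as given in the Math section. Exact Old prices are 10/13 and 30/13 (0.8 and 2.3 after rounding). All names are my own.

```python
from fractions import Fraction as F

# (a, b, income); 15.6 is written as 78/5 to keep the arithmetic exact
LOCALS = {1: (9, 10, 10), 2: (1, 4, 20)}
NEWCOMER = {3: (1, F(78, 5), 200)}

def market(people, p_old, p_new):
    """Return (Old buyers, Old quantity, New quantity) at the given prices."""
    old = [i for i, (a, b, m) in people.items() if p_old * b <= a * p_new]
    q_old = sum(F(people[i][2]) for i in old) / p_old
    q_new = sum(F(m) for i, (a, b, m) in people.items() if i not in old) / p_new
    return old, q_old, q_new

def supply_new(p_new, slope):
    return F(20, 21) + slope * p_new

S1, S2 = F(40, 21), F(1520, 63)   # S1 slope is an assumption, see above
low, high = (F(10, 13), F(3)), (F(30, 13), F(10))

steps = [
    ("initial", LOCALS, low, S1),
    ("Person 3 arrives", {**LOCALS, **NEWCOMER}, high, S1),
    ("fishtank", {**LOCALS, **NEWCOMER}, low, S2),
]
for label, people, (p_old, p_new), slope in steps:
    old, q_old, q_new = market(people, p_old, p_new)
    ok = q_old == 13 and q_new == supply_new(p_new, slope)
    print(label, "| Old buyers:", old, "| clears:", ok)
```

Person 2 appears among the Old buyers only in the middle stage, which is the demand cascade; the extra supply in the last stage moves them back to New.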

To conclude, note that vacancy chains and yuppie fishtanks are closely related. A vacancy chain allows a local resident to upgrade, reducing demand for old housing. A yuppie fishtank absorbs demand from a rich newcomer, preventing a demand increase for old housing. And the reversal tests work as expected. Reducing the supply of new apartments (say, a new building burns down) causes a demand cascade, while reducing demand for new apartments (say, deporting yuppies) creates a vacancy chain.

See here and here for code to produce the graphs.

Notes on the new measles literature

2024-02-04T17:00:00+00:00

There are a handful of papers applying the ‘disease burden’ method to study the long-term effects of the measles vaccine. This method uses cross-sectional pre-treatment disease incidence interacted with a time series variable for vaccine access. This works for diseases like hookworm and malaria, which have geographic variation in climatic suitability for the parasites that cause disease. But, as I discuss in my comment on Atwood (2022), measles is extremely contagious, so everyone gets infected, and there is no long-run geographic variation in incidence. The disease burden method cannot be applied straightforwardly to measles. It would be enlightening to see a big picture analysis of which diseases this method can be used on.

Atwood (2022) studies the long-term effects of the measles vaccine on employment and income. Barteska et al. (2023) uses a similar approach for the effects on education. Both papers find positive effects, which makes a consistent story about the mechanism: contracting measles causes children to miss school, which reduces human capital, which reduces adult income. So the vaccine increases education, which increases adult income and employment.

One difference between these papers is that Atwood uses 1964 as the treatment year (the vaccine was introduced in 1963, so this should be 1963), while Barteska et al. emphasize that vaccine takeup was low until the 1967-68 immunization campaign. So in a sense, these papers are contradictory: if Atwood finds a treatment effect in 1964, then there are pre-trends in Barteska et al.’s specification. Atwood calculates measles incidence over 1952-63, while Barteska et al. use 1963-66, so the treatment groups could be different.

Both Atwood and Barteska et al. have a graph showing the decline in reported measles incidence after the vaccine.

This is taken as evidence that measles incidence declined more in the high-measles states. But this is also consistent with actual incidence being the same across states, and reporting capacity being different. Since measles was driven to roughly 0, each state’s reported incidence falls by ~100%, and states that reported more cases before the vaccine mechanically show a larger absolute decline. So this graph is also consistent with the story that there is no geographic variation in actual measles incidence.
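A toy calculation shows the reporting story. The numbers are made up for illustration (they are not from either paper): two states have identical true incidence but different reporting rates, and reported incidence falls to zero in both after the vaccine.

```python
def reported_decline(true_incidence, reporting_rate):
    """Return (absolute decline, proportional decline) in reported incidence
    when reported cases fall to zero."""
    before = true_incidence * reporting_rate
    after = 0.0
    return before - after, (before - after) / before

TRUE_INCIDENCE = 1000                     # hypothetical cases per 100,000
REPORTING = {"A": 0.20, "B": 0.05}        # hypothetical reporting rates

for state, rate in REPORTING.items():
    print(state, reported_decline(TRUE_INCIDENCE, rate))
```

State A looks like a 'high-measles' state with a decline of 200 reported cases versus 50 for State B, even though true incidence is the same and both fall by 100%.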

In my comment on Atwood (2022), I ran an event study and found post-vaccine trends: the treatment effect is increasing for cohorts born after the vaccine, which doesn’t make sense, since there’s no variation in the treatment dose (vaccine access). Barteska et al. (2023) does run an event study, and doesn’t find the same post-vaccine trends. My event study, following Atwood, used a 16-year treatment window over 1949-1964, with comparison windows 1932-1948 and 1965-1980. Barteska et al. use an 8-year window over 1959-1966, with comparison windows 1950-1958 and 1967-1975. Note that in the Barteska et al. event study, the coefficients start increasing 8 years before the immunization campaign. In my event study for Atwood, the coefficients start increasing 16 years before the vaccine. If Barteska et al. are right that the vaccine was targeted towards younger children, then the treatment effect for ages 9-16 (at vaccine introduction) is not consistent with a vaccine effect.

What treatment variation is Barteska et al. using, if there’s no long-run geographic variation in actual incidence? Since they use a shorter window for calculating measles incidence (1963-66), it’s possible they are measuring differences in short-run incidence, based on differences in epidemic cycle timing. In contrast, Atwood calculates average measles incidence over 12 years, which would average out differences in cycle timing. So Barteska et al. could be capturing variation in the susceptible population; i.e., children who avoid measles and the negative effects of disease, and get vaccine-induced immunity instead of virus-induced immunity.

Chuard et al. (2022) extends this reasoning to explicitly capture variation in the susceptible population. They use an epidemiological SIR model to directly measure geographic variation in actual cases before the vaccine. This method should work for evaluating the long-term effects of the measles vaccine, since cohorts with low incidence get vaccinated and avoid disease, increasing their education and adult income. But they find no effects, while a model using pre-vaccine measles mortality finds effects similar to Atwood. They note that Atwood is using a different source of variation compared to the epidemiological model, but it’s still not clear exactly what that is: what is the difference between High- and Low-measles states?

Given the similarity between their results and Atwood’s, they conclude that Atwood’s variation (reported incidence) is a proxy for disease severity. But this treatment variation is ambiguous, because disease severity can be affected by worse viral outbreaks (e.g., getting a worse case of measles from a higher viral dose), initial health levels (the same viral dose causes worse disease in sick people), or health infrastructure (the same viral dose causes worse disease when health care is poor). The concept of ‘disease burden’ doesn’t seem to apply for a universal disease like measles. They also don’t discuss differences in health infrastructure leading to different reporting rates.

Moreover, if the results using mortality and incidence are similar, then presumably the pre-vaccine state averages should be correlated. They do not report this correlation. But Figure 4 shows share ever-infected and mortality for a sample of five states, and the correlation is negative:

WA has the highest share ever-infected and the lowest mortality, while AR has the lowest share ever-infected and the highest mortality. Also, if incidence is a proxy for disease severity, then it should be correlated with worse health outcomes.

Chuard et al. also make a few odd choices in research design. Like Atwood, they don’t run an event study to test for pre- or post-trends. They say (footnote 5) that they calculate birthyear using age and survey year, even though IPUMS has a birthyear variable. They also aggregate the data to the state-of-birth X year-of-birth X age level, instead of using the micro data and clustering standard errors.

There are several other papers on measles vaccines. Atwood and Pearlman (2023) does the same exercise as Atwood (2022), but for the 1973 vaccine in Mexico. This also faces the problem of long-run average incidence being ambiguous treatment variation. Noghanibehambari (2023) copies Atwood (2022), with birth outcomes as the dependent variable. Berg et al. (2023) studies the UK measles vaccine using the same identification strategy as Atwood, and finds no average effect on height or education.

Summary: replication of Moretti (2021)

2024-01-22T17:00:00+00:00

Moretti (2021) is about agglomeration effects in innovation: do inventors patent more when they’re around other inventors? In other words, does the size of tech clusters cause patenting? This question is relevant for housing policy, because high housing costs prevent inventors from congregating in tech clusters. So if agglomeration effects are large, then constraints on housing supply are hampering overall technological progress.

Moretti finds a positive correlation between the number of patents an inventor has and the size of the tech cluster they’re in. But since correlation is not causation, Moretti uses two other techniques to establish a causal link: an event study and an instrumental variables strategy. In my replication, I find that both supporting results have coding errors, and correcting the errors overturns the results. Hence, we’re left wondering whether the main finding is causal or not.

One source of confounding is selection bias, where promising young inventors select into large tech clusters. This would generate a correlation between patents and cluster size, but not from size causing patenting. To test this, Moretti uses an event study, where the event is inventors moving to a different city. By using cluster size before and after the move, we can see whether cluster size affects patenting. And if there is selection, we should see an increase in patenting in the years before an inventor moves.

Moretti finds a big increase in patents in the year an inventor moves, and no sign of selection. But there’s a coding error. The event study estimates one coefficient per year, by interacting the treatment variable with a year indicator. But Moretti did not do this interaction for the year of the move. So β0, the corresponding coefficient, is estimated using data from all years, instead of capturing the effect only in the year of the move. When I include the proper interaction, the big effect goes away. So the event study does not even provide evidence for agglomeration effects, let alone test for selection bias.
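To illustrate the kind of error described here, the sketch below builds event-study regressors for a single inventor-year. This is not Moretti's code; the variable names (`event_time`, `delta_size`) and the window are hypothetical. In the correct version, each regressor is the treatment variable times an indicator for that event year. In the buggy version, the year-0 regressor is the treatment variable with no indicator, so it is nonzero in every year.

```python
def event_time_interactions(event_time, delta_size, window=(-2, 2)):
    """One regressor per event year k: 1[event_time == k] * delta_size."""
    lo, hi = window
    return {k: (delta_size if event_time == k else 0.0) for k in range(lo, hi + 1)}

def buggy_interactions(event_time, delta_size, window=(-2, 2)):
    """Same as above, but the year-0 term is left uninteracted."""
    cols = event_time_interactions(event_time, delta_size, window)
    cols[0] = delta_size   # nonzero in every year, not just the year of the move
    return cols

# An observation two years after the move, with a hypothetical size change of 1.5
print(event_time_interactions(2, 1.5))  # only k=2 is nonzero
print(buggy_interactions(2, 1.5))       # k=0 and k=2 are both nonzero
```

With the uninteracted term, the year-0 coefficient is identified from all years of data rather than only the year of the move.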

The other source of confounding is omitted variable bias driving both patenting and cluster size. For example, a city subsidizing biotech firms would increase both biotech patents and the size of the local biotech cluster. To address this, Moretti uses an instrumental variables strategy. The idea is to use the number of inventors in other cities as a proxy (or ‘instrument’) for your own cluster size, to avoid bias from factors like local subsidies.

When constructing this proxy, Moretti calculates the change over time in other-city cluster size. That is, we do this subtraction: “other-city cluster size this year” minus “other-city cluster size last year”. However, there’s another coding error. Moretti did not sort the data by city, so this subtraction is taken across different cities, which doesn’t make sense. The code is mixing up cities, doing “city A’s other-city cluster size this year” minus “city B’s other-city cluster size last year”. Correcting this error makes the results go away. As with the event study, the instrumental variable strategy does not provide evidence against confounding.
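Here is a toy example of how an unsorted lag goes wrong. The data and function names are invented for illustration and are not taken from the replication files. Differencing each row against the previous row mixes cities, while grouping by city first gives the intended year-over-year change.

```python
from collections import defaultdict

# (city, year, other_city_size) rows in arbitrary file order; values are made up
rows = [
    ("A", 2000, 100), ("B", 2000, 500),
    ("A", 2001, 110), ("B", 2001, 520),
]

def naive_diffs(rows):
    """Difference each row against the previous row, ignoring city."""
    return [(c, y, s - rows[i - 1][2]) for i, (c, y, s) in enumerate(rows) if i > 0]

def within_city_diffs(rows):
    """Difference each row against the same city's previous observation."""
    by_city = defaultdict(list)
    for city, year, size in rows:
        by_city[city].append((year, size))
    out = {}
    for city, obs in by_city.items():
        obs.sort()  # order by year within city (assumes consecutive years)
        for (y0, s0), (y1, s1) in zip(obs, obs[1:]):
            out[(city, y1)] = s1 - s0
    return out

print(naive_diffs(rows))        # [('B', 2000, 400), ('A', 2001, -390), ('B', 2001, 410)]
print(within_city_diffs(rows))  # {('A', 2001): 10, ('B', 2001): 20}
```

The naive version reports a change of -390 for city A in 2001, which is city A's value minus city B's value from the previous row.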

So there is a positive correlation between patenting and cluster size, but it’s unclear whether we should interpret it as a causal effect.

Can we detect the effects of racial violence on patenting? Replicating Cook (2014)

2022-04-03T19:00:00+00:00

A year ago, I wrote a short post looking at the data in Cook (2014) (sci-hub) (replication files) on the effect of racial violence on African American patents over 1870-1940. I discovered that the state-level panel data was strikingly imbalanced. With Lisa Cook in the news for being nominated to the Federal Reserve Board of Governors, I decided to revisit the paper more thoroughly. I find that the main time series result is not robust, and provide evidence that the panel data results are too noisy to be trusted.

Time series regressions

Cook has two measures of patents per year: (1) using the year the patent was applied for, and (2) using the year the patent was granted. In the paper, Figure 1 reports Black (and white) patents per million using grant year, while Figure 2 shows Black patents per million using application year. Comparing the two graphs, we immediately see that the scale differs by a factor of about 10. Here I merge the two datasets and plot the application-year and grant-year variables on the same graph.

There is a huge discrepancy between the two patent variables. Cook collected data on 726 patents over 1870-1940, but the average by grant-year is 0.16, while the average by application-year is 1.22.¹ ²

Cook’s replication data does not include the raw patent or population variables, so we can’t say for sure what’s going on here. But the average Black population (see Table 1) was roughly 10 million, and 0.16 grant-year patents/M * 10M * 71 years = 114, far fewer than the 726 patents recorded. In contrast, 1.22 application-year patents/M * 10M * 71 years = 866, which is in the ballpark of 726. Speculating, one possible explanation is that Cook calculated grant-year patents using the white population (average 75 million) in the denominator, giving 0.16 * 75 * 71 = 852 patents.³ Hopefully Cook will publish the raw data and we can resolve this.
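The back-of-envelope calculation can be written out as follows. The inputs are the rounded figures quoted above (patents per million, average population in millions, 71 years); the function name is my own.

```python
def implied_patents(rate_per_million, population_millions, years=71):
    """Total patents implied by an average annual rate per million people."""
    return rate_per_million * population_millions * years

print(round(implied_patents(0.16, 10)))  # grant-year rate, Black population: 114
print(round(implied_patents(1.22, 10)))  # application-year rate, Black population: 866
print(round(implied_patents(0.16, 75)))  # grant-year rate, white population: 852
```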

In any case, the grant-year patent variable seems clearly flawed, while the application-year variable looks correct. Since the Table 6 results use the grant-year patent variable, we should run a robustness check using the application-year variable.

Table 6 uses time series data to estimate the effect of lynchings, riots, and segregation laws on patents. Column 1 uses race-year panel data, where the lynching and patent variables vary by race (but the riot and segregation law variables vary only by time). Columns 2 and 3 run time series regressions separately by race, allowing us to estimate differential effects of racial violence on patenting.

I am able to reproduce Table 6⁴, using grant-year patents:

As noted in the paper, lynchings and riots have negative effects on Black patenting, and the 1921 dummy has a large negative effect, corresponding to the Tulsa Race Riot.

For the robustness check, I redo Table 6 using application-year patents instead of grant-year patents. This specification actually seems more appropriate, since Cook’s mechanism is that racial violence deters innovation by Black inventors; so racial violence would first impact patent applications, and with a lag impact granted patents. So the effects should be stronger using the application-year variable.

The application-year variable is missing in 1940, which reduces the sample size for the robustness check by 1. To make a pure comparison, I re-run the grant-year regressions dropping 1940, and get similar results (see footnote⁵). Next, I run Table 6 using application-year patents:

The results are dramatically different: the negative effect of lynchings and riots disappears, as does the negative effect in 1921. If the grant-year patent variable is incorrect and the application-year variable is correct, then the paper’s main result is wrong.

Panel data regressions

In Tables 7 and 8, Cook uses state-level panel data over 1870-1940 to run regressions of patents on lynching rates, riots, and segregation laws. However, we can immediately see a problem: there are 49 states and 71 years in the data, but only N=430 observations. A complete, balanced panel would have 3210 observations, as the number of states grows from 38 in 1870 to 49 in 1940 (including DC; see code for details). So Cook is using 430/3210 = 13% of the full sample.
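One way to arrive at the 3210 figure is sketched below. This is my own reconstruction, not the code referenced above: it assumes 37 states plus DC in 1870 and counts each later state from its year of admission onward.

```python
# Admission years for states that joined after 1870; the 37 earlier states
# plus DC are assumed present in every year.
ADMITTED = {
    "CO": 1876, "ND": 1889, "SD": 1889, "MT": 1889, "WA": 1889,
    "ID": 1890, "WY": 1890, "UT": 1896, "OK": 1907, "NM": 1912, "AZ": 1912,
}
BASE_UNITS = 38                 # 37 states + DC in 1870
YEARS = range(1870, 1941)       # 71 years

def units_in(year):
    """Number of states (including DC) that exist in a given year."""
    return BASE_UNITS + sum(1 for y in ADMITTED.values() if y <= year)

full_panel = sum(units_in(y) for y in YEARS)
print(units_in(1870), units_in(1940), full_panel)  # 38 49 3210
print(round(430 / full_panel, 3))                  # 0.134
```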

And the pattern of missing data is not random. Below I plot the number of observations by state and year. First, we see that the majority of states have fewer than 10 observations over 71 years.

Next, the sample size is increasing up to 1900 before dropping off and rising again starting in 1920.

Decomposing by region, we see that the Midwest and Mid-Atlantic regions are relatively overrepresented, while the South and West are relatively underrepresented.⁶

Moreover, consider how this imbalanced panel compares to the full time series. There are 726 patents in the time series, and 702 in the panel data (for 97% coverage). But the violence variables are drastically under-reported: there are 35 riots in the time series data, but only 5 in the panel data (14%). Similarly, there are 290 new segregation laws in the time series data, but only 19 in the panel data (7%).⁷ (The same problem applies with lynchings, but the replication files don’t have count data, so we can’t quantify it.)

What explains the missing data? It appears that Cook dropped any state-year observation that had a variable with a missing value. The resulting dataset has no variables with missing values, but a lot of missing state-year observations, and hence a severely imbalanced panel.

With this low level of data coverage, I’m skeptical of the panel data results in Tables 7 and 8. It’s possible that these results are unbiased, and would remain stable as the missing data was filled in (through a law of large numbers argument). Especially considering the prior plausibility that racial violence and patents are negatively correlated, we should place some weight on this.

But it’s also possible that they’re false positives. And statistically significant results are easy to get when you’re working with small effects and noisy data. For example, let’s check for heterogeneous effects by region; a robust result should be stable across different cuts of the data. From Table 7, I run the Column 1 regression separately for each region:

The lynchings estimate for the South (-0.075) is similar to the average effect from the full sample (-0.058). But there’s no estimate at all for the Midwest and Northeast, since there were zero lynchings in those regions. The estimate for the Mid-Atlantic is huge with two stars, 200x bigger than the South estimate. But this is almost certainly a Type M error (an overestimate of the true effect), as the lynching rate for the Mid-Atlantic is 3% of the average.

With only 5 riots in the dataset, it’s no surprise that there’s no estimate for the Midwest, Northeast, or West regions (which had zero riots in this data). The effect size is somewhat similar for the South and Mid-Atlantic, perhaps indicating a more homogeneous effect of riots on patenting.

For segregation laws, the Table 7, Column 1 estimate is -0.1. The effects for the South and West are in the ballpark, at -0.19 and -0.16. But the effects for the Midwest and Mid-Atlantic are positive, massive, and have three stars! Statistical significance doesn’t mean much here, however, because the data is noisy and these estimates rest on almost no variation: there are 19.33 new segregation laws in the data, with 17 in the South, 1 in the Midwest, 1 in the West, and 0.33 in the Mid-Atlantic (presumably a data error).

Another way to assess noisy data is to decompose the patenting variable by economic category. In fact, Cook does this in Table 8, running separate regressions for assigned patents (e.g., the patentee sells their patent to a firm), mechanical patents, and electrical patents (note that mechanical and electrical patents can be assigned or not).⁸

(For comparison, the Table 7, column 1 estimates are: lynchings -0.058***, riots -0.429***, segregation laws -0.1.) The lynching estimates are much smaller than in Table 7, and none are statistically significant.⁹ The riot estimates have the same sign and similar magnitude only for assigned patents. For segregation laws, the coefficient has the opposite sign for assigned, double the magnitude for mechanical, and half the magnitude for electrical patents. Overall, there is strong heterogeneity in the effects of racial violence, and Cook does not provide a theory to predict the pattern of varying estimates. This heterogeneity is more consistent with noise than a clear causal effect.

My takeaway from these subsample results is that the missing data is causing low statistical power, and we’re seeing Type S errors (estimates with the wrong sign) and Type M errors (estimates with exaggerated magnitude). Hence, we shouldn’t place much weight on the correlations in Tables 7 and 8, since they would probably change considerably if we had a complete and balanced panel.

Conclusion

To summarize, the main time series result in Cook (2014) is not robust to using an alternative patent variable, and the panel data results are questionable because of missing data. Nonetheless, the conclusions remain plausible, because they have a high prior probability. Lynchings, race riots, and segregation laws were a severe problem, and it would be astonishing if they didn’t have pervasive effects on the lives of Black people.

But with the data available, it’s unrealistic to think we can statistically detect causal effects. Credible causal inference would require more complete data as well as an identification strategy more convincing than a panel regression (not to mention modelling temporal and spatial spillovers). Descriptive analysis is the most that this dataset can support, and is a valuable contribution in itself, along with the rich qualitative evidence in the paper.

Cook deserves credit for pursuing this important research question and putting in years of effort to collect the patent data. And in fact, recent research, no doubt inspired by Cook, does find that segregation (of the federal government by Woodrow Wilson) and riots (specifically, the Tulsa Race Massacre) had substantial negative effects on Black Americans. I hope that more researchers continue in Cook’s footsteps and bring attention to the consequences of America’s racist history.

In terms of computational reproducibility, Cook’s code has several problems:

- The code for Figures 1, 2, and 3 is in Stata graph editor format, which cannot be run from a do-file.
- Figure 1 uses the variable patgrntpc, patents by grant-year per capita, but the graph refers to patents per million. Similarly, Table 5 reports ‘Patents, per million’, but the code uses patgrntpc. The variable should be named ‘patents by grant-year per million’.
- There’s no code for Table 4.
- Equation 1 and Table 6 refer to patents per capita, but the variable in the code, patgrntpc, has mean values of 0.16 for Blacks and 425 for whites; this is patents per million, not per capita.
- The code for Table 6 refers to a variable LMRindex, but the dataset contains DLMRindex.
- Section 3.2 mentions that the state-level regressions use data over 1882-1940, but the code uses data over 1870-1940.
- The code for Table 7 includes a command to collapse the data down to the state-year level, but the data is already in a state-year panel.
- The code for Table 7 includes a variable, estbnumpc, for the number of firms per capita, but it is not included in the dataset.
- The code for Column 1 in Table 7 includes the ‘number of firms’ variable, but the paper only includes it in columns 3-6.
- In the notes to Tables 7 and 8, Cook writes that “Standard errors robust to clustering on state and year are in parentheses.” However, the code only clusters by state, using vce(cl stateno).
- The code for Table 8 has an error in its clustering command, using the incorrect syntax vce(stateno) instead of the correct vce(cl stateno).
- The code for Table 8 does not exactly reproduce the results in the paper. When I run the code, I get N=429, while Cook’s regressions have N=428. It’s possible that Cook is controlling for firms per capita, as in Table 7, but this variable is not included in the code, and is not mentioned in the table.
- The code for Table 9 does not reproduce the results in the paper.

There are also a few data errors:

- State 9 has the South dummy equal to 1 for all years, but also has the Mid-Atlantic dummy equal to 0.33 in 1888.
- State 14 has the Midwest dummy equal to 1 in all years except 1886, when both it and the South dummy are 0.5.
- State 31 in 1909 has a value of 0.333333 for ‘number of new segregation laws’, which should be integer-valued.

Footnotes

See here for code.

Cook notes that “a comparison of a sample of similar patents obtained by white and African American inventors shows that the time between patent application and grant for the two groups was not significantly different, 1.4 years in each case.” (p.226, fn. 15) Also, there is no application-year patent data for 1870-72. ↩
This discrepancy becomes even more puzzling when we compare the paper and the code:
- Figure 1 reports patents per million by grant year, but uses a variable named patgrntpc with the label ‘Patents by grant year’. The ‘pc’ would seem to indicate patents per capita.
- Figure 2 reports patents per million by application year, using a variable pat_appyear_pm, with ‘pm’ corresponding to ‘per million’.
- Table 5 presents descriptive statistics, with a ‘Patents, per million’ variable with a mean of 0.16, but the code uses patgrntpc.
- Equation 1 and Table 6 both refer to patents per capita. The code for Table 6 uses the logarithm of patgrntpc.
Although the variable patgrntpc would seem to be ‘Patents by grant year, per capita’, this can’t be true: the average value is 0.16 for Blacks, and 425 for whites. These values are clearly measured per million. So the variable must be misnamed, and actually represents patents per million, as described in Figure 1 and Table 5. This means that Equation 1 and Table 6 are mistaken: the dependent variable is log patents per million, and not log patents per capita. ↩
Another explanation is that the application-year variable counts all patents that were applied for, including patents that were denied. This is not consistent with the text, where Cook only mentions 726 granted patents. In footnote 15, Cook writes that analyzing “[a]pplication rejection rates […] is beyond the scope of the current paper.” Moreover, even if true, this explanation doesn’t account for why the grant-year variable does not add up to 726. ↩
Cook’s Table 6 incorrectly shows the lynching estimates in Columns 2 and 3 as having p-values less than 0.05. ↩
Note that N = 110 and 55 instead of 112 and 56. ↩
Number of states by region: South 15, Midwest 12, Northeast 6, West 12, Mid-Atlantic 7. Eleven states enter after 1870, and hence have fewer than 71 years in the complete panel. See code for details. ↩
The actual number is 19.33. Somehow, one state-year observation has a value of 0.33 for the number of new segregation laws. ↩
In Column 4, Cook runs a regression using Southern patents as the dependent variable. That is, while still using the full panel, the patent variable is set to 0 for non-Southern states. This is an incorrect approach for estimating heterogeneous effects. A correct approach would restrict the sample to Southern states, as I did above, or use the full sample and interact the violence variables with a South dummy. ↩
Cook mentions in footnote 49 that lynchings have a negative effect on ‘miscellaneous patents’, but this is not reported in the table, and the variable is not included in the dataset. ↩

Did medical marijuana legalization reduce crime? A replication exercise

2021-03-19T20:00:00+00:00

Summary

In this post I replicate the paper “Is Legal Pot Crippling Mexican Drug Trafficking Organisations? The Effect of Medical Marijuana Laws on US Crime” by Gavrilova, Kamada, and Zoutman (Economic Journal, 2019; replication files).

I find three main problems in the paper:

- it uses weighting when its own justification doesn’t apply
- it uses a level dependent variable, and isn’t robust to log-level or Poisson models
- it does not test for pretrends in the disaggregated crime variables, and two alternative event studies show that the results are driven by differential trends

Introduction

This paper studies the effect of medical marijuana legalization on crime in the U.S., finding that legalization decreases crime in states that border Mexico. The paper uses a triple-diff method, essentially doing a diff-in-diff for the effect of legalization on crime, then adding an interaction for being a border state.

The paper uses county level data over 1994-2012, with treatment (medical marijuana legalization, MML) occurring at the state level. The authors use “violent crimes” as the outcome variable in their main analysis, defined as the sum of homicides, robberies, and assaults, where each is measured as a rate per 100,000 population. They also perform separate analyses for each of the three crime categories.

The basic triple-diff regression is:

\[y_{cst} = \beta^{border} D_{st}B_{s} + \beta^{inland} D_{st} (1-B_{s}) + \gamma_{c} + \gamma_{t} + \varepsilon_{cst}.\]

Here \(y_{cst}\) is the outcome in county \(c\) in state \(s\) in year \(t\); \(D_{st}\) is an indicator for having enacted MML by year \(t\); \(B_{s}\) is an indicator for bordering Mexico; \(\gamma_{c}\) are county fixed effects; \(\gamma_{t}\) are year fixed effects. The full model also includes time-varying controls, border-year fixed effects, and state-specific linear time trends. The outcome is crime rates per 100,000 population, measured in levels, so the regression coefficients will not have a percentage interpretation; we’ll come back to this later.

This isn’t a standard triple-diff. In this model, \(\beta^{border}\) is capturing the absolute effect of MML in border states, and not the differential effect relative to inland states. To see this, compare to:

\[y_{cst} = \beta^{DD} D_{st} + \beta^{DDD} D_{st} \times B_{s} + \gamma_{c} + \gamma_{t} + \varepsilon_{cst}.\]

Here, \(\beta^{DD}\) represents the effect of MML in inland states, and \(\beta^{DDD}\) is the differential effect in border states (relative to the effect in inland states). That is, \(\beta^{inland} = \beta^{DD}\) and \(\beta^{border} = \beta^{DD} + \beta^{DDD}\). This is perhaps an issue of taste. What I would primarily want to know is whether MML had a larger effect in border states relative to inland states; the absolute effect in border states is secondary. Hence, I will report results from the second model (although the differences are small, because the inland effect is small: \(\beta^{inland} = \beta^{DD} \approx 0\)).
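
The mapping between the two parameterizations can be checked by expanding the treatment terms of the paper’s model, using only the symbols already defined:

\[\beta^{border} D_{st}B_{s} + \beta^{inland} D_{st}(1-B_{s}) = \beta^{inland} D_{st} + (\beta^{border} - \beta^{inland}) D_{st} \times B_{s}.\]

Matching terms with the second model gives \(\beta^{DD} = \beta^{inland}\) and \(\beta^{DDD} = \beta^{border} - \beta^{inland}\), so the two models fit the data identically and differ only in which quantity gets its own coefficient and standard error.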

The authors find that, on average, MML reduces violent crimes by 35 crimes per 100,000 population, but the estimate is not statistically significant (the standard error is 22). Then, zooming in on the border states, they find a significant reduction of 108 crimes per 100,000 (and a nonsignificant increase of 2.8 in inland states). There are three border states that legalized medical marijuana: California, New Mexico, and Arizona. (Texas is the remaining border state.) Splitting up the effect by treated border state, we have a reduction of 34 in Arizona, 144 in California, and 58 in New Mexico.

I don’t really like this “zoom in on the significance” style of research. We can always find significance if we run enough interactions. And as we zoom in on subgroups, we lose external validity: can we make meaningful predictions for a state or country that was legalizing marijuana and didn’t border on Mexico? Moreover, the identifying assumptions become harder to believe. When n=3, it’s more plausible that differential shocks are driving the result (compared to n=20, say). That is, it could be that crime was already decreasing in the three border states when they passed MML, and the negative correlation between MML and crime is coincidental.

Ok, let’s get into the issues.

Weighting

The authors use weighted least squares (weighting by population) for their main results. They justify weighting by performing a Breusch-Pagan test, and finding a positive correlation between the squared residuals and inverse population. This implies larger residuals in smaller counties. In other words, there is heteroskedasticity, and weighting will decrease the size of the standard errors, i.e., increase precision. However, in Appendix Table D7, you’ll note that while they get a positive correlation when using homicides and assaults as the dependent variable, this coefficient is negative and nonsignificant for robberies. So by the Breusch-Pagan test, the robbery results actually should not be weighted. And in Table D9, the unweighted robbery estimate has smaller standard errors than the weighted one: weighting is reducing precision. And yet, the paper still uses weighting when estimating the effect of MML on robberies (in Table 4). We’ll see below that this makes a big difference for the effect size.
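The logic of that check can be sketched in a few lines of Python: regress the squared residuals on inverse population and look at the sign of the slope. This is only the auxiliary regression behind a Breusch-Pagan-style test, not the full test statistic, and the populations and residuals below are made-up numbers rather than anything from the paper.

```python
def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    return cov_xy / var_x

# Hypothetical county populations and regression residuals (not from the paper).
population = [1_000, 5_000, 20_000, 100_000, 500_000]
residuals = [30.0, -12.0, 5.0, -2.0, 0.5]

inverse_pop = [1 / p for p in population]
squared_resid = [e ** 2 for e in residuals]

# A positive slope means residual variance is larger in smaller counties,
# which is the pattern that would motivate population weighting.
slope = ols_slope(inverse_pop, squared_resid)
print(slope > 0)
```

A negative or insignificant slope, as reported for robberies, removes that motivation.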

Modelling the dependent variable

The authors estimate the effect of MML on crime using a level dependent variable instead of taking the logarithm, which I had thought was standard. In particular, their main results use the aggregate crime rate, which leads to a “level” interpretation: MML reduces the crime rate by \(\hat{\beta}=\) 108 crimes per 100,000 population.

I would have used a log-level regression, taking \(\log(y+1)\) for the dependent variable (adding 1 if there are zeroes in the data), which gives a percentage (or semi-elasticity) interpretation: MML reduces crime by \(100 \times (\exp(\hat{\beta})-1) \%\). The paper doesn’t explain why it doesn’t use a log-level model. This is even more surprising when you see that the authors manually calculate the semi-elasticity (p.19), again without mentioning the log-level approach.
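As a worked example of the percentage interpretation, here is a small Python helper that converts a log-level coefficient into a percent change. The coefficient values are arbitrary illustrations, not estimates from the paper, and with \(\log(y+1)\) the interpretation is only approximate when \(y\) is small.

```python
import math

def pct_effect(beta):
    """Percentage change in y implied by a log-level coefficient."""
    return 100 * (math.exp(beta) - 1)

# Arbitrary illustrative coefficients: the gap between the exact conversion
# and the naive 100*beta reading grows with the size of beta.
for beta in (-0.05, -0.10, -0.50):
    print(f"beta = {beta:+.2f} -> {pct_effect(beta):.1f}% (naive 100*beta = {100 * beta:.0f}%)")
```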

After spending some time looking into this question of logging the dependent variable for skewed, nonnegative data, I’m still pretty confused. It seems the options are: (1) level-level regression, as used in this paper; (2) log-level regression; (3) transforming \(y\) with the inverse hyperbolic sine; and (4) Poisson regression (with robust standard errors, you don’t need to assume mean=variance). But it’s not clear what the “correct” approach is. I’d expect a true result to be robust across multiple approaches, so let’s try that here.

I estimate the triple-diff model using a level-level regression (to directly replicate the paper), a log-level regression, and a Poisson regression. (The inverse hyperbolic sine approach is almost identical to log-level, so I skip it here.) To see how the specification matters, I conduct a specification curve analysis using R’s specr package. Specifically, I run all possible combinations of model elements, either including or excluding covariates, population weights, state-specific linear time trends, and border-year fixed effects. This will allow us to see whether possibly debatable modelling choices, such as state-specific linear trends, are driving the results.
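The grid of specifications can be enumerated mechanically. The sketch below is in Python rather than R, and the labels are hypothetical names chosen for illustration; it is not the specr API, only a picture of what "all possible combinations" means here.

```python
from itertools import product

# Hypothetical labels for the three estimators and four on/off model elements.
models = ["level-level", "log-level", "poisson"]
toggles = ["covariates", "weights", "state_trends", "border_year_fe"]

def build_spec_grid():
    """Every on/off combination of the four model elements, for each model."""
    grid = []
    for model in models:
        for switches in product([False, True], repeat=len(toggles)):
            spec = {"model": model}
            spec.update(dict(zip(toggles, switches)))
            grid.append(spec)
    return grid

grid = build_spec_grid()
per_model = len(grid) // len(models)

# The "full" specification switches every element on.
full_spec = [s for s in grid if s["model"] == "level-level" and all(s[t] for t in toggles)]
print(per_model, len(grid))
```

Four binary choices give 16 specifications per estimator, with exactly one of them being the full specification.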

Here are the homicide results, first in the level-level model (as in the paper). Panel A plots the coefficient estimates in increasing order, while panel B shows the corresponding specification. Each specification has two markers in panel B, one in the upper part indicating the model, and one in the lower part indicating whether all or no covariates are included in the model.¹ For example, the specification with the most negative estimate is ‘trends + weights, no covariates’. In both panels, the x-axis is just counting the number of specifications, and the color scheme is: (red, negative and significant), (grey, insignificant), (blue, positive and significant). The ‘baseline’ specification omits the state-specific trends, border-year fixed effects, and doesn’t weight by population. I’ll be focusing on the full specification, ‘trends + border + weights, all covariates’, which includes state-specific linear trends, border-year fixed effects, and weights by population.

Level-level model: homicides

We can see that the estimate is negative and statistically significant in the full specification, with and without covariates. Most estimates are nonsignificant; these are generally the unweighted models, indicating the importance of population weighting for these results.

Log-level model: homicides

Next, in the log-level model, most estimates are insignificant, including the full specification. Two models even have positive and significant results (in blue). Let’s see the Poisson model:

Poisson model: homicides

Here I use the homicide count (instead of the rate per 100,000 population), though note that the controls include log population and I’m weighting by population. In this case, the estimates are almost exactly zero and nonsignificant in the full specification. So, the homicide results only go through using the level-level regression, and not in the log-level or Poisson models.

For the other dependent variables, I’ll show the graphs in the footnotes. The results for robberies are more robust. The full specification is negative and significant across all three models.² However, the assault results are not robust, with the full specification nonsignificant for both log-level and Poisson regressions.³

This doesn’t look great for the paper. I’d expect real effects to be robust across the three models. I conclude that at best, the paper provides evidence for an effect of MML on robberies in border states, but not on homicides or assaults. And this is assuming the event study graph looks good for pretrends, which I’ll discuss next.

Event study

There are big trends in crime over this period. Crime fell a lot during the 90s, and again after 2007. To show that their results aren’t driven by these trends, the authors present an event study graph in Figure 6, estimating a triple-diff coefficient in each year. Basically, this is estimating the triple diff for each year relative to an omitted period.

The authors estimate their main results using the 1994-2012 sample. For the event study, they also use an extended sample from 1990-2012. The extended sample has issues, because it uses flawed imputed data over 1990-1992, and the year 1993 is missing entirely. Here I will show results from the main sample, 1994-2012.

For their main event study, the authors only include dummies for relative years -2 to 4, and bin all years 5+ in one dummy. This is because California is treated in 1996 and only has two years of pretreatment data, and wouldn’t contribute to any dummies before -2.⁴ But this is a bit of an arbitrary choice. Similarly, Arizona is treated in 2010 and only has two years of post-treatment data, and hence doesn’t contribute to any dummies after +2. So should we include dummies only for [-2,2]?

I think it’s fine to include dummies for [-5,5], with the understanding that some states do not contribute to some estimates. (Specifically, California doesn’t have dummies for -5 to -3, and Arizona doesn’t have dummies for 3 to 5+.) In this setup, the omitted years are <-5, in contrast to the standard approach of omitting relative year -1. (As noted in the last footnote, California has no omitted years, so the software should drop one year.)
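To make the binning concrete, here is a minimal Python sketch of how calendar years map to relative-year dummies under the [-5,5+] scheme. The function name and return convention are assumptions for illustration; the treatment years are the ones stated above.

```python
def relative_year_bin(year, treat_year, lo=-5, hi=5):
    """Map a calendar year to its event-study dummy.

    Returns None for relative years below `lo` (the omitted category),
    a binned label such as "5+" for relative years at or above `hi`,
    and the relative year itself otherwise.
    """
    k = year - treat_year
    if k < lo:
        return None
    if k >= hi:
        return f"{hi}+"
    return k

sample_years = range(1994, 2013)  # main sample, 1994-2012 inclusive
treat_years = {"California": 1996, "Arizona": 2010}  # as stated in the text

bins = {
    state: [relative_year_bin(y, t) for y in sample_years]
    for state, t in treat_years.items()
}
print(bins["California"][:4], bins["Arizona"][-4:])
```

California never falls in the omitted category and starts at relative year -2, while Arizona stops at relative year 2, matching the description above.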

Next I plot my version of their event study graph, using a level dependent variable. Since this is a triple-diff, I include relative year dummies for the treated states, as well as separate relative year dummies for the treated border states. I plot the coefficients on the border-state relative year dummies. While the paper only includes dummies for [-2,5+], I estimate coefficients for [-5,5+].⁵

Event study: violent crimes (binning 5+)

Compared to the event study in the paper, here the coefficients are all negative. For the pretreatment estimates (treatment occurs in period 0), this means level differences between the treated border and inland states. There is also a slight downward trend before the treatment, hinting at differential trends.

In any case, note that this graph is for the aggregated violent crime variable. Where are the event studies for the individual dependent variables? The authors do not show them! This is a major flaw, and I can’t believe that the referees missed it. Even if we found no pretrends in the aggregate variable, there could still be pretrends in the component variables. Let’s take a look ourselves.

Event study: homicides (binning 5+)

First up, using the homicide rate as the dependent variable, we get a big mess. There are big movements in years -3 and -2: relative to the treatment year, homicides were higher three years prior, and lower two years prior. So at least for homicides, it looks like the negative triple-diff estimate could just be picking up noise. Now we know why the authors didn’t include separate event study graphs by dependent variable.

Event study: robberies, unweighted (binning 5+)

For robberies, recall that the Breusch-Pagan test failed to justify weighting, so I do not use weights. Here, it also looks like a negative trend is driving the result: robberies were smoothly decreasing in treated border states before MML was implemented. (See the weighted graph in the footnote.⁶) The common trends assumption for the triple-diff appears to be violated.

Event study: assaults (binning 5+)

Finally, for assaults, the event study actually doesn’t look bad, although the standard errors are large. This is a bit surprising, given that the assault results were not robust across log-level and Poisson models.

Overall, this doesn’t look good for the paper. I think this is an equally defensible event study method, but it nukes their homicide and robbery results.⁷

I’m not a fan of binning in event studies. In Andrew Baker’s simulations, binning periods 5- and 5+ performs badly. In contrast, a fully-saturated model including all relative year dummies (except for relative year -1, which is the omitted year) performs perfectly. So let’s try that here.

By omitting year -1, we’re basically normalizing the above event study graphs around the -1 estimate (but also changing the estimates, since we’re including all other relative year dummies). Hence, the homicide graph has the same patterns, but shifted up. We again find a clear trend in the robbery graph. But now the assault graph also looks to be driven by trends.

Event study: homicides

Event study: robberies

Event study: assaults

Takeaway: now I really doubt that MML had a causal effect on crime.

Synthetic control

To further dig into these trends, I aggregated the data from county- to state-level and performed a synthetic control analysis for each of the three treated border states: California, Arizona, and New Mexico. This aggregation is probably imperfect, and it would be better to start with state-level data, but let’s see what happens. (Running level-level regressions, I still get negative results, with effect sizes similar to the county-level data. See the specification curves in the footnote.⁸)

The idea of synthetic control is to construct an artificial control group for our treated state, so we can evaluate the treatment effect simply by comparing the outcome variable in the treatment and synthetic control states. The synthetic control group is a weighted average of control states, and these weights are chosen to match the treated state on preperiod trends. I use the never-treated states as the donor pool; I’ll report the weights below.
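The weight-fitting step can be illustrated with a deliberately simplified Python sketch: two donor states, where the best weight has a closed form. Real synthetic control uses many donors and a constrained optimizer, so this is an assumption-laden special case, and all numbers are invented.

```python
def two_donor_weight(treated, donor_a, donor_b):
    """Weight on donor_a (in [0, 1]) minimizing the squared pre-period gap
    between the treated series and w*donor_a + (1-w)*donor_b."""
    diff = [a - b for a, b in zip(donor_a, donor_b)]
    gap = [t - b for t, b in zip(treated, donor_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        return 0.5  # donors are identical in the pre-period; any weight fits equally
    w = sum(g * d for g, d in zip(gap, diff)) / denom
    return min(1.0, max(0.0, w))  # clip so both weights stay non-negative

# Hypothetical pre-period robbery rates (illustrative numbers only).
donor_a = [300.0, 280.0, 260.0, 250.0]
donor_b = [100.0, 110.0, 105.0, 95.0]
treated = [0.7 * a + 0.3 * b for a, b in zip(donor_a, donor_b)]

w = two_donor_weight(treated, donor_a, donor_b)
synthetic = [w * a + (1 - w) * b for a, b in zip(donor_a, donor_b)]
print(round(w, 3))
```

Because the toy treated series is built as a 70/30 mix of the donors, the fitted weight recovers that mix and the synthetic series tracks the treated one in the pre-period.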

Here I’ll show the robbery results for the three states (using the level dependent variable), to see what’s happening with that smooth trend. Note that these graphs are plotting the raw outcome variable, so we’re seeing the actual trends in the data.

California’s synthetic control is 68% New York and 28% Minnesota. California’s MML occurs in the middle of the 1990s crime decrease, and it doesn’t look like there’s much of an effect in 1996.

Arizona’s synthetic control is 61% Texas, 24% Florida, and 15% Wyoming. Again, there doesn’t seem to be a treatment effect.

New Mexico’s synthetic control is 51% Mississippi, 21% Louisiana, 18% Texas, and 7% Wyoming. Its MML occurs before a drop in robberies that is partly matched by the synthetic control group.

You can look at the other synthetic control graphs in this footnote.⁹

Overall, I worry that these three states coincidentally legalized medical marijuana when crime was high and falling, and that the triple-diff estimates are just picking up these trends. Based on my analysis here, I don’t believe that medical marijuana legalization reduced crime in the US.

Randomization inference

One final note: the paper calculates a (one-sided) randomization inference p-value of 0.03, and claims that this is evidence for their result being real. However, as I discuss in this post, this claim is false. With large sample sizes, there’s no reason to expect RI and standard p-values to differ, so a significant RI p-value provides no additional evidence.
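For readers unfamiliar with randomization inference, here is a stripped-down Python sketch of a one-sided RI p-value. It uses a difference in means on invented data and enumerates every placebo assignment; the paper’s procedure instead reassigns treatment and re-estimates its regression, so this only shows the general idea.

```python
from itertools import combinations

def diff_in_means(outcomes, treated_idx):
    """Mean outcome of the treated units minus mean outcome of the rest."""
    treated = [outcomes[i] for i in treated_idx]
    control = [y for i, y in enumerate(outcomes) if i not in treated_idx]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ri_p_value(outcomes, treated_idx):
    """One-sided randomization-inference p-value for a negative effect:
    the share of all equally sized placebo assignments whose statistic is
    at least as negative as the observed one."""
    observed = diff_in_means(outcomes, treated_idx)
    placebo = [
        diff_in_means(outcomes, combo)
        for combo in combinations(range(len(outcomes)), len(treated_idx))
    ]
    return sum(stat <= observed for stat in placebo) / len(placebo)

# Hypothetical changes in crime for six states; the first three are "treated".
changes = [-10.0, -8.0, -9.0, 1.0, 2.0, 0.0]
p = ri_p_value(changes, (0, 1, 2))
print(p)
```

With six units and three treated there are only 20 possible assignments, so the smallest attainable p-value is 1/20.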

Conclusion

I think it’s plausible that moving marijuana production from the black market to the legal market would reduce crime (at least in the long run). But the effect of medical marijuana legalization on crime is too small to detect in the data.

Footnotes

See here for R code, and here for the original replication files. (For some reason, the replication files aren’t online anymore.)

PS: Table 5 does heterogeneity by type of homicide; I’d be curious to see the event study for each of these outcomes.

The full covariate list is: an indicator for decriminalization, log median income, log population, poverty rate, unemployment rate, and the fraction of males, African Americans, Hispanics, ages 10-19, and ages 20-24. In general, I find that adding controls barely changes the \(R^{2}\), so these variables aren’t adding much beyond the county and year fixed effects. ↩
Robbery results:

Level-level model: robberies

In the level-level model, we see a big difference between the weighted and unweighted results. Clearly, there are heterogeneous treatment effects, with larger effects in the higher-weight states (California, probably). As I noted above, the robbery estimates should not be weighted.

Log-level model: robberies

Poisson model: robberies

↩
Assault results:

Level-level model: assaults

Log-level model: assaults

Poisson model: assaults

↩
One problem with this specification is that California has no omitted years. Every year from 1994-2012 has a dummy variable, which seems like a dummy variable trap (i.e., multicollinearity). Specifically: 1994-2000 are covered by dummies for -2 to 4, and 2001-2012 are covered by the 5+ binned dummy. ↩
Moreover, as noted above, I am estimating the differential effect of MML in border states relative to inland states, while GKZ are estimating the absolute effect. I also drop counties that have the black share of population greater than 100%. It seems the authors were doing some extrapolation that got out of control. ↩
We shouldn’t care about this graph, because weighting is unwarranted.

Event study: robberies, weighted (binning 5+)

↩
It’s depressing that event studies can differ so much based on slight model changes. I have a feeling that a lot of diff-in-diffs from the past twenty years are not going to survive replication. ↩
Specification curve for state-level results:

Level-level model: homicides

Level-level model: robberies

Level-level model: assaults

↩
Synthetic control results for homicides and assaults. ↩

How I use regression weights to replicate research

2021-02-25T20:00:00+00:00

One of the main tools I use for replication is regression weights. These show the weight that each observation contributes to a regression coefficient. Suppose we’re regressing \(y\) on \(X_{1}\) and \(X_{2}\), with corresponding coefficients \(\beta_{1}\) and \(\beta_{2}\). Then, the regression weights for \(\beta_{1}\) are the residuals from regressing \(X_{1}\) on \(X_{2}\), which represent the variation in \(X_{1}\) remaining after controlling for \(X_{2}\). From Frisch-Waugh-Lovell, we know that \(\beta_{1}\) can be estimated by regressing \(y\) on these residuals. Hence, the regression weights show the actual variation used in the estimate. When replicating a paper, looking at regression weights is a handy way to see what’s actually driving the result.
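The Frisch-Waugh-Lovell logic can be checked on a toy example. The Python sketch below uses made-up data constructed so that the true coefficient on \(X_{1}\) is 2; the `residualize` helper is hypothetical and handles only a single control plus an intercept.

```python
def residualize(x, z):
    """Residuals from regressing x on z (with an intercept)."""
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    slope = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x)) / sum(
        (zi - z_bar) ** 2 for zi in z
    )
    return [(xi - x_bar) - slope * (zi - z_bar) for xi, zi in zip(x, z)]

# Hypothetical data, constructed so that y = 1 + 2*x1 - 3*x2 exactly.
x1 = [1.0, 2.0, 4.0, 3.0, 7.0, 5.0]
x2 = [0.0, 1.0, 1.0, 3.0, 2.0, 5.0]
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]

resid = residualize(x1, x2)
ss_resid = sum(r * r for r in resid)

# Frisch-Waugh-Lovell: regressing y on the residualized x1 recovers beta_1.
beta_1 = sum(r * yi for r, yi in zip(resid, y)) / ss_resid

# Regression weights: each observation's share of the residual variation.
reg_weights = [r * r / ss_resid for r in resid]
print(round(beta_1, 6), [round(w, 3) for w in reg_weights])
```

The weights are non-negative and sum to one, so they can be read as each observation’s share of the identifying variation.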

In this post, I’ll give a quick demo of regression weights, looking at Cook (2014) (sci-hub) (replication files) on the effect of racial violence on African American patents over 1870-1940. This paper starts with striking time series data on patents by African American inventors. In Figure 1, we see a big drop in black patents around 1900. What is driving this pattern?

Cook argues that race riots and lynchings cause reduced patenting, directly by intimidating inventors, and indirectly by undermining trust in intellectual property laws (if the government won’t punish race rioters, why should you believe it’ll enforce your patents?).

Table 7 contains the main state-level regressions of patents on lynching rates and riots. Using a random-effects model, Cook finds negative effects for both lynchings and riots. I find similar results with a fixed effects model.

Let’s do regression weights, first for the lynching result. I regress lynchings on the other variables, grab the residuals, square them, then normalize by the sum of squared residuals.

* Stata code:

use pats_state_regs_AAonly, clear

reghdfe lynchrevpc riot seglaw illit blksh regs regmw regne regw , ab(stateno year1910 year1913 year1928) vce(cl stateno) res(resid)

gen res1 = resid^2
egen resid_tot = total(res1)
gen regweight = res1/resid_tot

Next, let’s see how these weights vary by region.

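* note: region is a categorical variable built from the regional dummies; the code that creates it appears further down
table region, c(sum regweight count patent)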

This is a bit surprising. The South has 81% of the weight, with the remainder coming from the West. The other three regions have basically zero contribution to the lynchings coefficient.

So let’s see what’s happening in the data.

table region, c(mean lynchrevpc)

It turns out that basically all lynchings occurred in the South and West, with zero in the Midwest and Northeast (and roughly zero in Mid-Atlantic). Given this, the regression weights make sense. When there’s no variation in a variable, it should contribute nothing to the regression. But because the Midwest and Northeast have data on the other covariates, they still add some information, which is why the weights aren’t exactly zero.

Next, let’s see the results for the effect of riots on patenting. First, the regression weights, regressing riots on the other controls:

reghdfe riot lynchrevpc seglaw illit blksh regs regmw regne regw , ab(stateno year1910 year1913 year1928) vce(cl stateno) res(resid2)

gen res2 = resid2^2
egen resid_tot2 = total(res2)
gen regweight2 = res2/resid_tot2

table region, c(sum regweight2 count patent)

Again, the regional patterns are surprising. This time, the South has 27% of the weight, and the Mid-Atlantic has 73%, with the other regions contributing nothing. What’s going on?

* Definition of the region variable used in the tables above (run this block before them)
gen region = .
replace region = 1 if (regs)
replace region = 2 if (regmw)
replace region = 3 if (regne)
replace region = 4 if (regw)
replace region = 5 if (regmatl)

label define reg_label 1 "South" 2 "Midwest" 3 "Northeast" 4 "West" 5 "Mid-Atlantic"
label values region reg_label

table region, c(sum riot)

It turns out there are only 5 riots in the state-level data. Let’s dig deeper.

table stateno, c(sum regweight2 count patent)
table year, c(sum regweight2 count patent)

* output omitted

The regression weight is concentrated on three states: state 39 has 58%, state 44 has 27%, and state 33 has 15% (the names are not in the data). It’s also concentrated on four years: 56% on 1917, 12% on 1918, 15% on 1900, 12% on 1906. This is because there are five riots occurring in four years, with two in 1917 in state 39, two in state 44 in different years, and one in state 33. So the riot effect is driven almost entirely by the four state-year observations that had riots.

But wait. If you look, you’ll see that there are 35 riots in the time-series data.

use pats_time_series, clear
collapse (sum) riot if race==0
su riot

Where did the other riots go? It looks like the state data just has a lot of missing observations, which would explain the missing riots. That is, the issue isn’t variables with missing values, but that most state-year observations do not even have a row in the data. (I emailed Cook to ask about this, but didn’t get a response.) As you can see, the sample size fluctuates over time; this is far from a balanced panel.

Note that 1917 and 1918 have the majority of the weight, but there are only two observations in each of those years.

The paper is not very clear about this. Table 5 reports the descriptive stats, but only has the riots variable for the time-series data, and not the state-level data. And Cook does not plot any of the raw state-level data, but instead jumps right into the regressions.

This seems like a serious problem for the riot results. The paper isn’t estimating the effect of riots on patenting in general; instead, it’s estimating the effect of five specific riots. If we could collect data on the remaining 30 riots, I’d expect the estimate to change. In other words, why should we expect this result to be externally valid for other historical riots?

To sum up, regression weights are an easy way to dig into a paper and see exactly what’s driving its results.

Happy replicating!

Does meritocratic promotion explain China's growth?

2021-02-05T20:00:00+00:00

One explanation for China’s rapid economic growth is meritocratic promotion, where politicians with higher GDP growth are rewarded with promotion. In this system, politicians compete against each other in ‘promotion tournaments’ where the highest growth rate wins. This competition incentivizes politicians to grow the economy, and hence helps explain the stunning economic rise of China.

The literature on meritocratic promotion finds evidence of meritocracy for province, prefecture, and county leaders.¹ However, as I discuss in my dissertation, the evidence for province and prefecture leaders is weak. In the provincial literature, the initial positive finding was not confirmed in follow-up studies. And when I replicated the prefecture literature, I found that the results there were not robust. So we don’t have strong evidence that province and prefecture leaders are promoted based on GDP growth. But, using data from two papers, I did find some evidence for meritocratic promotion of county leaders (details here).

So how should we think about meritocracy in China? Despite the lack of evidence for meritocratic promotion at the province and prefecture levels, it’s still plausible that meritocracy has contributed to China’s growth. Let’s grant that county leaders are promoted meritocratically, directly incentivizing them to boost GDP growth.² This means that high-growth county leaders are promoted to prefecture positions. But since prefecture leaders then consist only of high-growth leaders, there isn’t enough variation in growth to implement a prefecture-level promotion tournament. In other words, range restriction prevents the Organization Department from implementing meritocratic promotion above the county level. Running a successful county-level promotion tournament precludes prefecture and provincial tournaments. Hence, the Organization Department must use other criteria in determining promotions of prefecture and provincial leaders.
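The range-restriction step can be illustrated with a small simulation. This is a minimal Python sketch under made-up assumptions: observed county growth is leader ability plus noise, and the top 10% of county leaders by growth are promoted. The sample size, noise level, and promotion rate are hypothetical, not estimates from any dataset.

```python
import random
import statistics

rng = random.Random(1)
n_county = 10000
promote_share = 0.10  # hypothetical promotion rate

# Hypothetical model: observed growth = leader ability + luck.
ability = [rng.gauss(0, 1) for _ in range(n_county)]
growth = [a + rng.gauss(0, 1) for a in ability]

# County tournament: promote the leaders with the highest growth.
ranked = sorted(range(n_county), key=lambda i: growth[i], reverse=True)
promoted = ranked[: int(promote_share * n_county)]

# Compare the spread of ability before and after selection.
var_all = statistics.pvariance(ability)
var_promoted = statistics.pvariance([ability[i] for i in promoted])
print(round(var_all, 2), round(var_promoted, 2))
```

Under these assumptions, the promoted group has noticeably less variance in ability than the full pool, which is the sense in which a successful county tournament leaves less to distinguish leaders at the next level.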

So county leaders are continuously incentivized to boost economic growth, and only leaders with demonstrated growth-boosting ability are promoted to prefecture and provincial positions. While they are not directly incentivized, these prefecture and province leaders are selected based on their ability to grow the economy, and they supervise the county leaders in their prefecture/province. We can think of this as a version of partial meritocracy, in contrast to a ‘maximal’ version where leaders at all levels are incentivized through promotion tournaments. While the maximal version provides the strongest incentives for boosting GDP growth, the partial version does generate some incentives as well.

Thus, despite the lack of evidence at higher levels of government, meritocracy does partly explain China’s economic growth.

Footnotes

Read my papers on meritocratic promotion: null result and replications.

There are six administrative levels in the Chinese government: center, province, prefecture, county, township, and village. ↩
Based on my experience replicating the prefecture literature, we should wait to see more evidence before drawing firm conclusions for county-level meritocracy (e.g., extending the sample period, trying different promotion definitions). ↩

Replicating the literature on meritocratic promotion in China

2021-02-04T20:00:00+00:00

China has had double-digit economic growth for nearly three decades. How can we explain this? In my dissertation, I studied one explanation that is backed up by a large literature: meritocratic promotion. The idea is that politicians compete in promotion tournaments, where the politician with the highest GDP growth rate in their jurisdiction is rewarded by being promoted. By tying promotion to economic growth, meritocratic promotion creates strong incentives to boost GDP, and hence helps explain China’s rapid growth.

When I collected data on prefecture politicians, however, I found no evidence for meritocracy: there was no correlation between GDP growth and promotion, despite trying many different models. How is this null result consistent with the positive findings in the rest of the literature? To find out, I replicated the main papers claiming evidence for prefecture-level meritocracy. Short answer: the literature is wrong.

This post summarizes my replications. I find that the results in the literature are not robust to reasonable specification changes, or are due to data errors. You can find the full details, and a few more replications, in the paper here.

Yao and Zhang (2015)

Yao and Zhang (2015), published in the Journal of Economic Growth, was the first paper to study meritocratic promotion at the prefecture level in China. They estimate a leader’s ability to grow GDP, and then estimate the relationship between ability and promotion. If promotion is meritocratic, we should see a positive correlation, as high-growth leaders are promoted.

However, they find no average correlation between leader ability and promotion: leaders with higher ability are not more likely to be promoted. Despite this, the authors do not frame their paper as contradicting the literature.¹ Moreover, this paper is cited in the literature as supporting the meritocracy hypothesis.²

This is because the authors further test for an interaction between leader ability and age, reporting a positive interaction effect that is significant at the 5% level. Narrowing in on specific age thresholds, they find that leader ability has the strongest effect on promotion for leaders older than 51. They conclude that leader ability matters for older politicians, because more years of experience produces a clearer signal of ability.

Now, this result is consistent with a limited promotion tournament, where the Organization Department promotes older leaders based on their ability to boost growth (because older leaders have clearer signals of ability), but applies different promotion criteria to younger leaders (whose signals are too weak to detect). But this limited model contradicts the usual characterization of China’s promotion tournament as including all leaders, irrespective of age: in each province, leaders compete to boost GDP growth, and the winners are rewarded with promotion.

This is actually a big discrepancy, because half of all promotions occur for leaders younger than 51. If the Organization Department cannot measure ability for these young leaders, what criteria does it use to promote them? Furthermore, remember that the original motivation was to explain China’s rapid growth. The incentives generated by this limited tournament are weaker, since the reward is only applied later in life; if young leaders are impatient, they will discount this future reward and put less effort into boosting growth. The limited tournament model has less explanatory power.

At this point, it is not clear to me why this paper has been cited without qualification as evidence for meritocratic promotion. It offers no general support for meritocracy, and its model of a limited promotion tournament partly contradicts the literature.

But I’m not stopping here. Finding a null average effect with a significant interaction is a classic formula for p-hacked results in social psychology. Since the age interaction doesn’t make much sense, I don’t believe that the authors started out planning to run this test. Rather, it looks like they wanted to find a positive average effect, but didn’t. But they’d already invested a lot of time in collecting the data and working out a clever identification strategy, so they found an interaction that got them statistical significance, even if the interpretation wasn’t really consistent. Hence, I reject their p-value as invalid.³

And it turns out that this is the right call. Digging into the paper, I find that the significant interaction term depends on including questionable control variables.

When estimating leader ability, the authors regress GDP growth on three fixed effects (leader, city, year) as well as three covariates: initial city GDP per capita (by leader term), annual city population, and the annual provincial inflation rate. I think it makes sense to control for initial GDP by term. The model includes city effects, so level differences in growth rates are not an issue. But we might worry that the variance of idiosyncratic shocks to growth is correlated with city size, and growth shocks could affect promotion outcomes.

However, it is not clear why population and inflation should be included. The authors mention that labor migration can drive GDP growth (p.413), but a leader’s policies affect migration, so population is plausibly a collider or ‘bad control’, if leader ability affects growth through good policies that increase migration. The authors provide no justification for including inflation, which is odd because the dependent variable (real per capita GDP growth) is already expressed in real (rather than nominal) terms.

Given the lack of justification for including population and inflation as covariates, I re-estimate leader ability controlling only for initial GDP. Using this new estimate of ability, I then replicate their main results. I again find a nonsignificant average effect of ability on promotion. But now the interaction with age disappears. The sign remains positive, but the magnitude of the coefficient drops by half, and the results are nonsignificant.

So it turns out that Yao and Zhang (2015) offers no evidence for meritocratic promotion of prefecture leaders.

Li et al. (2019)

Li et al. (2019), published in the Economic Journal, studies GDP growth targets and promotion tournaments in China. They start with the observation that growth targets are higher at lower levels of the administration; for example, prefectures set higher targets than do provinces. Their explanation is that the number of jurisdictions competing in each promotion tournament is decreasing as one moves down the hierarchy, which increases the probability of a leader winning the tournament. As a consequence, leaders exert more effort, and higher-level governments can set higher growth targets without causing leaders to quit.

As part of their model, they assume that promotion is meritocratic: performance (measured by GDP growth) increases the probability of promotion. Further, they report an original result: the effect of performance on promotion is increasing in the growth target faced. That is, a one percentage-point increase in growth will increase a mayor’s chances of promotion by a larger amount when the provincial target is higher, relative to when the target is lower.

This result seems naturally testable by interacting \(Growth \times Target\) in a panel regression, with a predicted positive coefficient on the interaction term. However, the authors argue that OLS is invalid, instead reporting results based on maximum likelihood where promotion is determined by a contest success function. Why does OLS not apply? “Standard linear regression does not work here partly because promotion is determined by local officials’ own growth rates as well as by the growth rates of their competitors. The nonlinearity of the promotion function is another factor that invalidates the OLS estimation.” (p.2906)
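For concreteness, the linear test could be written as follows. This is only an illustrative specification: the leader-level index \(i\), province index \(p\), and the fixed effects \(\mu_{i}\) and \(\lambda_{t}\) are assumptions made here, not the specification in Li et al. (2019).

\[ Promotion_{it} = \beta_{1} Growth_{it} + \beta_{2} Target_{pt} + \beta_{3} \left( Growth_{it} \times Target_{pt} \right) + \mu_{i} + \lambda_{t} + \varepsilon_{it} \]

The paper's claim corresponds to the prediction \(\beta_{3} > 0\).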

But these are not problems for OLS. First, as is standard in this literature, the promotion tournament can be captured by using prefecture growth rates relative to the annual provincial growth rate. Second, OLS is the best linear approximation to a nonlinear conditional expectation function. So if there is a positive nonlinear relationship between promotion and growth, we should expect that it will be detected by OLS.⁴
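The second point can be checked with a quick simulation. This is a minimal Python sketch on made-up data: promotion is drawn from a logistic (nonlinear) function of relative growth, and a linear probability model is then fit by OLS. The logistic form and its coefficients are assumptions for illustration, not the contest success function used by Li et al. (2019).

```python
import math
import random

def ols_slope(x, y):
    """Slope from a univariate OLS regression with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(2)
n = 5000

# Hypothetical data: growth relative to the provincial average, and a
# promotion probability that is a nonlinear function of that growth.
growth = [rng.gauss(0, 1) for _ in range(n)]
promoted = [
    1.0 if rng.random() < 1 / (1 + math.exp(-(-1.5 + g))) else 0.0
    for g in growth
]

# Linear probability model: regress the 0/1 outcome on growth.
slope = ols_slope(growth, promoted)
print(round(slope, 3))
```

Even though the true relationship is nonlinear, the linear slope should come out clearly positive in this setup.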

Given the lack of justification for omitting results from linear regression, I replicate their results using a linear probability model and logistic regression. First, I test the generic meritocracy hypothesis. I find that GDP growth has no average effect on promotion. Next, I do find a positive interaction effect between growth and growth target, but it’s not statistically significant.

This doesn’t look good for the authors. OLS is the default method, and you need a strong justification for not reporting it. But their reasons are flimsy. Now it looks like they tried OLS, didn’t get the result they wanted, then made up a complicated maximum likelihood model that delivered significance.

So Li et al. (2019) is another paper that claims to provide evidence for meritocratic promotion of prefecture leaders, but is unable to back up those claims.

Chen and Kung (2019)

Chen and Kung (2019), published in the Quarterly Journal of Economics, studies land corruption in China, with secondary results on meritocratic promotion. The main result is that local politicians provide price discounts on land sales to firms connected to Politburo members, and these local politicians are in turn rewarded with promotion up the bureaucratic ladder.

For provincial leaders, they find a strong effect of land sales on promotion for secretaries, but not for governors. In contrast, GDP growth strongly predicts promotion for governors, but not secretaries. They conclude that “the governor has to rely on himself for promotion, specifically by improving economic performance or GDP growth in his jurisdiction [...] only the provincial party secretaries are being rewarded for their wheeling and dealing”.

They find similar results at the prefecture level: land deals predict promotion for secretaries, but not for mayors, while GDP growth predicts promotion for mayors, but not for secretaries. Overall, this supports the model of party secretaries being responsible for social policy, while governors (and mayors) are in charge of the economy, with performance on these tasks determining promotion. Thus, at both province and prefecture levels, government leaders (governors and mayors) compete in a promotion tournament based on GDP growth, while party secretaries do not.

However, Chen and Kung (2019)’s results for prefecture mayors are questionable, because their promotion data seems wrong. In my data, the annual promotion rate varies from 5 to 30% (peaking in Congress years), while the Chen and Kung (2019) data never exceeds 15% and has six years where the promotion rate is less than 2%. Figure 1 compares the annual promotion rate from Chen and Kung to my own data as well as the data from Yao and Zhang (2015) and Li et al. (2019), where each paper uses a binary promotion variable (and data on prefecture mayors). While the latter three sources broadly agree on the promotion rate, the Chen and Kung data is a clear outlier. This is obviously suspect.

Furthermore, upon investigating this discrepancy, I discovered apparent data errors in their promotion variable. The annual promotion variable is defined to be 1 in the year a mayor is promoted, and 0 otherwise. However, out of the 201 cases with \(Promotion=1\), 124 occur before the mayor’s last year in office (with the remaining 77 cases occurring in the last year). Moreover, this variable is equal to 1 multiple times per spell in 4% of leader spells. Out of 1216 spells, 51 spells have \(Promotion=1\) more than once per spell. For example, consider a mayor who is in office for five years and then promoted; the promotion variable should be 0 in the first four years, then 1 in the final year. However, the Chen and Kung data has spells where the promotion variable is, for example, 0 in the first two years, and 1 in the final three years.
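These two consistency checks are easy to automate. Here is a minimal Python sketch on a hypothetical mayor-year table; the function name `check_promotion_coding`, the row layout, and the example rows are made up for illustration and are not the Chen and Kung replication files.

```python
from collections import defaultdict

def check_promotion_coding(rows):
    """rows: (spell_id, year, promotion) tuples, one per mayor-year.
    Returns two sorted lists of spell ids: spells where promotion == 1
    before the final year, and spells where it is 1 more than once."""
    spells = defaultdict(list)
    for spell_id, year, promotion in rows:
        spells[spell_id].append((year, promotion))
    early, repeated = [], []
    for spell_id, obs in spells.items():
        flags = [p for _, p in sorted(obs)]  # order by year
        if any(flags[:-1]):
            early.append(spell_id)
        if sum(flags) > 1:
            repeated.append(spell_id)
    return sorted(early), sorted(repeated)

# Hypothetical mayor-year rows.
rows = [
    ("a", 2001, 0), ("a", 2002, 0), ("a", 2003, 1),  # coded correctly
    ("b", 2001, 0), ("b", 2002, 1), ("b", 2003, 1),  # promoted twice
    ("c", 2004, 1), ("c", 2005, 0),                  # promoted too early
    ("d", 2004, 0), ("d", 2005, 0),                  # never promoted
]
early, repeated = check_promotion_coding(rows)
print(early, repeated)
```

A correctly coded variable would return two empty lists.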

To fix this error, I obtained the raw mayor data from James Kung, and used it to generate a corrected annual promotion variable, which is 1 only in a mayor’s final year in office (when the mayor is promoted). This data-coding error more than doubles the number of promotions. But since the Chen and Kung promotion rate is smaller than the rest of the literature, fixing the data errors in fact makes the disagreement with the literature even more pronounced.

So this promotion data looks pretty lousy. Naturally, we should worry that their data is driving their finding of meritocratic promotion for prefecture mayors. To test this, I re-run their analysis using my own promotion data. I find that the correlation between GDP growth and promotion is now negative and nonsignificant. So just like the other two papers, Chen and Kung (2019) also fails to provide evidence for meritocratic promotion of prefecture leaders.

This is extremely suspicious. Speculating, it looks like the authors had a nice paper using provincial data, but a referee asked them to extend it to prefecture leaders. To fit their story, they needed to find an effect of land sales for secretaries (but not mayors), and an effect of GDP growth for mayors (but not secretaries). But maybe the data didn’t agree, and their RA had to falsify the mayor promotion data to get the ‘correct’ result. This wouldn’t be easy for referees to spot, since the replication files didn’t include spell-level data. But how else did they collect such error-ridden data that also just happened to produce results consistent with their story?

Conclusion

The original study of meritocratic promotion for provincial leaders, Li and Zhou (2005), has been cited over 2500 times. But follow-up work has repeatedly failed to confirm its finding of a positive correlation between provincial GDP growth and promotion.⁵ And as I have shown in this post, attempts to extend the meritocracy story down to prefecture leaders have also failed.

How did this happen? How could a whole literature get this wrong?

Here’s my guess: researchers set a strong prior based on the provincial result in Li and Zhou (2005), combined with the elegance of the theoretical model of a promotion tournament. Since the idea of a promotion tournament is generic, researchers naturally expected it to apply to prefecture and county politicians as well. In short, researchers doing follow-up work knew that they had to confirm the original results.

However, when they studied prefecture leaders and didn’t find a positive correlation between growth and promotion, the researchers had to fiddle around with their models and data until they got a result that matched the original. And given the multiplicity of design choices⁶, it wasn’t that difficult to find a specification that yielded statistical significance.

But why not embrace the null result and contradict the literature? After all, this is a case where a null result would be interesting, with adequate statistical power and a well-established consensus. I guess it was just easier to shoehorn their results to fit in with the literature, and get the publication, rather than challenge the consensus.

My conclusion is that publication incentives, conformism, and inadequate peer review led to a literature of false results.

Footnotes

Read the full paper here. My null result paper is here.

“We also improve on the existing literature on the promotion tournament in China. Using the leader effect estimated for a leader’s contribution to local growth as the predictor for his or her promotion, we refine the approach of earlier studies.” (Yao and Zhang 2015, p.430) ↩
For example, Chen and Kung (2016): “those who are able to grow their local economies the fastest will be rewarded with promotion to higher levels within the Communist hierarchy [...] Empirical evidence has indeed shown a strong association between GDP growth and promotion ([...] Yao and Zhang, 2015)”. ↩
In a previous post, I discussed how p-values involve the thought experiment of running the exact same test on many samples of data. When designing a test, researchers need to follow a procedure that is consistent with this thought experiment. In particular, they need to design the test independently of the data; this guarantees that they would run the same test on different samples. As Gelman and Loken put it: “For a p-value to be interpreted as evidence, it requires a strong claim that the same analysis would have been performed had the data been different.”

As it happens, Yao has recently posted a working paper re-using the method in Yao and Zhang (2015). Like the first paper, the new one also studies how ability affects promotion for prefecture-level leaders, using the same approach to estimate leader effects. Importantly, they update their data on prefecture cities by extending the time series from 2010 to 2017. Thus, we have a perfect test case to see whether the same data-analysis decisions would be made when studying the same question and using a different dataset (drawn from the same population).

It turns out that the new paper doesn’t interact with age at all! Instead, it reports the average effect of ability on promotion, which is now significant, along with a new specification where ability is interacted with political connections (see Table 2). So the p-value requirement is not satisfied: the researcher performs different analyses when the data is different. Hence, our skepticism of the original age interaction turns out to be justified. Since the researcher would not run the same test on new samples, the significant p-value is actually invalid and does not count as evidence. ↩
One of the authors, Li-An Zhou, was also an author on the first paper on meritocratic promotion, Li and Zhou (2005). That paper used an ordered probit model, so it is curious that they didn’t employ the same model again here. ↩
Su et al. (2012) claims that the results in Li and Zhou (2005) don’t replicate, after fixing data errors. Shih et al. (2012) finds that political connections, rather than economic growth, explain promotion. Jia et al. (2015) finds no average effect, but does report an interaction effect with political connections. Sheng (2020) finds a meritocratic effect, but only for provincial governors during the Jiang Zemin era (1990-2002). In my dissertation, I replicate this paper using the data from Jia et al. (2015); I find no effect. ↩
Here are a few of the researcher degrees of freedom available when studying meritocratic promotion: promotion definitions; growth definitions (annual vs. cumulative average vs. average GDP growth, absolute vs. relative GDP growth [relative to predecessor vs. relative to provincial average vs. relative to both], real vs. nominal GDP, level vs. per capita GDP); regression models (LPM vs. probit/logit vs. ordered probit/logit vs. AKM leader effects vs. MLE with contest success function vs. proportional hazards model); interactions (with age, political connections [hometown vs. college vs. workplace], provinces of corrupt politicians, time period); data construction (annual vs. spell-level), and so on. ↩

Dropping 1% of the data kills false positives

2021-01-26T20:00:00+00:00

How robust are false positives to dropping 1% of your sample? Turns out, not at all.

Rachael Meager and co-authors have a paper with a new robustness metric based on dropping a small fraction of the sample. It’s called the Approximate Maximum Influence Perturbation (AMIP). Basically, their algorithm finds the observations that, when dropped, have the biggest influence on an estimate. It calculates the smallest fraction required to change an estimate’s significance, sign, and both significance and sign. In other words, if you have a significant positive result, it calculates the minimum fractions of data you need to drop in order to (1) kill significance, (2) get a negative result, and (3) get a significant negative result. The intuition here is to check whether there are influential observations that are driving a result.¹ And influence is related to the signal-to-noise ratio, where the signal is the true effect size and the noise is the relative variance of the residuals and the regressors.

In a previous post, I explored how p-hacked false positives can be robust to control variables. In this post, I want to see how p-hacked results fare under this new robustness test.

Robustness of true effects

First, let’s show that real effects are robust to dropping data. I generate data according to:

\[\tag{1} y_{i} = \beta X_{i} + \varepsilon_{i},\]

where \(X\) and \(\varepsilon\) are each distributed \(N(0,1)\). I then apply the AMIP algorithm. I repeat this process for different values of \(\beta\), and the results are shown in Figure 1.

We see that as the effect size increases, the fraction of data that must be dropped in order to flip a condition also increases. When \(\beta=0.2\), we need to drop more than 5% of the data to kill significance. This makes sense, because the true effect size increases the signal and hence the signal-to-noise ratio.
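The logic of this exercise can be sketched in a few lines of Python. This is a simplified stand-in, not the AMIP algorithm itself: it ranks observations by their full-sample influence on the slope (centered regressor times residual) and then re-fits the regression after each drop, whereas AMIP uses a linear approximation instead of re-fitting. It also assumes a positive estimate. The sample size and the values of \(\beta\) are hypothetical.

```python
import math
import random

def fit(x, y):
    """Univariate OLS with intercept: slope, t-statistic, influence scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [xi - mx for xi in x]
    sxx = sum(v * v for v in xc)
    slope = sum(v * (yi - my) for v, yi in zip(xc, y)) / sxx
    resid = [(yi - my) - slope * v for v, yi in zip(xc, y)]
    se = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # Observations with large positive scores pull the slope upward.
    influence = [v * e / sxx for v, e in zip(xc, resid)]
    return slope, slope / se, influence

def fraction_to_kill_significance(x, y, crit=1.96):
    """Drop the most upward-influential observations one at a time
    until the t-statistic falls below crit; return the fraction dropped."""
    n = len(x)
    _, _, influence = fit(x, y)
    order = sorted(range(n), key=lambda i: influence[i], reverse=True)
    for k in range(n - 3):
        keep = order[k:]
        _, t, _ = fit([x[i] for i in keep], [y[i] for i in keep])
        if t < crit:
            return k / n
    return 1.0

rng = random.Random(3)
n = 1000
x = [rng.gauss(0, 1) for _ in range(n)]
eps = [rng.gauss(0, 1) for _ in range(n)]

results = {}
for beta in (0.1, 0.3):
    y = [beta * xi + e for xi, e in zip(x, eps)]  # equation (1)
    results[beta] = fraction_to_kill_significance(x, y)
print(results)
```

Because the same draws of \(X\) and \(\varepsilon\) are reused for both values of \(\beta\), any difference in the fraction dropped comes from the effect size alone.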

Robustness of p-hacked results

Next, let’s see how robust a p-hacked result is. Now I generate data according to

\[\tag{2} y_{i} = \sum_{k=1}^{K} \beta_{k} X_{k,i} + \gamma z_{i} + \varepsilon_{i}.\]

We have \(K\) potential treatment variables, \(X_{1,i}\) to \(X_{K,i}\), and a control variable \(z_{i}\). I draw \(X_{k,i}\), \(z_{i}\), and \(\varepsilon_{i}\) from \(N(0,1)\). I set \(\beta_{k}=0\) for all \(k\), so that \(X_{k}\) has no effect on \(y\), and the true model is

\[\tag{3} y_{i} = \gamma z_{i} + \varepsilon_{i}.\]

I’m going to p-hack using the \(X_{k}\)’s, running \(K=20\) univariate regressions of \(y\) on \(X_{k}\) and selecting the one with the smallest p-value. Then I run the AMIP algorithm to calculate the fraction of data needed to kill significance, etc.
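The selection step can be sketched as follows. This minimal Python illustration covers only the p-hacking part (picking the smallest of \(K=20\) p-values under the true model in equation (3)), not the AMIP step. The sample size, \(\gamma=0.5\), the number of simulations, and the normal approximation to the t distribution are assumptions made for the sketch.

```python
import math
import random
from statistics import NormalDist

def p_value(x, y):
    """Two-sided p-value for the slope of a univariate OLS regression,
    using a normal approximation to the t distribution."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [xi - mx for xi in x]
    sxx = sum(v * v for v in xc)
    slope = sum(v * (yi - my) for v, yi in zip(xc, y)) / sxx
    rss = sum(((yi - my) - slope * v) ** 2 for v, yi in zip(xc, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return 2 * (1 - NormalDist().cdf(abs(slope / se)))

def p_hack(rng, n=100, k=20, gamma=0.5):
    """Generate y from z only, then return the smallest p-value
    across k irrelevant candidate treatment variables."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    y = [gamma * zi + rng.gauss(0, 1) for zi in z]
    p_values = []
    for _ in range(k):
        x_k = [rng.gauss(0, 1) for _ in range(n)]
        p_values.append(p_value(x_k, y))
    return min(p_values)

rng = random.Random(4)
sims = 300
hits = sum(p_hack(rng) < 0.05 for _ in range(sims))
share_significant = hits / sims
print(share_significant)
```

With 20 independent tries at the 5% level, roughly \(1 - 0.95^{20} \approx 64\%\) of simulated datasets should yield at least one "significant" false positive.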

In my previous post on p-hacking, we learned that when \(\gamma\) is small, the partial-\(R^{2}(z)\) is small, and controlling for \(z\) is not able to kill coincidental false positives. To see whether dropping data is a better robustness check, I repeat the above process for different values of \(\gamma\).

The results are in Figure 2. First, notice that we lose significance after dropping a tiny fraction of the data: about 0.3%. For \(N=1000\), that means 3 observations are driving significance.

Second, we see that the fraction dropped doesn’t vary with \(\gamma\) at all. This is good news: previously, we saw that control variables only kill false positives when they have high partial-\(R^{2}\). But dropping influential observations is equally effective for any value of \(\gamma\). So dropping data is an effective robustness check where control variables fail.

Overall, dropping data looks like an effective robustness check against coincidental false positives. Hopefully this metric becomes a widely used robustness check, and will help root out bad research.

Update (Nov. 5, 2021): Ryan Giordano gives a formal explanation here.

Footnotes

See here for R code.

In the univariate case, influence = leverage \(\times\) residual. ↩

Is randomization inference a robustness check? For what?

2021-01-23T20:00:00+00:00

I’ve seen a few papers that use randomization inference as a robustness check. They permute their treatment variable many times, and estimate their model for each permutation, producing a null distribution of estimates. From this null distribution we can calculate a randomization inference (RI) p-value as the fraction of estimates that are more extreme than the original estimate. (This works because under the null hypothesis of no treatment effect, switching a unit from the treatment to the control group has no effect on the outcome.) These papers show that RI p-values are similar to their conventional p-values, and conclude that their results are robust.

But robust to what, exactly?

Consider the case of using control variables as a robustness check. When adding controls to a regression, we’re showing that our result is not driven by possible confounders. If the coefficient loses significance, we conclude that the original effect was spurious. But if the coefficient is stable and remains significant, then we conclude that the effect is not driven by confounding, and we say that it is robust to controls (at least, the ones we included).

Returning to randomization inference, suppose our result is significant using conventional p-values (\(p<0.05\)), but not with randomization inference (\(p_{RI}>0.05\)). What’s happening here? Young (2019) says that conventional p-values can have ‘size distortions’ when the sample size is small and treatment effects are heterogeneous, resulting in concentrated leverage. This means that the size, AKA the false positive rate \(P(\text{reject } H_{0} \mid H_{0}) = P(p<\alpha \mid H_{0})\), is higher than the nominal significance level \(\alpha\).¹ For instance, using \(\alpha =0.05\), we might have a false positive rate of \(0.1\). In this case, conventional p-values are invalid.

By comparison, RI has smaller size distortions. It performs better in settings of concentrated leverage, since it uses an exact test (with a known distribution for any sample size \(N\)), and hence doesn’t rely on convergence as \(N\) grows large. See Young (2019) for details. Upshot: we can think of RI as a robustness test for finite sample bias (in otherwise asymptotically correct variance estimates).

So in the case where we lose significance using RI (\(p<0.05\) and \(p_{RI}>0.05\)), we infer that the original result was driven by finite sample bias. In contrast, if \(p \approx p_{RI} <0.05\), then we conclude that the result is not driven by finite sample bias.
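The comparison between \(p\) and \(p_{RI}\) can be sketched in Python. This is a minimal illustration on made-up data with a simple difference in means, not a replication of any paper: the conventional p-value uses a normal approximation, and the RI p-value uses the common \((1 + \text{count}) / (1 + \text{draws})\) convention, which is an assumption added here.

```python
import math
import random
from statistics import NormalDist, mean, variance

def diff_in_means(y, d):
    """Difference in mean outcomes, plus the two groups of outcomes."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control), treated, control

def conventional_p(y, d):
    """Two-sided p-value from a normal approximation."""
    est, treated, control = diff_in_means(y, d)
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    return 2 * (1 - NormalDist().cdf(abs(est / se)))

def ri_p(y, d, rng, draws=999):
    """Share of permuted treatment assignments whose estimate is at
    least as extreme as the observed one."""
    observed = abs(diff_in_means(y, d)[0])
    d_perm = list(d)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(d_perm)
        if abs(diff_in_means(y, d_perm)[0]) >= observed:
            extreme += 1
    return (1 + extreme) / (1 + draws)

rng = random.Random(5)
n = 200
d = [1] * (n // 2) + [0] * (n // 2)
# Hypothetical outcome with a true treatment effect of 1.
y = [1.0 * di + rng.gauss(0, 1) for di in d]

p_conv = conventional_p(y, d)
p_ri = ri_p(y, d, rng)
print(p_conv, p_ri)
```

In a large sample with a real effect like this one, the two p-values should agree that the result is significant; the interesting cases are small samples with concentrated leverage, where they can diverge.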

So RI is a useful robustness check for when we’re worried about finite sample bias. However, this is not the justification I’ve seen when papers use RI as a robustness check.

Randomization inference in Gavrilova et al. (2019)

This paper (published in Economic Journal) studies the effect of medical marijuana legalization on crime, finding that legalization decreases crime in states that border Mexico. The paper uses a triple-diff method, essentially doing a diff-in-diff for the effect of legalization on crime, then adding an interaction for being a border state.

Here’s how they describe their randomization inference exercise (p.24):

We run an in-space placebo test to test whether the control states form a valid counterfactual to the treatment states in the absence of treatment. In this placebo-test, we randomly reassign both the treatment and the border dummies to other states. We select at random four states that represent the placebo-border states. We then treat three of them in 1996, 2007 and 2010 respectively, coinciding with the actual treatment dates in California, Arizona and New Mexico. We also randomly reassign the inland treatment dummies and estimate [Equation] (1) with the placebo dummies rather than the actual dummies. [...]

If our treatment result is driven by strong heterogeneity in trends, the placebo treatments will often find an effect of similar magnitude and our baseline coefficient of -107.98 will be in the thick of the distribution of placebo-coefficients. On the other hand, if we are measuring an actual treatment effect, the baseline coefficient will be in the far left tail of the distribution of placebo-coefficients. [...]

Our baseline-treatment coefficient is in the bottom 3rd-percentile of the distribution. This result is consistent with a p-value of about 0.03 using a one-sided t-test.

At first glance, this reasoning seems plausible. I believed it when I first read this paper. The only problem is that it’s wrong.²

To prove this, I run a simulation with differential pre-trends (i.e., a violation of the common trends assumption). I simulate panel data for 50 states over 1995-2015 according to:

\[\tag{1} y_{st} = \beta D_{st} + \gamma_{s} + \gamma_{t} + \lambda \times D_{s} \times (t-1995) + \varepsilon_{st}.\]

Here \(D_{st}\) is a treatment dummy, equal to 1 in the years a state is treated. Ten states are treated, with treatment years selected randomly in 2000-2010; so this is a staggered rollout diff-in-diff. I draw state and year fixed effects separately from \(N(0,1)\). To generate a false positive, I set \(\beta=0\) and include a time trend only for the treatment group: \(\lambda \times D_{s} \times (t-1995)\), where \(\lambda\) is the value of the trend, and \(D_{s}\) is an ever-treated indicator.
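The data-generating process in Equation (1) can be sketched in Python as follows. This is a simplified illustration, not the R code behind the results: it assumes a balanced panel, so the two-way fixed effects estimate can be computed by double-demeaning, and it omits standard errors. The function names and the seed are assumptions.

```python
import random


def simulate_panel(lam, beta=0.0, n_states=50, n_treated=10, seed=0):
    """Simulate y_st = beta*D_st + state FE + year FE + lam*D_s*(t-1995) + noise."""
    rng = random.Random(seed)
    years = list(range(1995, 2016))
    state_fe = [rng.gauss(0, 1) for _ in range(n_states)]
    year_fe = [rng.gauss(0, 1) for _ in years]
    # The first n_treated states are ever-treated, starting in a random year in 2000-2010.
    start = [rng.randint(2000, 2010) if s < n_treated else None
             for s in range(n_states)]
    D, Y = [], []
    for s in range(n_states):
        d_row, y_row = [], []
        for j, t in enumerate(years):
            ever_treated = start[s] is not None
            d = 1.0 if ever_treated and t >= start[s] else 0.0
            trend = lam * (t - 1995) if ever_treated else 0.0
            d_row.append(d)
            y_row.append(beta * d + state_fe[s] + year_fe[j] + trend + rng.gauss(0, 1))
        D.append(d_row)
        Y.append(y_row)
    return D, Y


def two_way_demean(M):
    """Subtract state means and year means, add back the grand mean (balanced panel)."""
    n, T = len(M), len(M[0])
    row = [sum(r) / T for r in M]
    col = [sum(M[s][j] for s in range(n)) / n for j in range(T)]
    grand = sum(row) / n
    return [[M[s][j] - row[s] - col[j] + grand for j in range(T)] for s in range(n)]


def twfe_estimate(D, Y):
    """Coefficient on D in a regression of y on D with state and year fixed effects."""
    Dd, Yd = two_way_demean(D), two_way_demean(Y)
    num = sum(d * y for dr, yr in zip(Dd, Yd) for d, y in zip(dr, yr))
    den = sum(d * d for dr in Dd for d in dr)
    return num / den


D0, Y0 = simulate_panel(lam=0.0, seed=1)   # common trends hold
D1, Y1 = simulate_panel(lam=0.1, seed=1)   # differential trend for treated states
print("estimate, no trend:  ", twfe_estimate(D0, Y0))
print("estimate, with trend:", twfe_estimate(D1, Y1))
```

With the same seed, the two panels differ only by the trend term, so the gap between the two estimates isolates the bias from the violated common trends assumption, even though \(\beta=0\) in both.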

Because the common trends assumption is not satisfied, diff-in-diff will wrongly estimate a significant treatment effect. According to Gavrilova et al., randomization inference will yield a large, nonsignificant p-value, since the placebo treatments will also pick up the differential trends and give similar estimates.

But this doesn’t happen. When I set a trend value of \(\lambda=0.1\), I get a significant diff-in-diff estimate 100% of the time, using both conventional and RI p-values. The actual coefficient is always in the tail of the randomization distribution, and not "in the thick" of it. The false positive is just as significant when using RI.

More generally, when varying the trend, I find that \(p \approx p_{RI}\). Figure 1 shows average rejection rates and p-values for the diff-in-diff estimate across 100 simulations, for different values of differential trends. We see that, on average, conventional and RI p-values are almost identical. As a result, the rejection rates are also similar.

From the discussion above, we expect \(p_{RI}\) to differ when leverage is concentrated, due to small sample size and heterogeneous effects. Since this is not the case here, RI and conventional p-values are similar.

So Gavrilova et al.’s small randomization inference p-value does not prove that their result isn’t driven by differential pre-trends. False positives driven by differential trends also have small RI p-values. Randomization inference is a robustness check for finite sample bias, and nothing more.

Appendix: p-hacking simulations

Here I run simulations where I p-hack a significant result in different setups, to see whether randomization inference can kill a false positive. I calculate an RI p-value by reshuffling the treatment variable and calculating the t-statistic. I repeat this process 1000 times, and calculate \(p_{RI}\) as the fraction of randomized t-statistics that are larger (in absolute value) than the original t-statistic. According to Young (2019), using the t-statistic produces better performance than using the coefficient.

(1) Simple OLS

Constant effects: \(\beta=0\)

Data-generating process (DGP) with \(\beta=0\):

\[\tag{2} y_{i} = \sum_{k=1}^{K} \beta_{k} X_{k,i} + \varepsilon_{i} = \varepsilon_{i}\]

I p-hack a significant result by regressing \(y\) on \(X_{k}\) for \(k=1:K=20\), and selecting the \(X_{k}\) with the smallest p-value. I use \(N=1000\) and a significance level of \(\alpha=0.05\).

Running 1000 simulations, I find:

An average rejection rate of 5.1% for the \(K=20\) regressions.
653 estimates (65%) are significant.³
Out of these 653, 638 (98%) are significant using RI.

In this simple vanilla setup, RI and classical p-values are basically identical. So randomization inference is completely ineffective at killing false positives in a setting with large sample size and homogeneous effects.
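One replication of this exercise can be sketched in Python as below. It is a rough illustration under stated assumptions: the conventional p-value uses a normal approximation to the t distribution, and the defaults (`n=200`, `n_perm=200`) are smaller than the \(N=1000\) and 1000 permutations used above, to keep the sketch fast. Function names are hypothetical.

```python
import math
import random
import statistics


def t_stat(x, y):
    """t-statistic on the slope in a univariate OLS regression of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    ssr = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return slope / se


def p_hack_then_ri(n=200, k=20, n_perm=200, seed=0):
    """P-hack the best of k null regressors, then compute its RI p-value."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]                      # beta = 0 for every X_k
    xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    t_values = [t_stat(x, y) for x in xs]
    best = max(range(k), key=lambda j: abs(t_values[j]))         # smallest p = largest |t|
    t_best = abs(t_values[best])
    # Conventional two-sided p-value (normal approximation, an assumption).
    p_conv = 2 * (1 - statistics.NormalDist().cdf(t_best))
    # RI p-value: reshuffle the selected regressor and recompute the t-statistic.
    shuffled = list(xs[best])
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(t_stat(shuffled, y)) >= t_best:
            extreme += 1
    return p_conv, extreme / n_perm


p_conv, p_ri = p_hack_then_ri()
print("conventional p:", p_conv, " RI p:", p_ri)
```

Note that the permutation step only reshuffles the regressor that was already selected; it does not redo the search over all \(K\) regressors, which is why it cannot undo the p-hacking.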

Heterogeneous effects: \(\beta_{i} \sim N(0,1)\)

DGP with \(\beta_{k,i} \sim N(0,1)\):

\[\tag{3} y_{i} = \sum_{k=1}^{K} \beta_{k,i} X_{k,i} + \varepsilon_{i}\]

Again, I p-hack by cycling through the \(X_{k}\)’s and selecting the most significant one.

From 1000 simulations, I find:

An average rejection rate of 5.2%.
645 estimates (65%) are significant.
Out of these 645, 638 (99%) are significant using RI.

Even with heterogeneous effects, \(N=1000\) is enough to avoid finite sample bias, so RI p-values are no different.

(2) Difference in differences

Next I simulate panel data and estimate a diff-in-diff model as above, but with no differential trends.

Constant effects: \(\beta=0\)

DGP:

\[\tag{4} y_{st} = \beta D_{st} + \gamma_{s} + \gamma_{t} + \varepsilon_{st}\]

I simulate panel data for 50 states over 1995-2015. 10 states are treated, with treatment years selected randomly in 2000-2010; so this is a staggered rollout diff-in-diff. I draw state and year fixed effects separately from \(N(0,1)\). To generate a false positive, I set \(\beta=0\).

I p-hack a significant result by regressing \(y\) on K different treatment assignments \(D_{k,st}\) in a two-way fixed effects model, and selecting the regression with the smallest p-value. I cluster standard errors at the state level.

From 1000 simulations, I get

An average rejection rate of 7.9%.
815 estimates (82%) are significant.
Out of these 815, 653 (80%) are significant using RI.

So now we are seeing a size distortion using conventional p-values: out of the \(K=20\) regressions, 7.9% are significant, instead of the expected 5%. This appears to be driven by a small sample size and imbalanced treatment: 10 out of 50 states are treated. When I redo this exercise with \(N=500\) states and 250 treated, I get the expected 5% rejection rate.

However, RI doesn’t seem to be much of an improvement, as the majority of false positives remain significant using \(p_{RI}\).

Heterogeneous effects: \(\beta_{s} \sim N(0,1)\)

DGP:

\[\tag{5} y_{st} = \beta_{s} D_{st} + \gamma_{s} + \gamma_{t} + \varepsilon_{st}\]

Now I repeat the same exercise, but with the 10 treated states having treatment effects drawn from \(N(0,1)\).

From 1000 simulations, I get

An average rejection rate of 7.7%.
790 estimates (79%) are significant.
Out of these 790, 687 (87%) are significant using RI.

With heterogeneous effects as well as imbalanced treatment, RI performs even worse at killing the false positive.

What does it take to get \(p_{RI} \neq p\)?

I can get the RI p-value to differ by 0.1 (on average) when using \(N=20\), \(X \sim B(0.5)\), and \(\beta_{i} \sim\) lognormal\((0,2)\). So it takes a very small sample, highly heterogeneous treatment effects, and a binary treatment variable to generate a finite sample bias that is mitigated by randomization inference. Here’s a graph showing how \(|p-p_{RI}|\) varies with \(Var(\beta_{i})\):

So it is possible for RI p-values to diverge substantially from conventional p-values, but it requires a pretty extreme scenario.

Footnotes

See here for R code.

¹ With a properly-sized test, \(P(\)reject \(H_{0} \mid H_{0}) = \alpha\).
² Another paper that uses an RI strategy is Yao and Zhang (2015). They use RI on a three-way fixed-effects model, regressing GDP growth on leader, city, and year FEs. They give a similar rationale for randomization inference:

Our second robustness check is a placebo test that permutes leaders’ tenures. If our estimates of leader effects only picked up heteroskedastic shocks, we would have no reason to believe that these shocks would form a consistent pattern that follows the cycle of leaders’ tenures. For that, we randomly permute each leader’s tenures across years within the same city and re-estimate Eq. (1). [...] We find that the F-statistic from the true data is either Nos. 1 or 2 among the F-statistics from any round of permutation. This result gives us more confidence in our baseline results.

³ As expected, since when \(\alpha=0.05\), P(at least one significant) = \(1 -\)P(none significant) = \(1 - (1-0.05)^{20}\) = 0.64.

How to p-hack a robust result

2021-01-16T20:00:00+00:00

Economists want to show that our results are robust, like in Table 1 below: Column 1 contains the baseline model, with no covariates, and Column 2 controls for \(z\). Because the coefficient on \(X\) is stable and significant across columns, we say that our result is robust.

The twist: I p-hacked this result, using data where the true effect of \(X\) is zero.

In this post, I show that it can be easy to p-hack a robust result like this. Here’s the basic idea: first, p-hack a significant result by running regressions with many different treatment variables, where the true treatment effects are all zero. For 20 regressions, we expect to get one false positive: a result with \(p<0.05\). Then, using this significant treatment variable, run a second regression including a control variable, to see whether the result is robust to controls.

It turns out that the key to p-hacking robust results is to use control variables that have a low partial-\(R^{2}\). These variables don’t have much influence on our main coefficient when excluded from the regression, and also have little influence when included. In contrast, controls with high partial-\(R^{2}\) are more likely to kill a false positive. Lesson: high partial-\(R^{2}\) controls are an effective robustness check against false positives.

Setup

Let’s see how this works. Consider data for \(i=1, ..., N\) observations generated according to

\[\tag{1} y_{i} = \sum_{k=1}^{K} \beta_{k} X_{k,i} + \gamma z_{i} + \varepsilon_{i}.\]

We have \(K\) potential treatment variables, \(X_{1,i}\) to \(X_{K,i}\), and a control variable \(z_{i}\). I draw \(X_{k,i} \sim N(0,1)\), \(z_{i} \sim N(0,1)\), and \(\varepsilon_{i} \sim N(0,1)\), so that \(X_{k,i}\), \(z_{i}\), and \(\varepsilon_{i}\) are all independent, but could be correlated in the sample. I set \(\beta_{k}=0\) for all \(k\), so that \(X_{k}\) has no effect on \(y\), and the true model is

\[\tag{2} y_{i} = \gamma z_{i} + \varepsilon_{i}.\]

I’m going to p-hack using the \(X_{k}\)’s, running \(K\) regressions and selecting the \(k^{*}\) with the smallest p-value. I p-hack the baseline regression of \(y\) on \(X_{k}\), by running \(K\) regressions of the form

\[\tag{3} y_{i} = \alpha_{1,k} + \beta_{1,k}X_{k,i} + \nu_{i}.\]

I use the ‘1’ subscript to indicate that this is the baseline model in Column 1. Out of these \(K\) regressions, I select the \(k^{*}\) with the smallest p-value on \(\beta_{1,k}\). That is, I select the regression

\[\tag{4} y_{i} = \alpha_{1,k^{*}} + \beta_{1,k^{*}}X_{k^{*},i} + \nu_{i}.\]

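When \(K\geq 20\), \(\hat{\beta}_{1,k^{*}}\) will usually have \(p<0.05\): with a 5% significance level (i.e., false positive rate), the average number of significant results is \(20\times0.05 = 1\), and if the \(K=20\) tests are treated as independent, the probability of at least one significant result is \(1-(1-0.05)^{20} \approx 0.64\). This is our p-hacked false positive.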

To get a robust sequence of regressions, I need my full model including \(z\) to also have a significant coefficient on \(X_{k^{*},i}\). To test this, I run my Column 2 regression:

\[\tag{5} y_{i} = \alpha_{2,k^{*}} + \beta_{2,k^{*}}X_{k^{*},i} + \gamma z_{i} + \varepsilon_{i}.\]

Given that we p-hacked a significant \(\hat{\beta}_{1,k^{*}}\), will \(\hat{\beta}_{2,k^{*}}\) also be significant?
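The two-step procedure (p-hack Column 1, then add \(z\) in Column 2) can be sketched in Python as follows. This is an illustrative sketch under assumptions rather than the R code linked in the footnotes: significance is judged with the normal critical value 1.96, the Column 2 coefficient is computed by residualizing on \(z\), and the function names and small default sizes are my own choices for speed.

```python
import math
import random
import statistics


def demean(v):
    m = statistics.fmean(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def residualize(v, z):
    """Residual from regressing v on z (both already demeaned)."""
    lam = dot(v, z) / dot(z, z)
    return [a - lam * b for a, b in zip(v, z)]


def t_from_demeaned(x, y, n_params):
    """t-statistic on x, where x and y are demeaned (or residualized on controls)."""
    n = len(x)
    b = dot(x, y) / dot(x, x)
    ssr = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - n_params) / dot(x, x))
    return b / se


def one_replication(rng, n=1000, k=20, gamma=1.0):
    """Return (t-stat in Column 1, t-stat in Column 2) for the p-hacked regressor."""
    z = demean([rng.gauss(0, 1) for _ in range(n)])
    y = demean([gamma * zi + rng.gauss(0, 1) for zi in z])   # true model: no X effect
    xs = [demean([rng.gauss(0, 1) for _ in range(n)]) for _ in range(k)]
    t1 = [t_from_demeaned(x, y, 2) for x in xs]              # Column 1: y on X_k
    best = max(range(k), key=lambda j: abs(t1[j]))           # smallest p-value
    u = residualize(xs[best], z)                             # Column 2: add z
    ry = residualize(y, z)
    t2 = t_from_demeaned(u, ry, 3)
    return t1[best], t2


def robustness_shares(n_sims=50, crit=1.96, seed=0, **kwargs):
    """Share of simulations significant in Column 1, and in both columns."""
    rng = random.Random(seed)
    col1 = both = 0
    for _ in range(n_sims):
        t1, t2 = one_replication(rng, **kwargs)
        if abs(t1) > crit:
            col1 += 1
            if abs(t2) > crit:
                both += 1
    return col1 / n_sims, both / n_sims


print(robustness_shares(n_sims=20, n=300))
```

The ratio of the second share to the first is the fraction of p-hacked Column 1 results that survive controlling for \(z\).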

Homogeneous \(\beta=0\)

First, I show a case where p-hacked results are not robust. I use the data-generating process from above with \(\beta=0\).

When regressing \(y\) on \(X_{k}\) in the p-hacking step, we have

\[\tag{6} y_{i} = \alpha_{1,k} + \beta_{1,k} X_{k,i} + \nu_{i},\]

where

\[\tag{7} \begin{align} \nu_{i} &= \sum_{j \neq k}^{K} \beta_{j} X_{j,i} + \gamma z_{i} + \varepsilon_{i} \\ &= \gamma z_{i} + \varepsilon_{i}. \end{align}\]

We estimate the slope coefficient as

\[\tag{8} \hat{\beta}_{1,k} = \frac{\widehat{Cov}(X_{k},y)}{\widehat{Var}(X_{k})} = \frac{\gamma \widehat{Cov}(X_{k},z) + \widehat{Cov}(X_{k},\varepsilon)}{\widehat{Var}(X_{k})}.\]

Since \(\beta=0\), we should only find a significant \(\hat{\beta}_{1,k}\) due to a correlation between \(X_{k}\) and the components of the error term \(\nu_{i}\): (1) \(\gamma \widehat{Cov}(X_{k},z)\), and (2) \(\widehat{Cov}(X_{k},\varepsilon)\).

When \(\gamma \widehat{Cov}(X_{k},z)\) is the primary driver of \(\hat{\beta}_{1,k}\), controlling for \(z\) in Column 2 will kill the false positive.

Turning to the full regression in Column 2, we get

\[\begin{align} \tag{9} \hat{\beta}_{2,k} &= \frac{\widehat{Cov}(\hat{u},y)}{\widehat{Var}(\hat{u})} = \frac{\widehat{Cov}((X_{k} - \hat{\lambda}_{1} z),\varepsilon)}{\widehat{Var}(\hat{u})} \\ &= \frac{\widehat{Cov}(X_{k},\varepsilon) - \hat{\lambda}_{1} \widehat{Cov}(z,\varepsilon)}{\widehat{Var}(\hat{u})}. \end{align}\]

This is from the two-step Frisch-Waugh-Lovell method, where we first regress \(X_{k}\) on \(z\) (\(X_{k} = \lambda_{0} + \lambda_{1} z + u\)) and take the residual \(\hat{u} = X_{k} - \hat{\lambda}_{0} - \hat{\lambda}_{1} z\). Then we regress \(y\) on \(\hat{u}\), using the variation in \(X_{k}\) that’s not due to \(z\), and the resulting slope coefficient is \(\hat{\beta}_{2,k}\).¹ We can see that controlling for \(z\) literally removes the \(\gamma \widehat{Cov}(X_{k},z)\) term from our estimate.
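As a quick numerical check of the Frisch-Waugh-Lovell step, the following Python sketch compares the coefficient on \(X_{k}\) from the full regression (solved through the two-regressor normal equations) with the coefficient from regressing \(y\) on the residual \(\hat{u}\). The data and function names are assumptions made for illustration.

```python
import random
import statistics


def demean(v):
    m = statistics.fmean(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def coef_full(x, z, y):
    """Coefficient on x from regressing y on x and z (with intercept)."""
    x, z, y = demean(x), demean(z), demean(y)
    sxx, szz, sxz = dot(x, x), dot(z, z), dot(x, z)
    sxy, szy = dot(x, y), dot(z, y)
    # Solve the 2x2 normal equations for the x coefficient.
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)


def fwl_residual(x, z):
    """Residual u-hat from regressing x on z (with intercept)."""
    x, z = demean(x), demean(z)
    lam1 = dot(x, z) / dot(z, z)
    return [a - lam1 * b for a, b in zip(x, z)]


def coef_fwl(x, z, y):
    """Coefficient from regressing y on the residual u-hat."""
    u = fwl_residual(x, z)
    return dot(u, demean(y)) / dot(u, u)


rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [0.5 * zi + rng.gauss(0, 1) for zi in z]        # x correlated with z
y = [2.0 * zi + rng.gauss(0, 1) for zi in z]        # x has no effect on y

print(coef_full(x, z, y), coef_fwl(x, z, y))        # equal up to floating-point error
print(dot(fwl_residual(x, z), demean(z)))           # u-hat is orthogonal to z
```

The second printed value is the sample analogue of \(\widehat{Cov}(\hat{u},z)=0\), which is why the \(\gamma \widehat{Cov}(X_{k},z)\) term drops out of \(\hat{\beta}_{2,k}\).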

Hence, to p-hack robust results, we want \(\hat{\beta}_{1,k}\) to be driven by \(\widehat{Cov}(X_{k},\varepsilon)\), since that term is also in \(\hat{\beta}_{2,k}\). If we have a significant result that’s not driven by \(z\), then controlling for \(z\) won’t affect our significance.

Simulations

Setting \(K=20, N=1000\), and \(\gamma=1\), I perform \(1000\) replications of the above procedure: I run 20 regressions, select the most significant \(X_{k^{*}}\) and record the p-value on \(\hat{\beta}_{1,k^{*}}\), then add \(z\) to the regression and record the p-value on \(\hat{\beta}_{2,k^{*}}\). As expected when using a \(5\%\) significance level, I find that out of the \(K\) regressions in the p-hacking step, the average share of significant results is \(0.05\). I find that \(\hat{\beta}_{1,k^{*}}\) is significant in 663 simulations (=66%). But only 245 simulations (=25%) have both a significant \(\hat{\beta}_{1,k^{*}}\) and a significant \(\hat{\beta}_{2,k^{*}}\), meaning that only 37% (=245/663) of p-hacked Column 1 results have a significant Column 2. So in the \(\beta=0\) case, we infer that \(\widehat{Cov}(X_{k},\varepsilon)\) is small relative to \(\gamma \widehat{Cov}(X_{k},z)\). With these parameters, it’s not easy to p-hack robust results.

Figure 1 repeats this process for a range of \(\gamma\)’s. I plot the shares of \(\gamma \widehat{Cov}(X_{k},z)\) and \(\widehat{Cov}(X_{k},\varepsilon)\) in \(\hat{\beta}_{1,k}\).² We see that when \(\gamma=0\), \(\gamma \widehat{Cov}(X_{k},z)\) has 0 weight, but its share increases quickly. Closely correlated with this share is the fraction of significant results losing significance after controlling for \(z\). Specifically, this is the fraction of simulations with a nonsignificant \(\hat{\beta}_{2,k}\), out of the simulations with a significant \(\hat{\beta}_{1,k}\). And even more tightly correlated with \(\gamma \widehat{Cov}(X_{k},z)\) is the partial \(R^{2}\) of \(z\).³ Intuitively, as \(\gamma\) increases, the additional improvement in model fit from adding \(z\) also increases, which by definition increases \(R^{2}(z)\). Hence, \(R^{2}(z)\) turns out to be a useful proxy for the share of \(\gamma \widehat{Cov}(X_{k,i},z_{i})\), which we can’t calculate in practice. Lesson: when partial-\(R^{2}(z)\) is large, controlling for \(z\) is an effective robustness check for false positives. This is because a large \(\gamma \widehat{Cov}(X_{k},z)\) implies both (1) a large \(R^{2}(z)\); and (2) that \(z\) is more likely to be the source of the false positive, and hence controlling for \(z\) will kill it. So now we have a new justification for including control variables, apart from addressing confounders: to rule out false positives driven by coincidental sample correlations.
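For reference, partial \(R^{2}(z)\) as used here (the proportional reduction in the sum of squared residuals from adding \(z\)) can be computed as in this Python sketch. The helper names and toy data are assumptions; the full-model residuals are obtained by residualizing on \(z\) first.

```python
import random
import statistics


def demean(v):
    m = statistics.fmean(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def residualize(v, w):
    """Residual from regressing v on w (with intercept)."""
    v, w = demean(v), demean(w)
    lam = dot(v, w) / dot(w, w)
    return [a - lam * b for a, b in zip(v, w)]


def partial_r2(x, z, y):
    """Proportional reduction in SSR from adding z to a regression of y on x."""
    base = residualize(y, x)                 # residuals from y ~ x
    ssr_base = dot(base, base)
    u = residualize(x, z)                    # x with z partialled out
    ry = residualize(y, z)                   # y with z partialled out
    full = residualize(ry, u)                # residuals from y ~ x + z
    ssr_full = dot(full, full)
    return (ssr_base - ssr_full) / ssr_base


rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
eps = [rng.gauss(0, 1) for _ in range(n)]
for gamma in (0.0, 0.5, 1.0, 2.0):
    y = [gamma * zi + e for zi, e in zip(z, eps)]
    print(gamma, round(partial_r2(x, z, y), 3))
```

In this toy setup the population value is \(\gamma^{2}/(\gamma^{2}+1)\), so partial \(R^{2}(z)\) rises with \(\gamma\), matching the intuition in the paragraph above.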

Heterogeneous \(\beta_{i} \sim N(0,1)\)

However, you might think that \(\beta=0\) is not a realistic assumption. As Gelman says: “anything that plausibly could have an effect will not have an effect that is exactly zero.” So let’s consider the case of heterogeneous \(\beta_{i}\), where each individual \(i\) has their own effect drawn from \(N(0,1)\). For large \(N\), the average effect of \(X\) on \(y\) will be 0, but this effect will vary by individual. This is a more plausible assumption than \(\beta\) being uniformly 0 for everyone. And as we’ll see, this also helps for p-hacking, by increasing the variance of the error term.

Here we have data generated according to

\[\tag{10} y_{i} = \sum_{k=1}^{K} \beta_{k,i} X_{k,i} + \gamma z_{i} + \varepsilon_{i},\]

where \(\beta_{k,i} \sim N(0,1)\).

Then, when regressing \(y\) on \(X_{k}\), we have

\[\tag{11} y_{i} = \alpha_{1,k} + \delta_{1,k} X_{k,i} + v_{i},\]

where

\[\tag{12} v_{i} = -\delta_{1,k} X_{k,i} + \beta_{k,i} X_{k,i} + \sum_{j \neq k}^{K} \beta_{j,i} X_{j,i} + \gamma z_{i} + \varepsilon_{i}.\]

When effects are heterogeneous (i.e., we have \(\beta_{k,i}\) varying with \(i\)), a regression model with a constant slope \(\delta_{1,k}\) is misspecified. To emphasize this, I include \(-\delta_{1,k} X_{k,i}\) in the error term.⁴

The estimated slope coefficient is

\[\tag{13} \begin{align} \hat{\delta}_{1,k} &= \frac{\widehat{Cov}(X_{k,i},y_{i})}{\widehat{Var}(X_{k,i})} \\ &= \frac{\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i}) + \gamma \widehat{Cov}(X_{k,i},z_{i}) + \widehat{Cov}(X_{k,i},\varepsilon_{i})}{\widehat{Var}(X_{k,i})} \end{align}\]

From Aronow and Samii (2015), we know that the slope coefficient converges to a weighted average of the \(\beta_{k,i}\)’s:

\[\tag{14} \hat{\delta}_{1,k} \rightarrow \frac{E[w_{i} \beta_{k,i}]}{E[w_{i}]},\]

where \(w_{i}\) are the regression weights: the squared residuals from regressing \(X_{k}\) on the other controls. In this case, as we’re using a univariate regression, the residuals are simply demeaned \(X_{k}\) (when regressing \(X\) on a constant, the fitted value is \(\bar{X}\)), so the weights are \(w_{i} = (X_{k,i} - \bar{X}_{k})^{2}\).

Because \(\beta_{k,i} \sim N(0,1)\), we have \(E[w_{i} \beta_{k,i}]=0\) and hence \(\hat{\delta}_{1,k}\) converges to 0. So any statistically significant \(\hat{\delta}_{1,k}\) that we estimate will be a false positive.

There are three terms that make up \(\hat{\delta}_{1,k}\) and could drive a false positive: (1) \(\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i})\), (2) \(\gamma \widehat{Cov}(X_{k},z)\), and (3) \(\widehat{Cov}(X_{k},\varepsilon)\).

Now we have a new source of false positives, case (1), due to heterogeneity in \(\beta_{k,i}\). Note that controlling for \(z\) will only affect one out of three possible drivers, so now we should expect our false positives to be more robust to control variables, compared to when \(\beta=0\). To see this, note that when controlling for \(z\) in the full regression, we have

\[\tag{15} \begin{align} \hat{\delta}_{2,k} &= \frac{\widehat{Cov}(\hat{u}_{i},y_{i})}{\widehat{Var}(\hat{u}_{i})} \\ &= \frac{\sum_{j=1}^{K} \widehat{Cov}(X_{k,i} - \hat{\lambda}_{1} z_{i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(X_{k,i} - \hat{\lambda}_{1} z_{i},\varepsilon_{i})}{\widehat{Var}(\hat{u}_{i})} \\ &= \frac{\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(X_{k,i},\varepsilon_{i})}{\widehat{Var}(\hat{u}_{i})} \\ &- \hat{\lambda}_{1} \frac{\left[ \sum_{j=1}^{K} \widehat{Cov}(z_{i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(z_{i},\varepsilon_{i}) \right] }{\widehat{Var}(\hat{u}_{i})} \end{align}\]

Here \(\hat{u}\) is the residual from a regression of \(X_{k}\) on \(z\): \(X_{k} = \lambda_{0} + \lambda_{1} z + u\). We obtain \(\hat{\delta}_{2,k}\) by regressing \(y\) on \(\hat{u}\), via FWL, and using the variation in \(X_{k}\) that’s not due to \(z\).

Comparing \(\hat{\delta}_{1,k}\) to \(\hat{\delta}_{2,k}\), we see that \(\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(X_{k,i},\varepsilon_{i})\) shows up in both estimates. Hence, if our p-hacking selects for a \(\hat{\delta}_{1,k}\) with a large value of these terms, we’re also selecting for the majority of the components of \(\hat{\delta}_{2,k}\). In contrast to the \(\beta=0\) case, now we should expect \(\gamma \widehat{Cov}(X_{k,i},z_{i})\) to be dominated, and significance in Column 1 should carry over to Column 2.

Simulations

I repeat the same procedure as before, running \(K=20\) regressions of \(y\) on \(X_{k}\), taking the \(X_{k}\) with the smallest p-value, \(X_{k^{*}}\), and then running another regression that adds \(z\). Again, I use \(\gamma=1\) and perform \(1000\) replications. Here I use robust standard errors to address heteroskedasticity.
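To see why robust standard errors matter in this setting, here is a small Python sketch comparing classical and heteroskedasticity-robust standard errors in a univariate regression with heterogeneous effects. The post does not say which robust variant is used, so the HC0 formula below is an assumption, as are the function names and the single-regressor toy data.

```python
import math
import random
import statistics


def slope_with_ses(x, y):
    """OLS slope of y on x with classical and HC0 (assumed variant) standard errors."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    xd = [a - mx for a in x]
    sxx = sum(a * a for a in xd)
    slope = sum(a * (b - my) for a, b in zip(xd, y)) / sxx
    resid = [(b - my) - slope * a for a, b in zip(xd, y)]
    # Classical: assumes a constant error variance.
    se_classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0: lets the error variance differ across observations.
    se_hc0 = math.sqrt(sum((a * e) ** 2 for a, e in zip(xd, resid))) / sxx
    return slope, se_classical, se_hc0


# Toy data: y_i = beta_i * x_i + eps_i with beta_i ~ N(0,1), so the average effect is 0
# but the regression error (beta_i * x_i + eps_i) has variance that grows with x_i^2.
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) * xi + rng.gauss(0, 1) for xi in x]

slope, se_c, se_r = slope_with_ses(x, y)
print("slope:", slope)
print("classical SE:", se_c, " robust SE:", se_r)
```

In this sketch the robust standard error is larger than the classical one, so classical standard errors would overstate significance when effects are heterogeneous.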

I find that \(\hat{\delta}_{1,k^{*}}\) is significant in 650 simulations (=65%). But this time, 569 simulations (=57%) have both a significant \(\hat{\delta}_{1,k^{*}}\) and a significant \(\hat{\delta}_{2,k^{*}}\). So 88% (=569/650) of p-hacked Column 1 estimates also have a significant Column 2. Compare this to 37% in the \(\beta=0\) case. That’s what I call p-hacking a robust result! We infer that \(\gamma \widehat{Cov}(X_{k,i},z_{i})\) is too small relative to the other components for its presence or absence to affect our estimates very much.

To illustrate how \(\hat{\delta}_{1,k}\) is determined, I plot the shares of its three constituent terms while varying \(\gamma\).⁵

As shown in Figure 2, when \(\gamma\) is small, most of the weight in \(\hat{\delta}_{1,k}\) is from \(\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i})\), indicating that its \(K\) terms provide ample opportunity for correlations with \(X_{k^{*},i}\). But as \(\gamma\) increases, this share falls, while the share of \(\gamma \widehat{Cov}(X_{k,i},z_{i})\) rises linearly. The share of \(\widehat{Cov}(X_{k,i},\varepsilon_{i})\) is small and decreases slightly. Looking at robustness, we see that the fraction of significant results losing significance rises much more slowly than in the \(\beta=0\) case. And we again see a tight link between partial-\(R^{2}(z)\) and the share of \(z\) in \(\hat{\delta}_{1,k}\).⁶

Overall, we can see why controlling for \(z\) is less effective with heterogeneous effects: \(\hat{\delta}_{1,k}\) is mostly not determined by \(\gamma \widehat{Cov}(X_{k,i},z_{i})\), so removing it (by controlling for \(z\)) has little effect. In other words, when variables have low partial-\(R^{2}\), controlling for them won’t affect false positives.

Conclusion

In general, economists think about robustness in terms of addressing potential confounders. I haven’t seen any discussion of robustness to false positives based on coincidental sample correlations. This is possibly because it seems hopeless: we always have a 5% false positive rate, after all. But as I’ve shown, adding high partial-\(R^{2}\) controls is an effective robustness check against p-hacked false positives.⁷ So we have a new weapon to combat false positives: checking whether a result remains significant as high partial-\(R^{2}\) controls are added to the model.

Footnotes

See here for R code. Click here for a pdf version of this post.

¹ \(\widehat{Cov}(\hat{u},y) = \widehat{Cov}(\hat{u},\gamma z + \varepsilon) = \gamma \widehat{Cov}(\hat{u},z) + \widehat{Cov}(\hat{u},\varepsilon) = 0 + \widehat{Cov}(\hat{u},\varepsilon)\), since the residual \(\hat{u}\) is orthogonal to \(z\).
² Note that these terms can be negative, so this is not strictly a share in \([0,1]\). When the terms in the denominator almost cancel out to 0, we get extreme values. Hence, for each \(\gamma\), I take the median share across all simulations, which is well-behaved.
³ \(R^{2}(z) = \frac{\sum \hat{u}_{i}^{2} - \sum \hat{v}_{i}^{2}}{\sum \hat{u}_{i}^{2}}\), where \(\hat{u}_{i}\) is the residual from the baseline model, and \(\hat{v}_{i}\) is the residual from the full regression (where we control for \(z\)). In other words, partial \(R^{2}(z)\) is the proportional reduction in the sum of squared residuals from adding \(z\) to the model. See also the R function and Wikipedia.
⁴ We could write \(\beta_{k,i} = \bar{\beta}_{k} + (\beta_{k,i}-\bar{\beta}_{k}) := b_{k} + b_{k,i}\), and then have \(y_{i} = \alpha_{1,k} + b_{k} X_{k,i} + v_{i}\), with \(v_{i} = b_{k,i}X_{k,i} + \sum_{j \neq k}^{K} \beta_{j,i} X_{j,i} + \gamma z_{i} + \varepsilon_{i}\). However, \(\hat{b}_{k}\) does not generally converge to \(b_{k} = \bar{\beta}_{k}\), as I discuss below.
⁵ Similar results hold when varying \(Var(\beta_{i})\) or \(Var(\varepsilon)\).
⁶ Note that the overall \(R^{2}\) in Column 1 is irrelevant. For \(\alpha=0.05\), we will always have a false positive rate of 5% when the null hypothesis is true. Controlling for \(z\) is effective when \(\gamma \widehat{Cov}(X_{k,i},z_{i})\) has a large share in \(\hat{\delta}_{1,k}\). And a large share also means that \(R^{2}(z)\) is large. This is true whether the overall \(R^{2}\) is 0.01 or 0.99, since partial \(R^{2}\) is defined in relative terms, as the decrease in the sum of squared residuals relative to a baseline model.
⁷ Note that this holds regardless of which regression is p-hacked. Here, I’ve p-hacked the baseline regression. But the results are actually identical when you work backwards, p-hacking the full regression and then excluding \(z\).

Why we shouldn't take p-values literally

2020-11-04T23:20:00+00:00

PSA: if you read a paper claiming p<0.01, you shouldn’t automatically take that p-value literally.

By definition, a p-value is the probability of getting a result at least as extreme as the one in your sample, assuming the null hypothesis H0 is true. In other words, assuming H0 is true, if you collected many more samples, and ran the same testing procedure, you’d expect a fraction p of the results to be at least as extreme as the original result. Notice: this requires running the exact same test on every sample.

This is equivalent to preregistering your test. If, instead, you first looked at the data before designing your test (to check for correlations, try different dependent variables, or even engage in outright p-hacking), you would violate the definition. In this data-dependent testing procedure, for every new sample you collect, you look at the data and then choose a test. That is, for every new sample you would potentially run a different test, which contradicts the requirement to run the same test.
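A small simulation makes the contrast concrete. The Python sketch below, under assumptions of my own (five candidate dependent variables, all pure noise, and a normal critical value of 1.96), compares a researcher who always tests the first outcome with one who looks at the data and reports whichever outcome is most significant.

```python
import math
import random
import statistics


def t_stat(x, y):
    """t-statistic on the slope in a univariate OLS regression of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    ssr = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    return slope / math.sqrt(ssr / (n - 2) / sxx)


def false_positive_rates(n_sims=1000, n=50, n_outcomes=5, crit=1.96, seed=0):
    """Rejection rates under H0 for a fixed test vs. a data-dependent choice of test."""
    rng = random.Random(seed)
    fixed = forking = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        outcomes = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_outcomes)]
        t_values = [abs(t_stat(x, y)) for y in outcomes]
        if t_values[0] > crit:      # same test on every sample (preregistered)
            fixed += 1
        if max(t_values) > crit:    # pick the best-looking outcome after seeing the data
            forking += 1
    return fixed / n_sims, forking / n_sims


fixed_rate, forking_rate = false_positive_rates(n_sims=500)
print("fixed test:", fixed_rate, " data-dependent test:", forking_rate)
```

The fixed test rejects at roughly the nominal rate, while the data-dependent procedure rejects far more often, so a reported p<0.05 from the second procedure does not mean what the definition says it means.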

Hence, the burden of proof lies with the researcher to convince us that, given a new sample, they would run the same test. Preregistration is the easiest way to do this: it guarantees that the researcher chose their test before looking at the data. But if the hypothesis is not preregistered, and looks like the garden of forking paths, then we should feel free to reject their p-value as invalid.

For example, suppose a paper reports a null average effect and a poorly motivated p<0.01 interaction effect. This looks suspiciously like they started out expecting an average effect, didn’t find one, so scrambled to find something, anything with three stars, then made up a rationalization for why that interaction was what they were planning all along. But obviously this is data-dependent testing. We have little reason to believe that the researcher would run the same test on other samples. Hence, we also have little reason to take the p-value seriously.

Lesson: consumers of non-preregistered research need to do more than just look at the p-value. We also have to make a judgment call about whether that p-value is well-defined.
