<p>The blog of Michael Wiebe, http://michaelwiebe.com/ (2022-04-04T01:06:37+00:00)</p>
<p>Can we detect the effects of racial violence on patenting? Replicating Cook (2014)<br />Michael Wiebe, 2022-04-03T19:00:00+00:00, http://michaelwiebe.com/blog/2022/04/cook_replication</p>
<p>A year ago, I wrote a <a href="https://michaelwiebe.com/blog/2021/02/cook_violence">short post</a> looking at the data in <a href="https://link.springer.com/article/10.1007/s10887-014-9102-z">Cook (2014)</a> (<a href="https://twitter.com/sci_hub_">sci-hub</a>) (<a href="https://link.springer.com/article/10.1007/s10887-014-9102-z#Sec20">replication files</a>) on the effect of racial violence on African American patents over 1870-1940.
I discovered that the state-level panel data was strikingly imbalanced.
With Lisa Cook in the news for being nominated to the Federal Reserve Board of Governors, I decided to revisit the paper more thoroughly.
I find that the main time series result is not robust, and provide evidence that the panel data results are too noisy to be trusted.</p>
<h2 id="time-series-regressions">Time series regressions</h2>
<p>Cook has two measures of patents per year: (1) using the year the patent was applied for, and (2) using the year the patent was granted.
In the paper, Figure 1 reports Black (and white) patents per million using grant year, while Figure 2 shows Black patents per million using application year.
Comparing the two graphs, we immediately see that the scale differs by a factor of about 10.
Here I merge the two datasets and plot the application-year and grant-year variables on the same graph.</p>
<p><img src="https://michaelwiebe.com/assets/cook_replication/fig_1_2.png" alt="" width="100%" /></p>
<p>There is a huge discrepancy between the two patent variables.
Cook collected data on 726 patents over 1870-1940, but the average by grant-year is 0.16, while the average by application-year is 1.22.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<!-- But even if the actual variable is 'patents per million by grant year', why is there a discrepancy between grant-year and application-year? Recall that the average values are 0.16 and 1.22. -->
<p>Cook’s replication data does not include the raw patent or population variables, so we can’t say for sure what’s going on here.
But the average <a href="https://www.census.gov/content/dam/Census/library/working-papers/2002/demo/POP-twps0056.pdf">Black population</a> (see Table 1) was roughly 10 million, and 0.16 grant-year patents/M * 10M * 71 years = 114, far fewer than the 726 patents recorded.
In contrast, 1.22 application-year patents/M * 10M * 71 years = 866, which is in the ballpark of 726.
Speculating, one possible explanation is that Cook calculated grant-year patents using the white population (average 75 million) in the denominator, giving 0.16 * 75 * 71 = 852 patents.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
Hopefully Cook will publish the raw data and we can resolve this.</p>
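<p>The back-of-the-envelope arithmetic above can be reproduced in a few lines. This is only a sketch: the population averages (10 million Black, 75 million white) are the rough figures quoted in the text, and the function and variable names are illustrative, not taken from Cook’s replication files.</p>

```python
YEARS = 71              # 1870-1940 inclusive
RECORDED_PATENTS = 726  # total patents Cook reports collecting

def implied_patents(rate_per_million, population_millions, years=YEARS):
    """Total patents implied by an average annual rate per million people."""
    return rate_per_million * population_millions * years

# Average rates from the replication data; populations are rough averages.
grant_black = implied_patents(0.16, 10)              # grant-year, Black population
app_black = implied_patents(1.22, 10)                # application-year, Black population
grant_white_denominator = implied_patents(0.16, 75)  # grant-year, white population

for label, value in [
    ("grant-year, Black denominator", grant_black),
    ("application-year, Black denominator", app_black),
    ("grant-year, white denominator", grant_white_denominator),
]:
    print(f"{label}: {value:.0f} implied vs {RECORDED_PATENTS} recorded")
```

<p>Only the first combination lands far from 726, which is what motivates the speculation about the denominator.</p>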
<p>In any case, the grant-year patent variable seems clearly flawed, while the application-year variable looks correct.
Since the Table 6 results use the grant-year patent variable, we should run a robustness check using the application-year variable.</p>
<p>Table 6 uses time series data to estimate the effect of lynchings, riots, and segregation laws on patents.
Column 1 uses race-year panel data, where the lynching and patent variables vary by race (but the riot and segregation law variables vary only by time).
Columns 2 and 3 run time series regressions separately by race, allowing us to estimate differential effects of racial violence on patenting.</p>
<p>I am able to reproduce Table 6<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>, using grant-year patents:</p>
<p><img src="https://michaelwiebe.com/assets/cook_replication/table6a.png" alt="" width="85%" /></p>
<p>As noted in the paper, lynchings and riots have negative effects on Black patenting, and the 1921 dummy has a large negative effect, corresponding to the Tulsa Race Riot.</p>
<p>For the robustness check, I redo Table 6 using application-year patents instead of grant-year patents.
This specification actually seems more appropriate: Cook’s mechanism is that racial violence deters innovation by Black inventors, so violence would first affect patent <em>applications</em>, and only with a lag affect <em>granted</em> patents. The effects should therefore be stronger using the application-year variable.</p>
<p>The application-year variable is missing in 1940, which reduces the sample size for the robustness check by 1. To make a pure comparison, I re-run the grant-year regressions dropping 1940, and get similar results (see footnote<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>).
Next, I run Table 6 using application-year patents:</p>
<p><img src="https://michaelwiebe.com/assets/cook_replication/table6c.png" alt="" width="85%" /></p>
<p>The results are dramatically different: the negative effect of lynchings and riots disappears, as does the negative effect in 1921.
If the grant-year patent variable is incorrect and the application-year variable is correct, then the paper’s main result is wrong.</p>
<h2 id="panel-data-regressions">Panel data regressions</h2>
<p>In Tables 7 and 8, Cook uses state-level panel data over 1870-1940 to run regressions of patents on lynching rates, riots, and segregation laws.
However, we can immediately see a problem: there are 49 states and 71 years in the data, but only N=430 observations. A complete, balanced panel would have 3210 observations, as the number of states grows from 38 in 1870 to 49 in 1940 (including DC; see <a href="https://github.com/maswiebe/metrics/blob/main/cook_replication.do">code</a> for details).
So Cook is using 430/3210 = 13% of the full sample.</p>
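<p>One way to arrive at the size of the complete panel is to count each unit from the year it enters. The sketch below assumes that the eleven late entrants are counted from their statehood year (entry year included) and that DC is present throughout. That counting rule is an assumption made here for illustration; the linked do-file is the authoritative source.</p>

```python
FIRST_YEAR, LAST_YEAR = 1870, 1940
N_UNITS = 49  # 48 states plus DC by 1940

# Assumption: units entering after 1870, keyed to their statehood year.
LATE_ENTRY_YEARS = {
    "Colorado": 1876,
    "North Dakota": 1889,
    "South Dakota": 1889,
    "Montana": 1889,
    "Washington": 1889,
    "Idaho": 1890,
    "Wyoming": 1890,
    "Utah": 1896,
    "Oklahoma": 1907,
    "New Mexico": 1912,
    "Arizona": 1912,
}

def complete_panel_size():
    """Number of state-year cells if every unit is observed from entry to 1940."""
    n_years = LAST_YEAR - FIRST_YEAR + 1
    always_present = N_UNITS - len(LATE_ENTRY_YEARS)
    total = always_present * n_years
    for entry_year in LATE_ENTRY_YEARS.values():
        total += LAST_YEAR - entry_year + 1
    return total

full_panel = complete_panel_size()
coverage = 430 / full_panel
print(full_panel, f"{coverage:.1%}")
```

<p>Under this counting rule the total matches the 3210 figure above, so the 430 observations cover roughly 13% of the possible state-years.</p>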
<p>And the pattern of missing data is not random.
Below I plot the number of observations by state and year.
First, we see that the majority of states have fewer than 10 observations over 71 years.
<img src="https://michaelwiebe.com/assets/cook_replication/obs_state.png" alt="" width="100%" /></p>
<p>Next, the sample size is increasing up to 1900 before dropping off and rising again starting in 1920.
<img src="https://michaelwiebe.com/assets/cook_replication/obs_year.png" alt="" width="100%" /></p>
<p>Decomposing by region, we see that the Midwest and Mid-Atlantic regions are relatively overrepresented, while the South and West are relatively underrepresented.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>
<img src="https://michaelwiebe.com/assets/cook_replication/obs_region.png" alt="" width="100%" /></p>
<p>Moreover, consider how this imbalanced panel compares to the full time series.
There are 726 patents in the time series, and 702 in the panel data (for 97% coverage).
But the violence variables are drastically under-reported: there are 35 riots in the time series data, but only 5 in the panel data (14%).
Similarly, there are 290 new segregation laws in the time series data, but only 19 in the panel data (7%).<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>
(The same problem applies with lynchings, but the replication files don’t have count data, so we can’t quantify it.)</p>
<p>What explains the missing data? It appears that Cook dropped any state-year observation that had a variable with a missing value. The resulting dataset has no variables with missing values, but a lot of missing state-year observations, and hence a severely imbalanced panel.</p>
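<p>To see how this kind of listwise deletion produces an imbalanced panel, here is a toy sketch. All values and column names are made up, and whether Cook’s dataset was actually built this way is only a conjecture.</p>

```python
# Toy state-year panel; None marks a missing value. All values are invented.
rows = [
    {"state": "A", "year": 1900, "patents": 1.0, "lynchings": 0.2, "firms": 3.1},
    {"state": "A", "year": 1901, "patents": 0.0, "lynchings": None, "firms": 3.0},
    {"state": "A", "year": 1902, "patents": 2.0, "lynchings": 0.1, "firms": None},
    {"state": "B", "year": 1900, "patents": None, "lynchings": 0.0, "firms": 1.2},
    {"state": "B", "year": 1901, "patents": 0.0, "lynchings": 0.0, "firms": 1.3},
    {"state": "B", "year": 1902, "patents": 1.0, "lynchings": 0.0, "firms": 1.1},
]

def drop_incomplete(rows):
    """Keep only rows in which every variable is observed (listwise deletion)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def obs_per_state(rows):
    """Count the remaining observations for each state."""
    counts = {}
    for r in rows:
        counts[r["state"]] = counts.get(r["state"], 0) + 1
    return counts

complete = drop_incomplete(rows)
print(obs_per_state(rows))      # balanced before deletion
print(obs_per_state(complete))  # unequal counts after deletion
```

<p>No variable has a missing value in the result, but the states now contribute different numbers of years, which is the pattern in the plots above.</p>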
<p>With this low level of data coverage, I’m skeptical of the panel data results in Tables 7 and 8.
It’s possible that these results are unbiased, and would remain stable as the missing data was filled in (through a law of large numbers argument). Especially considering the prior plausibility that racial violence and patents are negatively correlated, we should place some weight on this.</p>
<p>But it’s also possible that they’re false positives. And statistically significant results are easy to get when you’re working with small effects and noisy data.
For example, let’s check for heterogeneous effects by region; a robust result should be stable across different cuts of the data.
From Table 7, I run the Column 1 regression separately for each region:</p>
<p><img src="https://michaelwiebe.com/assets/cook_replication/table7_region.png" alt="" width="100%" /></p>
<p>The lynchings estimate for the South (-0.075) is similar to the average effect from the full sample (-0.058).
But there’s no estimate at all for the Midwest and Northeast, since there were zero lynchings in those regions.
The estimate for the Mid-Atlantic is huge with two stars, 200x bigger than the South estimate.
But this is almost certainly a <a href="https://cran.r-project.org/web/packages/retrodesign/vignettes/Intro_To_retrodesign.html">Type M</a> error (an overestimate of the true effect), as the lynching rate for the Mid-Atlantic is 3% of the average.</p>
<p>With only 5 riots in the dataset, it’s no surprise that there’s no estimate for the Midwest, Northeast, or West regions (which had zero riots in this data). The effect size is somewhat similar for the South and Mid-Atlantic, perhaps indicating a more homogeneous effect of riots on patenting.</p>
<p>For segregation laws, the Table 7, Column 1 estimate is -0.1. The effects for the South and West are in the ballpark, at -0.19 and -0.16. But the effects for the Midwest and Mid-Atlantic are positive, massive, and have three stars!
Statistical significance doesn’t mean much here, however, because these estimates rest on almost no variation: there are 19.33 new segregation laws in the data, with 17 in the South, 1 in the Midwest, 1 in the West, and 0.33 in the Mid-Atlantic (presumably a data error).</p>
<p>Another way to assess noisy data is to decompose the patenting variable by economic category.
In fact, Cook does this in Table 8, running separate regressions for assigned patents (e.g., the patentee sells their patent to a firm), mechanical patents, and electrical patents (note that mechanical and electrical patents can be assigned or not).<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></p>
<p><img src="https://michaelwiebe.com/assets/cook_replication/table8_original.png" alt="" />
<!-- {:width="60%"} --></p>
<p>(For comparison, the Table 7, column 1 estimates are: lynchings -0.058***, riots -0.429***, segregation laws -0.1.)
The lynching estimates are much smaller than in Table 7, and none are statistically significant.<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>
The riot estimates have the same sign and similar magnitude only for assigned patents.
For segregation laws, the coefficient has the opposite sign for assigned, double the magnitude for mechanical, and half the magnitude for electrical patents.
Overall, there is strong heterogeneity in the effects of racial violence, and Cook does not provide a theory to predict the pattern of varying estimates.
This heterogeneity is more consistent with noise than a clear causal effect.</p>
<p>My takeaway from these subsample results is that the missing data is causing low statistical power, and we’re seeing <a href="https://cran.r-project.org/web/packages/retrodesign/vignettes/Intro_To_retrodesign.html">Type S and Type M errors</a>.
Hence, we shouldn’t place much weight on the correlations in Tables 7 and 8, since they would probably change considerably if we had a complete and balanced panel.</p>
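<p>A small simulation illustrates why low power produces Type S (wrong sign) and Type M (exaggerated magnitude) errors. The true effect and standard error below are hypothetical, chosen only so that power is low; they are not estimates from Cook’s data, and the function is a bare-bones stand-in for what the retrodesign package computes.</p>

```python
import random
import statistics

def retrodesign_sim(true_effect, std_error, n_sims=100_000, z_crit=1.96, seed=0):
    """Simulate estimates and summarize the statistically significant ones."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:
            significant.append(estimate)
    power = len(significant) / n_sims
    # Type S: share of significant estimates with the wrong sign.
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Type M: average exaggeration of significant estimates.
    type_m = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m

# Hypothetical design: a small negative effect measured with a large standard error.
power, type_s, type_m = retrodesign_sim(true_effect=-0.05, std_error=0.10)
print(f"power = {power:.2f}, Type S rate = {type_s:.2f}, exaggeration ratio = {type_m:.1f}")
```

<p>In a design like this, the estimates that clear the significance bar overstate the true effect several times over, and a non-trivial share have the wrong sign.</p>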
<h2 id="conclusion">Conclusion</h2>
<p>To summarize, the main time series result in Cook (2014) is not robust to using an alternative patent variable, and the panel data results are questionable because of missing data.
Nonetheless, the conclusions remain plausible, because they have a high prior probability. Lynchings, race riots, and segregation laws were a severe problem, and it would be astonishing if they didn’t have pervasive effects on the lives of Black people.</p>
<p>But with the data available, it’s unrealistic to think we can statistically detect causal effects. Credible causal inference would require more complete data as well as an identification strategy more convincing than a panel regression (not to mention modelling temporal and spatial spillovers). Descriptive analysis is the most that this dataset can support, and is a valuable contribution in itself, along with the rich qualitative evidence in the paper.</p>
<p>Cook deserves credit for pursuing this important research question and putting in years of effort to collect the patent data.
And in fact, recent research, no doubt inspired by Cook, does find that <a href="https://academic.oup.com/qje/advance-article-abstract/doi/10.1093/qje/qjab040/6412549">segregation</a> (of the federal government by Woodrow Wilson) and <a href="https://www.nber.org/papers/w28985">riots</a> (specifically, the Tulsa Race Massacre) had substantial negative effects on Black Americans.
I hope that <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3712547">more</a> <a href="https://www.aeaweb.org/articles?id=10.1257/app.20190549">researchers</a> continue in Cook’s footsteps and bring attention to the consequences of America’s racist history.
<!-- Hopefully her example [can](https://academic.oup.com/qje/advance-article-abstract/doi/10.1093/qje/qjab040/6412549) [inspire](https://www.aeaweb.org/articles?id=10.1257/app.20190549) [more](https://www.nber.org/papers/w28985) [researchers](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3712547) to build upon this work and --></p>
<hr />
<p>In terms of computational reproducibility, Cook’s code has several problems:</p>
<ul>
<li>The code for Figures 1, 2, and 3 is in Stata graph editor format, which cannot be run from a do-file.</li>
<li>Figure 1 uses the variable <code class="language-plaintext highlighter-rouge">patgrntpc</code>, patents by grant-year per capita, but the graph refers to patents per million. Similarly, Table 5 reports ‘Patents, per million’, but the code uses <code class="language-plaintext highlighter-rouge">patgrntpc</code>. The variable should be named ‘patents by grant-year per million’.</li>
<li>There’s no code for Table 4.</li>
<li>Equation 1 and Table 6 refer to patents per capita, but the variable in the code, <code class="language-plaintext highlighter-rouge">patgrntpc</code>, has mean values of 0.16 for Blacks and 425 for whites; this is patents per million, not per capita.</li>
<li>The code for Table 6 refers to a variable <code class="language-plaintext highlighter-rouge">LMRindex</code>, but the dataset contains <code class="language-plaintext highlighter-rouge">DLMRindex</code>.</li>
<li>Section 3.2 mentions that the state-level regressions use data over 1882-1940, but the code uses data over 1870-1940.</li>
<li>The code for Table 7 includes a command to collapse the data down to the state-year level, but the data is already in a state-year panel.</li>
<li>The code for Table 7 includes a variable, <code class="language-plaintext highlighter-rouge">estbnumpc</code>, for the number of firms per capita, but it is not included in the dataset.</li>
<li>The code for Column 1 in Table 7 includes the ‘number of firms’ variable, but the paper only includes it in columns 3-6.</li>
<li>In the notes to Tables 7 and 8, Cook writes that “Standard errors robust to clustering on state and year are in parentheses.” However, the code only clusters by state, using <code class="language-plaintext highlighter-rouge">vce(cl stateno)</code>.</li>
<li>The code for Table 8 has an error in its clustering command, using the incorrect syntax <code class="language-plaintext highlighter-rouge">vce(stateno)</code> instead of the correct <code class="language-plaintext highlighter-rouge">vce(cl stateno)</code>.</li>
<li>The code for Table 8 does not exactly reproduce the results in the paper. When I run the code, I get N=429, while Cook’s regressions have N=428. It’s possible that Cook is controlling for firms per capita, as in Table 7, but this variable is not included in the code, and is not mentioned in the table.</li>
<li>The code for Table 9 does not reproduce the results in the paper.</li>
</ul>
<p>There are also a few data errors:</p>
<ul>
<li>State 9 has the South dummy equal to 1 for all years, but also has the Mid-Atlantic dummy equal to 0.33 in 1888.</li>
<li>State 14 has the Midwest dummy equal to 1 in all years except 1886, when both it and the South dummy are 0.5.</li>
<li>State 31 in 1909 has a value of 0.333333 for ‘number of new segregation laws’, which should be integer-valued.</li>
</ul>
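<p>Errors of this kind are easy to catch with a few range checks. The sketch below uses hypothetical column names and contains only the three state-years listed above; any values not reported in the text are simply left out.</p>

```python
def find_data_errors(rows, dummy_cols, count_cols):
    """Flag dummies that are not 0/1 and counts that are not whole numbers."""
    problems = []
    for row in rows:
        key = (row["state"], row["year"])
        for col in dummy_cols:
            if col in row and row[col] not in (0, 1):
                problems.append((key, col, row[col]))
        for col in count_cols:
            if col in row and not float(row[col]).is_integer():
                problems.append((key, col, row[col]))
    return problems

# Only the state-years flagged in the text; column names are hypothetical.
rows = [
    {"state": 9, "year": 1888, "south": 1, "midatlantic": 0.33},
    {"state": 14, "year": 1886, "midwest": 0.5, "south": 0.5},
    {"state": 31, "year": 1909, "new_seg_laws": 0.333333},
]

problems = find_data_errors(
    rows,
    dummy_cols=["south", "midwest", "midatlantic"],
    count_cols=["new_seg_laws"],
)
for key, col, value in problems:
    print(key, col, value)
```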
<hr />
<h2 id="footnotes">Footnotes</h2>
<p>See <a href="https://github.com/maswiebe/metrics/blob/main/cook_replication.do">here</a> for code.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Cook notes that “a comparison of a sample of similar patents obtained by white and African American inventors shows that the time between patent application and grant for the two groups was not significantly different, 1.4 years in each case.” (p.226, fn. 15) Also, there is no application-year patent data for 1870-72. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This discrepancy becomes even more puzzling when we compare the paper and the code:</p>
<ul>
<li>Figure 1 reports patents per million by grant year, but uses a variable named <code class="language-plaintext highlighter-rouge">patgrntpc</code> with the label ‘Patents by grant year’. The ‘pc’ would seem to indicate patents per capita.</li>
<li>Figure 2 reports patents per million by application year, using a variable <code class="language-plaintext highlighter-rouge">pat_appyear_pm</code>, with ‘pm’ corresponding to ‘per million’.</li>
<li>Table 5 presents descriptive statistics, with a ‘Patents, per million’ variable with a mean of 0.16, but the code uses <code class="language-plaintext highlighter-rouge">patgrntpc</code>.</li>
<li>Equation 1 and Table 6 both refer to patents per capita. The code for Table 6 uses the logarithm of <code class="language-plaintext highlighter-rouge">patgrntpc</code>.</li>
</ul>
<p>Although the variable <code class="language-plaintext highlighter-rouge">patgrntpc</code> would seem to be ‘Patents by grant year, per capita’, this can’t be true: the average value is 0.16 for Blacks, and 425 for whites. These values are clearly measured per million.
So the variable must be misnamed, and actually represents patents per million, as described in Figure 1 and Table 5.
This means that Equation 1 and Table 6 are mistaken: the dependent variable is log patents per million, and <em>not</em> log patents per capita. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Another explanation is that the application-year variable counts all patents that were applied for, including patents that were denied. This is not consistent with the text, where Cook only mentions 726 granted patents. In footnote 15, Cook writes that analyzing “[a]pplication rejection rates […] is beyond the scope of the current paper.” Moreover, even if true, this explanation doesn’t account for why the grant-year variable does not add up to 726. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Cook’s Table 6 incorrectly shows the lynching estimates in Columns 2 and 3 as having p-values less than 0.05. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Note that N = 110 and 55 instead of 112 and 56.</p>
<p><img src="https://michaelwiebe.com/assets/cook_replication/table6b.png" alt="" width="80%" /> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Number of states by region: South 15, Midwest 12, Northeast 6, West 12, Mid-Atlantic 7. Eleven states enter after 1870, and hence have fewer than 71 years in the complete panel. See <a href="https://github.com/maswiebe/metrics/blob/main/cook_replication.do">code</a> for details. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>The actual number is 19.33. Somehow, one state-year observation has a value of 0.33 for the number of new segregation laws. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>In Column 4, Cook runs a regression using Southern patents as the dependent variable. That is, while still using the full panel, the patent variable is set to 0 for non-Southern states. This is an incorrect approach for estimating heterogeneous effects. A correct approach would restrict the sample to Southern states, as I did above, or use the full sample and interact the violence variables with a South dummy. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>Cook mentions in footnote 49 that lynchings have a negative effect on ‘miscellaneous patents’, but this is not reported in the table, and the variable is not included in the dataset. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Did medical marijuana legalization reduce crime? A replication exercise<br />2021-03-19T20:00:00+00:00, http://michaelwiebe.com/blog/2021/03/mml</p>
<h1 id="summary">Summary</h1>
<p>In this post I replicate the <a href="https://academic.oup.com/ej/article/129/617/375/5237193">paper</a> “Is Legal Pot Crippling Mexican Drug Trafficking Organisations? The Effect of Medical Marijuana Laws on US Crime” by Gavrilova, Kamada, and Zoutman (Economic Journal, 2019; <a href="https://michaelwiebe.com/assets/mml/gkz_data.zip">replication files</a>).</p>
<p>I find three main problems in the paper:</p>
<ul>
<li>it uses weighting when its own justification doesn’t apply</li>
<li>it uses a level dependent variable, and isn’t robust to log-level or Poisson models</li>
<li>it does not test for pretrends in the disaggregated crime variables, and two alternative event studies show that the results are driven by differential trends</li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>This paper studies the effect of medical marijuana legalization on crime in the U.S., finding that legalization decreases crime in states that border Mexico. The paper uses a triple-diff method, essentially doing a diff-in-diff for the effect of legalization on crime, then adding an interaction for being a border state.</p>
<p>The paper uses county level data over 1994-2012, with treatment (medical marijuana legalization, MML) occurring at the state level.
The authors use “violent crimes” as the outcome variable in their main analysis, defined as the sum of homicides, robberies, and assaults, where each is measured as a rate per 100,000 population.
They also perform separate analyses for each of the three crime categories.</p>
<p>The basic triple-diff regression is:</p>
\[y_{cst} = \beta^{border} D_{st}B_{s} + \beta^{inland} D_{st} (1-B_{s}) + \gamma_{c} + \gamma_{t} + \varepsilon_{cst}.\]
<p>Here \(y_{cst}\) is the outcome in county \(c\) in state \(s\) in year \(t\); \(D_{st}\) is an indicator for having enacted MML by year \(t\); \(B_{s}\) is an indicator for bordering Mexico; \(\gamma_{c}\) are county fixed effects; \(\gamma_{t}\) are year fixed effects. The full model also includes time-varying controls, border-year fixed effects, and state-specific linear time trends.
The outcome is crime rates per 100,000 population, measured in levels, so the regression coefficients will not have a percentage interpretation; we’ll come back to this later.</p>
<p>This isn’t a standard triple-diff. In this model, \(\beta^{border}\) is capturing the absolute effect of MML in border states, and not the differential effect relative to inland states. To see this, compare to:</p>
\[y_{cst} = \beta^{DD} D_{st} + \beta^{DDD} D_{st} \times B_{s} + \gamma_{c} + \gamma_{t} + \varepsilon_{cst}.\]
<p>Here, \(\beta^{DD}\) represents the effect of MML in inland states, and \(\beta^{DDD}\) is the differential effect in border states (relative to the effect in inland states). That is, \(\beta^{inland} = \beta^{DD}\) and \(\beta^{border} = \beta^{DD} + \beta^{DDD}\).
This is perhaps an issue of taste. What I would primarily want to know is whether MML had a larger effect in border states relative to inland states; the absolute effect in border states is secondary.
Hence, I will report results from the second model (although the differences are small, because the inland effect is small: \(\beta^{inland} = \beta^{DD} \approx 0\)).</p>
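<p>To make the mapping between the two parameterizations explicit, expand the treatment terms of the first model:</p>
\[\beta^{border} D_{st}B_{s} + \beta^{inland} D_{st} (1-B_{s}) = \beta^{inland} D_{st} + (\beta^{border} - \beta^{inland}) D_{st} \times B_{s}.\]
<p>Matching coefficients with the second model gives \(\beta^{DD} = \beta^{inland}\) and \(\beta^{DDD} = \beta^{border} - \beta^{inland}\). The two regressions therefore fit the data identically; they differ only in which contrast gets its own coefficient and standard error.</p>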
<p>The authors find that, on average, MML reduces violent crimes by 35 crimes per 100,000 population, but the estimate is not statistically significant (the standard error is 22).
Then, zooming in on the border states, they find a significant reduction of 108 crimes per 100,000 (and a nonsignificant increase of 2.8 in inland states).
There are three border states that legalized medical marijuana: California, New Mexico, and Arizona. (Texas is the remaining border state.)
Splitting up the effect by treated border state, we have a reduction of 34 in Arizona, 144 in California, and 58 in New Mexico.</p>
<p>I don’t really like this “zoom in on the significance” style of research. We can always find significance if we run enough interactions.
And as we zoom in on subgroups, we lose external validity: can we make meaningful predictions for a state or country that legalizes marijuana but doesn’t border Mexico?
Moreover, the identifying assumptions become harder to believe. When n=3, it’s more plausible that differential shocks are driving the result (compared to n=20, say). That is, it could be that crime was already decreasing in the three border states when they passed MML, and the negative correlation between MML and crime is coincidental.</p>
<!--
Regression weights
==================
The paper first reports the difference-in-differences estimate, for the average effect of MML on all crimes. This is -35, with a standard error of 22, so not significant.
Then they report a triple diff estimate of -107 for the border states, with three stars.
So this -107 is a weighted average of the effect for border states and the effect for inland states.
Using regression weights (link),
doesn't work: should be: -35 = -107*w_border + 2.8 * w_inland
w_in = 1-wb
-35 = -107*w +2.8 - 2.8w
-37.8 = -109.8w
w= 37.8/109.8 = .34, w_inland = .66
so -35 = -107*.34 + 2.8 * .66
regression weights are not designed for this.
The DD estimate is a variance-weighted average of the heterogeneous effects, but with different weights.
-->
<p>Ok, let’s get into the issues.</p>
<h1 id="weighting">Weighting</h1>
<p>The authors use weighted least squares (weighting by population) for their main results.
They justify weighting by performing a Breusch-Pagan test, and finding a positive correlation between the squared residuals and inverse population. This implies larger residuals in smaller counties. In other words, there is heteroskedasticity, and weighting will decrease the size of the standard errors, i.e., increase precision. However, in Appendix Table D7, you’ll note that while they get a positive correlation when using homicides and assaults as the dependent variable, this coefficient is negative and nonsignificant for robberies.
So by the Breusch-Pagan test, the robbery results actually should not be weighted.
And in Table D9, the unweighted robbery estimate has smaller standard errors than the weighted one: weighting is <em>reducing</em> precision.
And yet, the paper still uses weighting when estimating the effect of MML on robberies (in Table 4).
We’ll see below that this makes a big difference for the effect size.</p>
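<p>The logic of that check can be sketched with simulated data. This is a simplified illustration, not the paper’s procedure: it shows only the auxiliary regression of squared residuals on inverse population (the full Breusch-Pagan test also forms a test statistic), and the populations, treatment, and coefficients are all made up.</p>

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(1)
n_counties = 2000
# Hypothetical county populations and a hypothetical binary treatment.
population = [rng.uniform(1_000, 100_000) for _ in range(n_counties)]
treated = [rng.randint(0, 1) for _ in range(n_counties)]
# Simulated crime rate whose error variance is proportional to 1/population.
crime_rate = [
    50 - 5 * d + rng.gauss(0, 300 / p ** 0.5)
    for d, p in zip(treated, population)
]

# Step 1: fit the main regression and keep the squared residuals.
intercept, slope = simple_ols(treated, crime_rate)
squared_resid = [(y - intercept - slope * d) ** 2 for d, y in zip(treated, crime_rate)]

# Step 2: auxiliary regression of squared residuals on inverse population.
inv_pop = [1 / p for p in population]
_, aux_slope = simple_ols(inv_pop, squared_resid)
print(f"auxiliary slope on 1/population: {aux_slope:.0f}")
```

<p>A positive slope says that smaller counties have noisier outcomes, which is the case where population weights help. A zero or negative slope, as reported for robberies, removes that justification.</p>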
<h1 id="modelling-the-dependent-variable">Modelling the dependent variable</h1>
<p>The authors estimate the effect of MML on crime using a level dependent variable instead of taking the logarithm, which I had thought was standard.
In particular, their main results use the aggregate crime rate, which leads to a “level” interpretation: MML reduces the crime rate by \(\hat{\beta}=\) 108 crimes per 100,000 population.
<!-- In secondary results, they break this into categories: homicides, robberies, and assaults (all in rates per 100,000 population). -->
<!-- Interpret their results in level terms: MML reduces the crime rate by 100 crimes per 100,000 population. --></p>
<p>I would have used a log-level regression, taking \(\log(y+1)\) for the dependent variable (adding 1 if there are zeroes in the data), which gives a percentage (or semi-elasticity) interpretation: MML reduces crime by \(100 \times (\exp(\hat{\beta})-1) \%\).
<!-- (which is approximately $$\beta$$ when $$abs(\beta)<0.1$$). -->
The paper doesn’t justify why they don’t use a log-level model. This is even more surprising when you see that they manually calculate the semi-elasticity (p.19), again without mentioning the log-level approach.</p>
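<p>As a quick illustration of the percentage interpretation, the conversion from a log-level coefficient to a percent change is a one-liner. The coefficient values below are arbitrary examples, not estimates from the paper; with \(\log(y+1)\) as the outcome, the percentage strictly applies to \(y+1\).</p>

```python
import math

def percent_effect(beta):
    """Percent change in the outcome implied by a log-level coefficient."""
    return 100 * (math.exp(beta) - 1)

# Arbitrary example coefficients, compared with the naive 100 * beta reading.
for beta in (-0.05, -0.10, -0.50):
    print(f"beta = {beta:+.2f}: {percent_effect(beta):+.1f}% (naive: {100 * beta:+.0f}%)")
```

<p>The naive reading is close for small coefficients and increasingly off for large ones.</p>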
<!-- see Dell drug war paper for use of log(hom+1) ; doesn't actually use it?-->
<p>After spending some time looking into this question of logging the dependent variable for skewed, nonnegative data, I’m still pretty confused.
It seems the options are: (1) level-level regression, as used in this paper; (2) log-level regression; (3) transforming \(y\) with the inverse hyperbolic sine; and (4) Poisson regression (with robust standard errors, <a href="https://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/">you don’t need to</a> assume mean=variance). But it’s not clear what the “correct” approach is.
I’d expect a true result to be robust across multiple approaches, so let’s try that here.</p>
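<p>For a sense of how options (2) and (3) relate, the sketch below compares the two transforms on a few arbitrary values. For large \(y\) they differ by roughly the constant \(\log 2\), which an intercept or fixed effect absorbs; the two diverge from that pattern only near zero.</p>

```python
import math

def log1p_transform(y):
    """log(y + 1), the log-level transform used when y can be zero."""
    return math.log(y + 1)

def ihs_transform(y):
    """Inverse hyperbolic sine, log(y + sqrt(y^2 + 1))."""
    return math.asinh(y)

# Arbitrary example values; the gap approaches log(2) as y grows.
for y in (0, 1, 10, 100, 1000):
    gap = ihs_transform(y) - log1p_transform(y)
    print(f"y = {y:>4}: log(y+1) = {log1p_transform(y):.3f}, "
          f"asinh(y) = {ihs_transform(y):.3f}, gap = {gap:.3f}")
print(f"log(2) = {math.log(2):.3f}")
```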
<p>I estimate the triple-diff model using a level-level regression (to directly replicate the paper), a log-level regression, and a Poisson regression. (The inverse hyperbolic sine approach is almost identical to log-level, so I skip it here.)
To see how the specification matters, I conduct a specification curve analysis using R’s <a href="https://masurp.github.io/specr/">specr</a> package. Specifically, I run all possible combinations of model elements, either including or excluding covariates, population weights, state-specific linear time trends, and border-year fixed effects.
This will allow us to see whether possibly debatable modelling choices, such as state-specific linear trends, are driving the results.
<!-- eg, including time trends and weighting by population, but excluding covariates. --></p>
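<p>The set of specifications is just the Cartesian product of the on/off choices. The sketch below enumerates it in Python to show the bookkeeping; it is not the specr call used for the plots, and the labels are chosen here to mirror the ones in the figures.</p>

```python
from itertools import product

# Optional model elements that are switched on or off.
OPTIONS = ("trends", "border", "weights")

def model_label(included):
    """Label a specification by the optional elements it includes."""
    return " + ".join(included) if included else "baseline"

def specification_grid():
    """All on/off combinations of the options, with and without covariates."""
    specs = []
    for flags in product([False, True], repeat=len(OPTIONS)):
        included = [name for name, on in zip(OPTIONS, flags) if on]
        for controls in ("no covariates", "all covariates"):
            specs.append({"model": model_label(included), "controls": controls})
    return specs

grid = specification_grid()
for spec in grid:
    print(f"{spec['model']}, {spec['controls']}")
```

<p>Four binary choices give 16 specifications for each way of modelling the dependent variable.</p>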
<p>Here are the homicide results, first in the level-level model (as in the paper).
Panel A plots the coefficient estimates in increasing order, while panel B shows the corresponding specification.
Each specification has two markers in panel B, one in the upper part indicating the model, and one in the lower part indicating whether all or no covariates are included in the model.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
For example, the specification with the most negative estimate is ‘trends + weights, no covariates’.
In both panels, the x-axis is just counting the number of specifications, and the color scheme is: (red, negative and significant), (grey, insignificant), (blue, positive and significant).
The ‘baseline’ specification omits the state-specific trends and border-year fixed effects, and doesn’t weight by population.
I’ll be focusing on the full specification, ‘trends + border + weights, all covariates’, which includes state-specific linear trends, border-year fixed effects, and weights by population.</p>
<h4 id="level-level-model-homicides">Level-level model: homicides</h4>
<p><img src="https://michaelwiebe.com/assets/mml/hom_level.png" alt="" width="80%" />
We can see that the estimate is negative and statistically significant in the full specification, with and without covariates.
Most estimates are nonsignificant; these are generally the unweighted models, indicating the importance of population weighting for these results.</p>
<h4 id="log-level-model-homicides">Log-level model: homicides</h4>
<p><img src="https://michaelwiebe.com/assets/mml/hom_log.png" alt="" width="80%" />
Next, in the log-level model, most estimates are insignificant, including the full specification.
Two models even have positive and significant results (in blue). Let’s see the Poisson model:</p>
<h4 id="poisson-model-homicides">Poisson model: homicides</h4>
<p><img src="https://michaelwiebe.com/assets/mml/hom_pois.png" alt="" width="80%" />
Here I use the homicide count (instead of the rate per 100,000 population), though note that the controls include log population and I’m weighting by population.
<!-- In any case, the results are similar with the homicide rate -->
In this case, the estimates are almost exactly zero and nonsignificant in the full specification.
So, the homicide results only go through using the level-level regression, and not in the log-level or Poisson models.</p>
<p>For the other dependent variables, I’ll show the graphs in the footnotes.
The results for robberies are more robust. The full specification is negative and significant across all three models.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>
However, the assault results are not robust, with the full specification nonsignificant for both log-level and Poisson regressions.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>This doesn’t look great for the paper. I’d expect real effects to be robust across the three models. I conclude that at best, the paper provides evidence for an effect of MML on robberies in border states, but not on homicides or assaults. And this is assuming the event study graph looks good for pretrends, which I’ll discuss next.</p>
<h1 id="event-study">Event study</h1>
<p>There are big trends in crime over this period. <a href="https://www.statista.com/statistics/191219/reported-violent-crime-rate-in-the-usa-since-1990/">Crime fell</a> a lot during the 90s, and again after 2007.
To show that their results aren’t driven by these trends, the authors present an event study graph in Figure 6, estimating a triple-diff coefficient in each year. Basically, this is estimating the triple diff for each year relative to an omitted period.</p>
<p>The authors estimate their main results using the 1994-2012 sample. For the event study, they also use an extended sample from 1990-2012. The extended sample has issues, because it uses flawed imputed data over 1990-1992, and the year 1993 is missing entirely.
<!-- But having more pretreatment years is helpful, because California is treated in 1996, leaving only two years for estimating pretrends in the original sample. -->
Here I will show results from the main sample, 1994-2012.</p>
<p>For their main event study, the authors only include dummies for relative years -2 to 4, and bin all years 5+ in one dummy.
This is because California is treated in 1996 and only has two years of pretreatment data, and wouldn’t contribute to any dummies before -2.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>
But this is a bit of an arbitrary choice. Similarly, Arizona is treated in 2010 and only has two years of post-treatment data, and hence doesn’t contribute to any dummies after +2. So should we include dummies only for [-2,2]?</p>
<p>I think it’s fine to include dummies for [-5,5], with the understanding that some states do not contribute to some estimates. (Specifically, California doesn’t have dummies for -5 to -3, and Arizona doesn’t have dummies for 3 to 5+.)
In this setup, the omitted years are those before relative year -5, in contrast to the standard approach of omitting relative year -1. (As noted in the last footnote, California has no omitted years, so the software should drop one year.)</p>
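<p>The coverage argument can be checked mechanically. The Python sketch below is an illustration only (the function names are mine, not from the replication code); it maps calendar years to relative-year bins for the 1994-2012 sample and lists which bins California (treated 1996) and Arizona (treated 2010) contribute to.</p>

```python
def relative_year_bin(year, treat_year, lo=-5, hi=5):
    """Map a calendar year to an event-time label.

    Years earlier than `lo` relative to treatment return None (the omitted
    group); years `hi` or more after treatment are pooled into one 'hi+' bin.
    """
    rel = year - treat_year
    if rel < lo:
        return None
    if rel >= hi:
        return f"{hi}+"
    return str(rel)


def bins_covered(treat_year, first=1994, last=2012, lo=-5, hi=5):
    """List the event-time bins a state treated in `treat_year` contributes to."""
    seen = []
    for year in range(first, last + 1):
        label = relative_year_bin(year, treat_year, lo, hi)
        if label is not None and label not in seen:
            seen.append(label)
    return seen


california = bins_covered(1996)
arizona = bins_covered(2010)
print("California:", california)
print("Arizona:", arizona)
```

<p>California has no year that maps to the omitted group, which is the dummy variable trap mentioned in the footnote, while Arizona never reaches the bins for 3, 4, or 5+.</p>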
<p>Next I plot my version of their event study graph, using a level dependent variable.
Since this is a triple-diff, I include relative year dummies for the treated states, as well as separate relative year dummies for the treated border states. I plot the coefficients on the border-state relative year dummies.
While the paper only includes dummies for [-2,5+], I estimate coefficients for [-5,5+].<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<h4 id="event-study-violent-crimes-binning-5">Event study: violent crimes (binning 5+)</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_violent_bin.png" alt="" width="75%" /></p>
<p>Compared to the event study in the paper, here the coefficients are all negative. For the pretreatment estimates (treatment occurs in period 0), this means level differences between the treated border and inland states. There is also a slight downward trend before the treatment, hinting at differential trends.
<!-- rewrite!!! -->
<!-- This looks pretty similar, but now the coefficients in -3 *and* -5 are negative and significant. This kind of pretreatment noise doesn't inspire confidence. --></p>
<p>In any case, note that this graph is for the aggregated violent crime variable. Where are the event studies for the individual dependent variables? <em>The authors do not show them!</em> This is a major flaw, and I can’t believe that the referees missed it.
Even if we found no pretrends in the aggregate variable, there could still be pretrends in the component variables. Let’s take a look ourselves.</p>
<h4 id="event-study-homicides-binning-5">Event study: homicides (binning 5+)</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_hom_bin.png" alt="" width="75%" />
First up, using the homicide rate as the dependent variable, we get a big mess.
There are big movements in years -3 and -2: relative to the treatment year, homicides were higher three years prior, and lower two years prior.
<!-- There's a drop in the coefficient from -3 to -2, indicating a drop in homicides two years before MML was implemented. -->
<!-- There are large negative estimates in relative years -2 and -1, -->
So at least for homicides, it looks like the negative triple-diff estimate could just be picking up noise.
Now we know why the authors didn’t include separate event study graphs by dependent variable.</p>
<h4 id="event-study-robberies-unweighted-binning-5">Event study: robberies, unweighted (binning 5+)</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_rob_bin_unw.png" alt="" width="75%" />
For robberies, recall that the Breusch-Pagan test failed to justify weighting, so I do not use weights.
Here, it also looks like a negative trend is driving the result: robberies were smoothly decreasing in treated border states before MML was implemented. (See the weighted graph in the footnote.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>)
The common trends assumption for the triple-diff appears to be violated.</p>
<h4 id="event-study-assaults-binning-5">Event study: assaults (binning 5+)</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_ass_bin.png" alt="" width="75%" />
Finally, for assaults, the event study actually doesn’t look bad, although the standard errors are large.
This is a bit surprising, given that the assault results were not robust across log-level and Poisson models.
<!-- we see a similar pattern as violent crimes, but with smaller coefficients. -->
<!-- Recall that 'violent crime' is defined as the sum of homicide, robbery, and assault rates. The averages of these variables are 5, 44, and 265. So clearly the violent crime results will be driven mostly by assaults and robberies, which swamp the null result for homicides. --></p>
<!-- What do these event studies look like using the log-level or Poisson models? I'll throw them in the footnote.[^5]
The homicide results again look like nothing, in both cases.
The robbery graph looks good in Poisson, but has pre-trends in log-level.
Assaults are either flat or have a positive trend. -->
<p>Overall, this doesn’t look good for the paper. I think this is an equally defensible event study method, but it nukes their homicide and robbery results.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>
<!-- Overall, I don't trust these event study results very much. There's clearly no effect for homicides, and the assault results are not robust across models. The robbery results are most promising, but still not great. --></p>
<hr />
<p>I’m not a fan of binning in event studies. In Andrew Baker’s <a href="https://andrewcbaker.netlify.app/2020/06/27/how-to-create-relative-time-indicators/">simulations</a>, binning periods 5- and 5+ performs badly.
In contrast, a fully-saturated model including all relative year dummies (except for relative year -1, which is the omitted year) performs perfectly. So let’s try that here.
<!-- normalizing the above graphs around the -1 estimate; but also changing the estimates, since full set of relative year dummies --></p>
<p>By omitting year -1, we’re basically normalizing the above event study graphs around the -1 estimate (but also changing the estimates, since we’re including all other relative year dummies).
Hence, the homicide graph has the same patterns, but shifted up.
We again find a clear trend in the robbery graph.
But now the assault graph also looks to be driven by trends.</p>
<h4 id="event-study-homicides">Event study: homicides</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_hom.png" alt="" width="75%" /></p>
<h4 id="event-study-robberies">Event study: robberies</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_rob.png" alt="" width="75%" /></p>
<h4 id="event-study-assaults">Event study: assaults</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_ass.png" alt="" width="75%" /></p>
<p>Takeaway: now I really doubt that MML had a causal effect on crime.</p>
<!-- I'm not sure if these graphs are right. Some of the relative-year indicators get dropped due to collinearity, which might affect the interpretation as triple-diff vs double-diff.
In any case, they don't look good. Homicides are noisy before treatment, robberies have a clear pretrend, and assaults have a noisy pretrend. -->
<!-- Bacon-goodman: adding years to sample changes DD estimate: more weight on California, since closer to middle; less weight on Ariz, NM, since closer to end -->
<h1 id="synthetic-control">Synthetic control</h1>
<p>To further dig into these trends, I aggregated the data from county- to state-level and performed a synthetic control analysis for each of the three treated border states: California, Arizona, and New Mexico. This aggregation is probably imperfect, and it would be better to start with state-level data, but let’s see what happens. (Running level-level regressions, I still get negative results, with effect sizes similar to the county-level data. See the specification curves in the footnote. <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>)</p>
<p>The idea of synthetic control is to construct an artificial control group for our treated state, so we can evaluate the treatment effect simply by comparing the outcome variable in the treated state and its synthetic control. The synthetic control is a weighted average of control states, with weights chosen to match the treated state on pre-period trends. I use the never-treated states as the donor pool; I’ll report the weights below.</p>
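<p>As a rough sketch of how such weights can be chosen, the Python snippet below grid-searches nonnegative weights that sum to one and minimise the squared pre-period gap. This is a toy stand-in, not the estimator used for the graphs: the donor names and numbers are made up, and real implementations use a proper optimiser and usually match on covariates as well.</p>

```python
from itertools import product


def synthetic_weights(treated_pre, donors_pre, step=0.05):
    """Grid-search donor weights (nonnegative, summing to one) that minimise the
    squared pre-period gap between the treated unit and the weighted donors."""
    names = list(donors_pre)
    n_steps = round(1 / step)
    best, best_loss = None, float("inf")
    # Enumerate weights for all donors but the last; the last takes the remainder.
    for combo in product(range(n_steps + 1), repeat=len(names) - 1):
        if sum(combo) > n_steps:
            continue
        units = list(combo) + [n_steps - sum(combo)]
        weights = [u / n_steps for u in units]
        loss = 0.0
        for t, y in enumerate(treated_pre):
            synth = sum(w * donors_pre[name][t] for w, name in zip(weights, names))
            loss += (y - synth) ** 2
        if loss < best_loss:
            best, best_loss = dict(zip(names, weights)), loss
    return best, best_loss


# Made-up pre-period outcomes for three hypothetical donor states.
donors = {
    "donor_a": [100.0, 95.0, 90.0, 88.0],
    "donor_b": [50.0, 52.0, 49.0, 47.0],
    "donor_c": [200.0, 210.0, 205.0, 190.0],
}
# The toy treated state is built as 0.7 * donor_a + 0.3 * donor_b.
treated = [85.0, 82.1, 77.7, 75.7]

best_weights, best_loss = synthetic_weights(treated, donors)
print(best_weights)
```

<p>In this toy case the search recovers the weights used to build the treated series; with real data the pre-period fit is only approximate, and the post-period gap is read as the treatment effect.</p>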
<p>Here I’ll show the robbery results for the three states (using the level dependent variable), to see what’s happening with that smooth trend.
Note that these graphs are plotting the raw outcome variable, so we’re seeing the actual trends in the data.</p>
<p><img src="https://michaelwiebe.com/assets/mml/sc_cali_rob.png" alt="" width="80%" />
California’s synthetic control is 68% New York and 28% Minnesota.
California’s MML occurs in the middle of the 1990s crime decrease, and it doesn’t look like there’s much of an effect in 1996.
<!-- Recall that California only has two years of pretreatment data, so there's not much to match on. --></p>
<p><img src="https://michaelwiebe.com/assets/mml/sc_ariz_rob.png" alt="" width="80%" />
Arizona’s synthetic control is 61% Texas, 24% Florida, and 15% Wyoming.
Again, there doesn’t seem to be a treatment effect.</p>
<p><img src="https://michaelwiebe.com/assets/mml/sc_nmex_rob.png" alt="" width="80%" />
New Mexico’s synthetic control is 51% Mississippi, 21% Louisiana, 18% Texas, and 7% Wyoming.
Its MML occurs before a drop in robberies that is partly matched by the synthetic control group.</p>
<p>You can look at the other synthetic control graphs in this footnote.<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup></p>
<p>Overall, I worry that these three states coincidentally legalized medical marijuana when crime was high and falling, and that the triple-diff estimates are just picking up these trends.
Based on my analysis here, I don’t believe that medical marijuana legalization reduced crime in the US.</p>
<h1 id="randomization-inference">Randomization inference</h1>
<p>One final note: the paper calculates a (one-sided) randomization inference p-value of 0.03, and claims that this is evidence for their result being real.
However, as I discuss in <a href="https://michaelwiebe.com/blog/2021/01/randinf">this post</a>, this claim is false. With large sample sizes, there’s no reason to expect RI and standard p-values to differ, so a significant RI p-value provides no additional evidence.</p>
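<p>For readers unfamiliar with the procedure, here is a minimal Python sketch of a one-sided randomization inference p-value. It is not the paper’s procedure: a simple difference in means stands in for the triple-diff estimate, and the data are made up.</p>

```python
import random


def diff_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for y, d in zip(outcomes, treated) if d]
    c = [y for y, d in zip(outcomes, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)


def ri_pvalue(outcomes, treated, draws=2000, seed=0):
    """One-sided RI p-value: the share of reshuffled treatment assignments whose
    estimate is at least as negative as the observed estimate."""
    rng = random.Random(seed)
    observed = diff_in_means(outcomes, treated)
    labels = list(treated)
    count = 0
    for _ in range(draws):
        rng.shuffle(labels)  # reassign treatment at random, keeping the number treated fixed
        if diff_in_means(outcomes, labels) <= observed:
            count += 1
    return observed, count / draws


# Made-up outcomes: the first four units are treated.
outcomes = [3, 4, 2, 5, 9, 8, 10, 7, 9, 11]
treated = [True] * 4 + [False] * 6

observed, p_value = ri_pvalue(outcomes, treated)
print(observed, p_value)
```

<p>The p-value is just the fraction of placebo assignments that produce an estimate as extreme as the real one, so it answers the same question as a conventional p-value rather than providing an independent check.</p>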
<h1 id="conclusion">Conclusion</h1>
<p>I think it’s plausible that moving marijuana production from the black market to the legal market would reduce crime (at least in the long run).
But the effect of medical marijuana legalization on crime is too small to detect in the data.</p>
<!-- These seem like severe problems for a paper published in Economic Journal. How did it get through peer review?
The authors present a formal supply and demand model and report several extensions, finding stronger reductions in crime for counties closer to the border, and that MML in border-adjacent inland states reduces crime in the border state.
Perhaps the referees were awed by the edifice in front of them, and only requested small robustness checks instead of questioning the foundational results. -->
<!-- greasy-->
<!-- - using level depvar, not doing log for semi-elasticity -->
<!-- - weighting robbery results, when not justified -->
<!-- - using aggregated depvar, not doing event study separately by category -->
<p><!-- - also allowing them to weight, since rejected BP with aggregate crime variable -->
<!-- - making up RI bullshit --></p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<p>See <a href="https://github.com/maswiebe/metrics/blob/main/mml_replication.r">here</a> for R code, and <a href="https://michaelwiebe.com/assets/mml/gkz_data.zip">here</a> for the original replication files. (For some reason, the replication files aren’t online anymore.)</p>
<p>PS: Table 5 does heterogeneity by type of homicide; I’d be curious to see the event study for each of these outcomes.</p>
<!-- Log-level results:
#### Event study (log-level): homicides, (binning 5+)
![](https://michaelwiebe.com/assets/mml/es_lhom_bin.png){:width="80%"}
#### Event study (log-level): robberies, unweighted (binning 5+)
![](https://michaelwiebe.com/assets/mml/es_lrob_bin.png){:width="80%"}
#### Event study (log-level): assaults, (binning 5+)
![](https://michaelwiebe.com/assets/mml/es_lass_bin.png){:width="80%"}
Poisson results:
#### Event study (Poisson): homicides, (binning 5+)
![](https://michaelwiebe.com/assets/mml/es_phom_bin.png){:width="80%"}
#### Event study (Poisson): robberies, unweighted (binning 5+)
![](https://michaelwiebe.com/assets/mml/es_prob_bin.png){:width="80%"}
#### Event study (Poisson): assaults, (binning 5+)
![](https://michaelwiebe.com/assets/mml/es_pass_bin.png){:width="80%"} -->
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The full covariate list is: an indicator for decriminalization, log median income, log population, poverty rate, unemployment rate, and the fraction of males, African Americans, Hispanics, ages 10-19, and ages 20-24. In general, I find that adding controls barely changes the \(R^{2}\), so these variables aren’t adding much beyond the county and year fixed effects. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Robbery results:</p>
<h4 id="level-level-model-robberies">Level-level model: robberies</h4>
<p><img src="https://michaelwiebe.com/assets/mml/rob_level.png" alt="" width="80%" />
In the level-level model, we see a big difference between the weighted and unweighted results. Clearly, there are heterogeneous treatment effects, with larger effects in the higher-weight states (California, probably).
As I noted above, the robbery estimates should not be weighted.</p>
<h4 id="log-level-model-robberies">Log-level model: robberies</h4>
<p><img src="https://michaelwiebe.com/assets/mml/rob_log.png" alt="" width="80%" /></p>
<h4 id="poisson-model-robberies">Poisson model: robberies</h4>
<p><img src="https://michaelwiebe.com/assets/mml/rob_pois.png" alt="" width="80%" /> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Assault results:</p>
<h4 id="level-level-model-assaults">Level-level model: assaults</h4>
<p><img src="https://michaelwiebe.com/assets/mml/ass_level.png" alt="" width="80%" /></p>
<h4 id="log-level-model-assaults">Log-level model: assaults</h4>
<p><img src="https://michaelwiebe.com/assets/mml/ass_log.png" alt="" width="80%" /></p>
<h4 id="poisson-model-assaults">Poisson model: assaults</h4>
<p><img src="https://michaelwiebe.com/assets/mml/ass_pois.png" alt="" width="80%" /> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>One problem with this specification is that California has no omitted years. Every year from 1994-2012 has a dummy variable, which seems like a dummy variable trap (i.e., multicollinearity). Specifically: 1994-2000 are covered by dummies for -2 to 4, and 2001-2012 are covered by the 5+ binned dummy. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Moreover, as noted above, I am estimating the differential effect of MML in border states relative to inland states, while GKZ are estimating the absolute effect. I also drop counties where the black share of the population exceeds 100%. It seems the authors were doing some extrapolation that got out of control. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>We shouldn’t care about this graph, because weighting is unwarranted.</p>
<h4 id="event-study-robberies-weighted-binning-5">Event study: robberies, weighted (binning 5+)</h4>
<p><img src="https://michaelwiebe.com/assets/mml/es_rob_bin.png" alt="" width="80%" /> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>It’s depressing that event studies can differ so much based on slight model changes. I have a feeling that a lot of diff-in-diffs from the past twenty years are not going to survive replication. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>Specification curve for state-level results:</p>
<h4 id="level-level-model-homicides">Level-level model: homicides</h4>
<p><img src="https://michaelwiebe.com/assets/mml/s_hom_level.png" alt="" width="80%" /></p>
<h4 id="level-level-model-robberies-1">Level-level model: robberies</h4>
<p><img src="https://michaelwiebe.com/assets/mml/s_rob_level.png" alt="" width="80%" /></p>
<h4 id="level-level-model-assaults-1">Level-level model: assaults</h4>
<p><img src="https://michaelwiebe.com/assets/mml/s_ass_level.png" alt="" width="80%" /> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>Synthetic control results for homicides and assaults.
<img src="https://michaelwiebe.com/assets/mml/sc_cali_hom.png" alt="" width="80%" />
<img src="https://michaelwiebe.com/assets/mml/sc_ariz_hom.png" alt="" width="80%" />
<img src="https://michaelwiebe.com/assets/mml/sc_nmex_hom.png" alt="" width="80%" />
<img src="https://michaelwiebe.com/assets/mml/sc_cali_ass.png" alt="" width="80%" />
<img src="https://michaelwiebe.com/assets/mml/sc_ariz_ass.png" alt="" width="80%" />
<img src="https://michaelwiebe.com/assets/mml/sc_nmex_ass.png" alt="" width="80%" /> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
How I use regression weights to replicate research 2021-02-25T20:00:00+00:00 http://michaelwiebe.com/blog/2021/02/cook_violence
<p>One of the main tools I use for replication is <a href="https://sci-hub.st/https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12185">regression weights</a>. These show the weight that each observation contributes to a regression coefficient.
Suppose we’re regressing \(y\) on \(X_{1}\) and \(X_{2}\), with corresponding coefficients \(\beta_{1}\) and \(\beta_{2}\).
Then, the regression weights for \(\beta_{1}\) are the residuals from regressing \(X_{1}\) on \(X_{2}\), which represent the variation in \(X_{1}\) remaining after controlling for \(X_{2}\).
From <a href="https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem">Frisch-Waugh-Lovell</a>, we know that \(\beta_{1}\) can be estimated by regressing \(y\) on these residuals.
Hence, the regression weights show the actual variation used in the estimate.
When replicating a paper, looking at regression weights is a handy way to see what’s actually driving the result.</p>
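<p>Here is a small Python sketch of the same logic on made-up data, assuming just two regressors so that everything fits in a few lines. It residualizes \(X_{1}\) on \(X_{2}\), recovers \(\beta_{1}\) from those residuals, and turns the squared residuals into weights that sum to one. The function names are mine.</p>

```python
def ols_slope_intercept(x, y):
    """Slope and intercept from a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(x, y):
    """Residuals from regressing y on x (with an intercept)."""
    slope, intercept = ols_slope_intercept(x, y)
    return [b - intercept - slope * a for a, b in zip(x, y)]


def regression_weights(x1, x2):
    """Squared residuals of x1 on x2, normalised to sum to one."""
    res = residuals(x2, x1)
    total = sum(r ** 2 for r in res)
    return [r ** 2 / total for r in res]


# Made-up data: x1 only varies in the last two observations.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [0.0, 0.0, 0.0, 0.0, 3.0, 1.0]
# y is an exact linear function, so the true coefficient on x1 is 1.5.
y = [2.0 + 1.5 * a - 0.5 * b for a, b in zip(x1, x2)]

x1_tilde = residuals(x2, x1)                  # variation in x1 left after controlling for x2
beta1, _ = ols_slope_intercept(x1_tilde, y)   # Frisch-Waugh-Lovell: recovers the x1 coefficient
weights = regression_weights(x1, x2)
print(round(beta1, 6))
print([round(w, 3) for w in weights])
```

<p>Summing these weights within groups (regions, states, years) is then all it takes to see where the identifying variation comes from.</p>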
<p>In this post, I’ll give a quick demo of regression weights, looking at <a href="https://link.springer.com/article/10.1007/s10887-014-9102-z">Cook (2014)</a> (<a href="https://twitter.com/sci_hub_">sci-hub</a>) (<a href="https://link.springer.com/article/10.1007/s10887-014-9102-z#Sec20">replication files</a>) on the effect of racial violence on African American patents over 1870-1940.
This paper starts with striking time series data on patents by African American inventors.
In Figure 1, we see a big drop in black patents around 1900. What is driving this pattern?</p>
<p><img src="https://michaelwiebe.com/assets/cook_violence/fig1.png" alt="" width="80%" /></p>
<p>Cook argues that race riots and lynchings cause reduced patenting, directly by intimidating inventors, and indirectly by undermining trust in intellectual property laws (if the government won’t punish race rioters, why should you believe it’ll enforce your patents?).</p>
<p>Table 7 contains the main state-level regressions of patents on lynching rates and riots.
Using a random-effects model, Cook finds negative effects for both lynchings and riots.
I find similar results with a fixed effects model.</p>
<p>Let’s do regression weights, first for the lynching result.
I regress lynchings on the other variables, grab the residuals, square them, then normalize by the sum of squared residuals.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>* Stata code:
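use pats_state_regs_AAonly, clear
* Partial out the other regressors and fixed effects from lynchings; save the residuals
reghdfe lynchrevpc riot seglaw illit blksh regs regmw regne regw , ab(stateno year1910 year1913 year1928) vce(cl stateno) res(resid)
* Regression weight = squared residual as a share of the sum of squared residuals
gen res1 = resid^2
egen resid_tot = total(res1)
gen regweight = res1/resid_tot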
</code></pre></div></div>
<p>Next, let’s see how these weights vary by region.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>* region is constructed from the region dummies; its definition appears in a later code block
table region, c(sum regweight count patent)
</code></pre></div></div>
<p><img src="https://michaelwiebe.com/assets/cook_violence/regweight_lynch.png" alt="" width="55%" /></p>
<p>This is a bit surprising. The South has 81% of the weight, with the remainder coming from the West. The other three regions have basically zero contribution to the lynchings coefficient.</p>
<p>So let’s see what’s happening in the data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>table region, c(mean lynchrevpc)
</code></pre></div></div>
<p><img src="https://michaelwiebe.com/assets/cook_violence/table_lynch.png" alt="" width="40%" /></p>
<p>It turns out that basically all lynchings occurred in the South and West, with zero in the Midwest and Northeast (and roughly zero in the Mid-Atlantic).
Given this, the regression weights make sense.
When there’s no variation in a variable, it should contribute nothing to the regression. But because the Midwest and Northeast have data on the other covariates, they still add some information, which is why the weights aren’t exactly zero.</p>
<hr />
<p>Next, let’s see the results for the effect of riots on patenting. First, the regression weights, regressing riots on the other controls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>reghdfe riot lynchrevpc seglaw illit blksh regs regmw regne regw , ab(stateno year1910 year1913 year1928) vce(cl stateno) res(resid2)
gen res2 = resid2^2
egen resid_tot2 = total(res2)
gen regweight2 = res2/resid_tot2
table region, c(sum regweight2 count patent)
</code></pre></div></div>
<p><img src="https://michaelwiebe.com/assets/cook_violence/regweight_riot.png" alt="" width="55%" /></p>
<p>Again, the regional patterns are surprising.
This time, the South has 27% of the weight, and the Mid-Atlantic has 73%, with the other regions contributing nothing.
What’s going on?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gen region = .
replace region = 1 if (regs)
replace region = 2 if (regmw)
replace region = 3 if (regne)
replace region = 4 if (regw)
replace region = 5 if (regmatl)
label define reg_label 1 "South" 2 "Midwest" 3 "Northeast" 4 "West" 5 "Mid-Atlantic"
label values region reg_label
table region, c(sum riot)
</code></pre></div></div>
<p><img src="https://michaelwiebe.com/assets/cook_violence/table_riot.png" alt="" width="36%" /></p>
<p>It turns out there are only 5 riots in the state-level data.
Let’s dig deeper.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>table stateno, c(sum regweight2 count patent)
table year, c(sum regweight2 count patent)
* output omitted
</code></pre></div></div>
<p>The regression weight is concentrated on three states: state 39 has 58%, state 44 has 27%, and state 33 has 15% (the names are not in the data).
It’s also concentrated on four years: 56% on 1917, 12% on 1918, 15% on 1900, 12% on 1906.
This is because there are five riots occurring in four years, with two in 1917 in state 39, two in state 44 in different years, and one in state 33.
So the riot effect is driven almost entirely by the four state-year observations that had riots.</p>
<p>But wait. If you look, you’ll see that there are 35 riots in the time-series data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use pats_time_series, clear
collapse (sum) riot if race==0
su riot
</code></pre></div></div>
<p>Where did the other riots go?
It looks like the state data just has a lot of missing observations, which would explain the missing riots.
That is, the issue isn’t variables with missing values, but that most state-year observations do not even have a row in the data.
(I emailed Cook to ask about this, but didn’t get a response.)
As you can see, the sample size fluctuates over time; this is far from a balanced panel.</p>
<p><img src="https://michaelwiebe.com/assets/cook_violence/sample_size.png" alt="" width="85%" /></p>
<p>Note that 1917 and 1918 have the majority of the weight, but there are only two observations in each of those years.</p>
<p>The paper is not very clear about this. Table 5 reports the descriptive stats, but only has the riots variable for the time-series data, and not the state-level data. And Cook does not plot any of the raw state-level data, but instead jumps right into the regressions.</p>
<p>This seems like a serious problem for the riot results. The paper isn’t estimating the effect of riots on patenting in general; instead, it’s estimating the effect of five specific riots. If we could collect data on the remaining 30 riots, I’d expect the estimate to change.
In other words, why should we expect this result to be externally valid for other historical riots?</p>
<hr />
<p>To sum up, regression weights are an easy way to dig into a paper and see exactly what’s driving their results.</p>
<p>Happy replicating!</p>
Does meritocratic promotion explain China's growth? 2021-02-05T20:00:00+00:00 http://michaelwiebe.com/blog/2021/02/meritocracy
<p>One explanation for China’s rapid economic growth is meritocratic
promotion, where politicians with higher GDP growth are rewarded with
promotion. In this system, politicians compete against each other in
‘promotion tournaments’ where the highest growth rate wins. This
competition incentivizes politicians to grow the economy, and hence
helps explain the stunning economic rise of China.</p>
<p>The literature on meritocratic promotion finds evidence of meritocracy
for province, prefecture, and county leaders.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> However, as I discuss
in my <a href="https://michaelwiebe.com/assets/ch1.pdf">dissertation</a>, the evidence for province and prefecture leaders is
weak. In the provincial literature, the initial positive finding was not
confirmed in follow-up studies. And when I <a href="https://michaelwiebe.com/blog/2021/02/replications">replicated</a> the prefecture
literature, I found that the results there were not robust. So we don’t
have strong evidence that province and prefecture leaders are promoted
based on GDP growth. But, using data from two papers, I did find some
evidence for meritocratic promotion of county leaders (details <a href="https://michaelwiebe.com/assets/ch2.pdf">here</a>).</p>
<p>So how should we think about meritocracy in China? Despite the lack of
evidence for meritocratic promotion at the province and prefecture
levels, it’s still plausible that meritocracy has contributed to China’s
growth. Let’s grant that county leaders are promoted meritocratically,
directly incentivizing them to boost GDP growth.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> This means that
high-growth county leaders are promoted to prefecture positions. But
since prefecture leaders then consist only of high-growth leaders, there
isn’t enough variation in growth to implement a prefecture-level
promotion tournament. In other words, range restriction prevents the
Organization Department from implementing meritocratic promotion above
the county level. Running a successful county-level promotion tournament
precludes prefecture and provincial tournaments. Hence, the Organization
Department must use other criteria in determining promotions of
prefecture and provincial leaders.</p>
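<p>The range restriction step can be illustrated with a short simulation. This Python sketch uses made-up numbers (a normal distribution of growth rates and a 10% promotion share are my assumptions, not estimates), and it only shows the mechanical point that selecting the top performers shrinks the spread among those promoted.</p>

```python
import random
import statistics


def simulate_range_restriction(n_county=10000, promote_share=0.1, seed=1):
    """Compare the spread of growth among all county leaders with the spread
    among the top performers who are promoted."""
    rng = random.Random(seed)
    # Hypothetical county leaders: growth rates drawn from a normal distribution.
    growth = [rng.gauss(8.0, 2.0) for _ in range(n_county)]
    n_promoted = int(n_county * promote_share)
    promoted = sorted(growth, reverse=True)[:n_promoted]
    return statistics.pstdev(growth), statistics.pstdev(promoted)


sd_all, sd_promoted = simulate_range_restriction()
print(round(sd_all, 2), round(sd_promoted, 2))
```

<p>If growth performance carries over to the next level, a tournament among the promoted group has much less variation to rank leaders on, which is the sense in which a county-level tournament crowds out tournaments higher up.</p>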
<p>So county leaders are continuously incentivized to boost economic
growth, and only leaders with demonstrated growth-boosting ability are
promoted to prefecture and provincial positions. While they are not
directly incentivized, these prefecture and province leaders are
selected based on their ability to grow the economy, and they supervise
the county leaders in their prefecture/province. We can think of this as
a version of partial meritocracy, in contrast to a ‘maximal’ version
where leaders at all levels are incentivized through promotion
tournaments. While the maximal version provides the strongest incentives for
boosting GDP growth, the partial version does generate some incentives
as well.</p>
<p>Thus, despite the lack of evidence at higher levels of government,
meritocracy does partly explain China’s economic growth.</p>
<h2 id="footnotes">Footnotes</h2>
<p>Read my papers on meritocratic promotion: <a href="https://michaelwiebe.com/assets/ch1.pdf">null result</a> and <a href="https://michaelwiebe.com/assets/ch2.pdf">replications</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>There are six administrative levels in the Chinese government:
center, province, prefecture, county, township, and village. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Based on my experience replicating the prefecture literature, we
should wait to see more evidence before drawing firm conclusions for
county-level meritocracy (e.g., extending the sample period, trying
different promotion definitions). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Replicating the literature on meritocratic promotion in China2021-02-04T20:00:00+00:00http://michaelwiebe.com/blog/2021/02/replications<p>China has had double-digit economic growth for nearly three decades. How
can we explain this? In my dissertation, I studied one explanation that
is backed up by a large literature: meritocratic promotion. The idea is
that politicians compete in promotion tournaments, where the politician
with the highest GDP growth rate in their jurisdiction is rewarded by
being promoted. By tying promotion to economic growth, meritocratic
promotion creates strong incentives to boost GDP, and hence helps
explain China’s rapid growth.</p>
<p>When I collected data on prefecture politicians, however, I found no evidence for
meritocracy: there was no correlation between GDP growth and
promotion, despite trying many different models. How is this null result
consistent with the positive findings in the rest of the literature? To
find out, I replicated the main papers claiming evidence for prefecture-level meritocracy. Short answer:
the literature is wrong.</p>
<p>This post summarizes my replications. I find that the results in the
literature are not robust to reasonable specification changes, or are
due to data errors. You can find the full details, and a few more
replications, in the paper <a href="https://michaelwiebe.com/assets/ch2.pdf">here</a>.</p>
<h1 id="yao-and-zhang-2015">Yao and Zhang (2015)</h1>
<p><a href="https://sci-hub.st/https://doi.org/10.1007/s10887-015-9116-1">Yao and Zhang (2015)</a>, published in the Journal of Economic Growth, was
the first paper to study meritocratic promotion at the prefecture level
in China. They estimate a leader’s ability to grow GDP, and then estimate
the relationship between ability and promotion. If promotion is
meritocratic, we should see a positive correlation, as high-growth
leaders are promoted.</p>
<p>However, they find no average correlation between leader ability and
promotion: leaders with higher ability are not more likely to be
promoted. Despite this, the authors do not frame their paper as
contradicting the literature.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Moreover, this paper is cited in the
literature as supporting the meritocracy hypothesis.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>This is because the authors further test for an interaction between
leader ability and age, reporting a positive interaction effect that is
significant at the 5% level. Narrowing in on specific age thresholds,
they find that leader ability has the strongest effect on promotion for
leaders older than 51. They conclude that leader ability matters for
older politicians, because more years of experience produce a clearer
signal of ability.</p>
<p>Now, this result is consistent with a limited promotion tournament,
where the Organization Department promotes older leaders based on their
ability to boost growth (because older leaders have clearer signals of
ability), but applies different promotion criteria to younger leaders
(whose signals are too weak to detect). But this limited model
contradicts the usual characterization of China’s promotion tournament
as including all leaders, irrespective of age: in each province, leaders
compete to boost GDP growth, and the winners are rewarded with
promotion.</p>
<p>This is actually a big discrepancy, because half of all promotions occur
for leaders younger than 51. If the Organization Department cannot
measure ability for these young leaders, what criteria does it use to
promote them? Furthermore, remember that the original motivation was to
explain China’s rapid growth. The incentives generated by this limited
tournament are weaker, since the reward is only applied later in life;
if young leaders are impatient, they will discount this future reward
and put less effort into boosting growth. The limited tournament model
has less explanatory power.</p>
<p>At this point, it is not clear to me why this paper has been cited
without qualification as evidence for meritocratic promotion. It offers
no general support for meritocracy, and its model of a limited promotion
tournament partly contradicts the literature.</p>
<p>But I’m not stopping here. Finding a null average effect with a
significant interaction is a classic formula for p-hacked results in
social psychology. Since the age interaction doesn’t make much sense, I
don’t believe that the authors started out planning to run this test.
Rather, it looks like they wanted to find a positive average effect, but
didn’t. But they’d already invested a lot of time in collecting the data
and working out a clever identification strategy, so they found an
interaction that got them statistical significance, even if the
interpretation wasn’t really consistent. Hence, I <a href="https://michaelwiebe.com/blog/2020/11/pvalues">reject their p-value
as invalid</a>.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>And it turns out that this is the right call. Digging into the paper, I
find that the significant interaction term depends on including
questionable control variables.</p>
<p>When estimating leader ability, the authors regress GDP growth on three
fixed effects (leader, city, year) as well as three covariates: initial
city GDP per capita (by leader term), annual city population, and the
annual provincial inflation rate. I think it makes sense to control for
initial GDP by term. The model includes city effects, so level
differences in growth rates are not an issue. But we might worry that
the variance of idiosyncratic shocks to growth is correlated with city
size, and growth shocks could affect promotion outcomes.</p>
<p>However, it is not clear why population and inflation should be
included. The authors mention that labor migration can drive GDP growth
(p.413), but a leader’s policies affect migration, so population is
plausibly a mediator and hence a ‘bad control’, if leader ability affects growth
through good policies that increase migration. The authors provide no
justification for including inflation, which is odd because the
dependent variable (real per capita GDP growth) is already expressed in
real (rather than nominal) terms.</p>
<p>Given the lack of justification for including population and inflation
as covariates, I re-estimate leader ability controlling only for initial
GDP. Using this new estimate of ability, I then replicate their main
results. I again find a nonsignificant average effect of ability on
promotion. But now the interaction with age disappears. The sign remains
positive, but the magnitude of the coefficient drops by half, and the
results are nonsignificant.</p>
<p>So it turns out that Yao and Zhang (2015) offers no evidence for
meritocratic promotion of prefecture leaders.</p>
<h1 id="li-et-al-2019">Li et al. (2019)</h1>
<p><a href="https://sci-hub.st/https://doi.org/10.1093/ej/uez018">Li et al. (2019)</a>, published in the Economic Journal, studies GDP growth
targets and promotion tournaments in China. They start with the
observation that growth targets are higher at lower levels of the
administration; for example, prefectures set higher targets than do
provinces. Their explanation is that the number of jurisdictions
competing in each promotion tournament is decreasing as one moves down
the hierarchy, which increases the probability of a leader winning the
tournament. As a consequence, leaders exert more effort, and
higher-level governments can set higher growth targets without causing
leaders to quit.</p>
<p>As part of their model, they assume that promotion is meritocratic:
performance (measured by GDP growth) increases the probability of
promotion. Further, they report an original result: the effect of
performance on promotion is increasing in the growth target faced. That
is, a one percentage-point increase in growth will increase a mayor’s
chances of promotion by a larger amount when the provincial target
is higher, relative to when the target is lower.</p>
<p>This result seems naturally testable by interacting
\(Growth \times Target\) in a panel regression, with a predicted positive
coefficient on the interaction term. However, the authors argue that OLS
is invalid, instead reporting results based on maximum likelihood where
promotion is determined by a contest success function. Why does OLS not
apply? “Standard linear regression does not work here partly because
promotion is determined by local officials’ own growth rates as well as
by the growth rates of their competitors. The nonlinearity of the
promotion function is another factor that invalidates the OLS
estimation.” (p.2906)</p>
<p>But these are not problems for OLS. First, as is standard in this
literature, the promotion tournament can be captured by using prefecture
growth rates relative to the annual provincial growth rate. Second, OLS
is the best linear approximation to a nonlinear conditional expectation
function. So if there is a positive nonlinear relationship between
promotion and growth, we should expect that it will be detected by
OLS.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
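<p>A small simulation sketch of the second point, assuming a logistic promotion function (my choice for illustration, not the contest success function in Li et al.). The intercept of -1.5, the sample size, and the function names are hypothetical. Promotion is generated from a nonlinear function of relative growth, and a linear probability model is then fit by OLS.</p>

```python
import math
import random


def ols_slope(xs, ys):
    """Slope from a univariate OLS regression with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def lpm_slope_under_logit(effect, n=20_000, seed=0):
    """Fit a linear probability model to data from a logistic promotion rule."""
    rng = random.Random(seed)
    growth = [rng.gauss(0, 1) for _ in range(n)]  # relative growth
    promoted = []
    for g in growth:
        # Nonlinear (logistic) promotion probability; -1.5 is an assumed baseline.
        p = 1 / (1 + math.exp(-(-1.5 + effect * g)))
        promoted.append(1.0 if rng.random() < p else 0.0)
    return ols_slope(growth, promoted)
```

<p>With a positive <code>effect</code>, the OLS slope comes out clearly positive even though the true relationship is nonlinear; with <code>effect=0</code>, it is close to zero. Nonlinearity alone does not stop a linear model from detecting the sign of the relationship.</p>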
<p>Given the lack of justification for omitting results from linear
regression, I replicate their results using a linear probability model
and logistic regression. First, I test the generic meritocracy
hypothesis. I find that GDP growth has no average effect on promotion.
Next, I do find a positive interaction effect between growth and growth
target, but it’s not statistically significant.</p>
<p>This doesn’t look good for the authors. OLS is the default method, and
you need a strong justification for not reporting it. But their reasons
are flimsy. Now it looks like they tried OLS, didn’t get the result they
wanted, then made up a complicated maximum likelihood model that
delivered significance.</p>
<p>So Li et al. (2019) is another paper that claims to provide evidence for
meritocratic promotion of prefecture leaders, but is unable to back up
those claims.</p>
<h1 id="chen-and-kung-2019">Chen and Kung (2019)</h1>
<p><a href="https://sci-hub.st/https://doi.org/10.1093/qje/qjy027">Chen and Kung (2019)</a>, published in the Quarterly Journal of Economics,
studies land corruption in China, with secondary results on meritocratic
promotion. The main result is that local politicians provide price
discounts on land sales to firms connected to Politburo members, and
these local politicians are in turn rewarded with promotion up the
bureaucratic ladder.</p>
<p>For provincial leaders, they find a strong effect of land sales on
promotion for secretaries, but not for governors. In contrast, GDP
growth strongly predicts promotion for governors, but not secretaries.
They conclude that “the governor has to rely on himself for promotion,
specifically by improving economic performance or GDP growth in his
jurisdiction [...] only the provincial party secretaries are being
rewarded for their wheeling and dealing".</p>
<p>They find similar results at the prefecture level: land deals predict
promotion for secretaries, but not for mayors, while GDP growth predicts
promotion for mayors, but not for secretaries. Overall, this supports
the model of party secretaries being responsible for social policy,
while governors (and mayors) are in charge of the economy, with
performance on these tasks determining promotion. Thus, at both province
and prefecture levels, government leaders (governors and mayors) compete
in a promotion tournament based on GDP growth, while party secretaries
do not.</p>
<p>However, Chen and Kung (2019)’s results for prefecture mayors are
questionable, because their promotion data seems wrong. In my data, the
annual promotion rate varies from 5 to 30% (peaking in Congress years),
while the Chen and Kung (2019) data never exceeds 15% and has six years
where the promotion rate is less than 2%. Figure 1 compares the annual promotion rate
from Chen and Kung to my own data as well as the data from Yao and Zhang
(2015) and Li et al. (2019), where each paper uses a binary promotion
variable (and data on prefecture mayors). While the latter three sources broadly agree on the promotion
rate, the Chen and Kung data is a clear outlier. This is obviously
suspect.</p>
<p><img src="https://michaelwiebe.com/assets/replication/promotion_all_raw_unbalanced.png" alt="" width="95%" /></p>
<p>Furthermore, upon investigating this discrepancy, I discovered apparent
data errors in their promotion variable. The annual promotion variable
is defined to be 1 in the year a mayor is promoted, and 0 otherwise.
However, out of the 201 cases with \(Promotion=1\), 124 occur <em>before</em> the
mayor’s last year in office (with the remaining 77 cases occurring in the
last year). Moreover, this variable is equal to 1 multiple times per
spell in 4% of leader spells. Out of 1216 spells, 51 spells have
\(Promotion=1\) more than once per spell. For example, consider a mayor
who is in office for five years and then promoted; the promotion
variable should be 0 in the first four years, then 1 in the final year.
However, the Chen and Kung data has spells where the promotion variable
is, for example, 0 in the first two years, and 1 in the final three
years.</p>
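<p>The two consistency checks described above can be sketched as follows. This is an illustrative Python version, not the code I ran: the <code>(spell_id, year, promotion)</code> row layout and the function name are assumptions, and they do not reflect the format of the Chen and Kung replication files.</p>

```python
from collections import defaultdict


def flag_promotion_errors(rows):
    """Flag spells whose annual promotion dummy is internally inconsistent.

    `rows` is an iterable of (spell_id, year, promotion) tuples; this layout
    is an assumption for illustration, not the format of any replication file.
    """
    spells = defaultdict(list)
    for spell_id, year, promotion in rows:
        spells[spell_id].append((year, promotion))

    flagged = {}
    for spell_id, obs in spells.items():
        promos = [p for _, p in sorted(obs)]  # order each spell by year
        problems = []
        if sum(promos) > 1:
            problems.append("promotion coded more than once")
        if any(promos[:-1]):
            problems.append("promotion before final year")
        if problems:
            flagged[spell_id] = problems
    return flagged
```

<p>A spell coded 0, 0, 0, 0, 1 passes both checks, while a spell coded 0, 0, 1, 1, 1 is flagged on both counts.</p>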
<p>To fix this error, I obtained the raw mayor data from James Kung, and
used it to generate a corrected annual promotion variable, which is 1
only in a mayor’s final year in office (when the mayor is promoted).
This data-coding error more than <em>doubles</em> the number of promotions. But
since the Chen and Kung promotion rate is smaller than the rest of the
literature, fixing the data errors in fact makes the disagreement with
the literature even <em>more</em> pronounced.</p>
<p>So this promotion data looks pretty lousy. Naturally, we should worry
that their data is driving their finding of meritocratic promotion for
prefecture mayors. To test this, I re-run their analysis using my own
promotion data. I find that the correlation between GDP growth and
promotion is now negative and nonsignificant. So just like the other two
papers, Chen and Kung (2019) also fails to provide evidence for
meritocratic promotion of prefecture leaders.</p>
<p>This is extremely suspicious. Speculating, it looks like the authors had a
nice paper using provincial data, but a referee asked them to extend it
to prefecture leaders. To fit their story, they needed to find an effect
of land sales for secretaries (but not mayors), and an effect of GDP
growth for mayors (but not secretaries). But maybe the data didn’t agree,
and their RA had to falsify the mayor promotion data to get the ‘correct’ result.
This wouldn’t be easy for referees to spot, since the replication files didn’t include spell-level data.
But how else did they collect such error-ridden data that also just
happened to produce results consistent with their story?</p>
<h1 id="conclusion">Conclusion</h1>
<p>The original study of meritocratic promotion for provincial leaders, <a href="https://sci-hub.st/https://doi.org/10.1016/j.jpubeco.2004.06.009">Li
and Zhou (2005)</a>, has been cited over 2500 times. But follow-up work has
repeatedly failed to confirm its finding of a positive correlation
between provincial GDP growth and promotion.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> And as I have shown in
this post, attempts to extend the meritocracy story down to prefecture
leaders have also failed.</p>
<p>How did this happen? How could a whole literature get this wrong?</p>
<p>Here’s my guess: researchers set a strong prior based on the provincial
result in Li and Zhou (2005), combined with the elegance of the
theoretical model of a promotion tournament. Since the idea of a
promotion tournament is generic, researchers naturally expected it to
apply to prefecture and county politicians as well. In short,
researchers doing follow-up work knew that they had to confirm the
original results.</p>
<p>However, when they studied prefecture leaders and didn’t find a positive
correlation between growth and promotion, the researchers had to fiddle
around with their models and data until they got a result that matched
the original. And given the multiplicity of design choices<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>, it
wasn’t that difficult to find a specification that yielded statistical
significance.</p>
<p>But why not embrace the null result and contradict the literature? After
all, this is a case where a null result would be interesting, with
adequate statistical power and a well-established consensus. I guess it
was just easier to shoehorn their results to fit in with the literature,
and get the publication, rather than challenge the consensus.</p>
<p>My conclusion is that publication incentives, conformism, and inadequate
peer review led to a literature of false results.</p>
<h2 id="footnotes">Footnotes</h2>
<p>Read the full paper <a href="https://michaelwiebe.com/assets/ch2.pdf">here</a>.
My null result paper is <a href="https://michaelwiebe.com/assets/ch1.pdf">here</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>“We also improve on the existing literature on the promotion
tournament in China. Using the leader effect estimated for a
leader’s contribution to local growth as the predictor for his or
her promotion, we refine the approach of earlier studies.” (Yao and
Zhang 2015, p.430) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>For example, Chen and Kung (2016): “those who are able to grow
their local economies the fastest will be rewarded with promotion to
higher levels within the Communist hierarchy [...] Empirical
evidence has indeed shown a strong association between GDP growth
and promotion ([...] Yao and Zhang, 2015)". <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In a <a href="https://michaelwiebe.com/blog/2020/11/pvalues">previous post</a>, I discussed how p-values involve the thought experiment of running the exact same test on many samples of data.
When designing a test, researchers need to follow a procedure that is consistent with this thought experiment. In particular, they need to design the test independently of the data; this guarantees that they would run the same test on different samples.
As <a href="https://stat.columbia.edu/~gelman/research/published/ForkingPaths.pdf">Gelman and Loken</a> put it: “For a p-value to be interpreted as evidence, it requires a strong claim that the same analysis would have been performed had the data been different.”</p>
<p>As it happens, Yao has recently posted a <a href="https://www.semanticscholar.org/paper/The-Competence-Loyalty-Tradeoff-in-China%E2%80%99s-Wang-Yao/e43c2d1adff340d9c79ba15da6071f7f913a61d6">working paper</a> re-using the method in Yao and Zhang (2015).
Like the first paper, the new one also studies how ability affects promotion for prefecture-level leaders, using the same approach to estimate leader effects. Importantly, they update their data on prefecture cities by extending the time series from 2010 to 2017.
Thus, we have a perfect test case to see whether the same data-analysis decisions would be made when studying the same question and using a different dataset (drawn from the same population).</p>
<p>It turns out that the new paper doesn’t interact with age at all!
Instead, it reports the average effect of ability on promotion, which is now significant, along with a new specification where ability is interacted with political connections (see Table 2).
So the p-value requirement is not satisfied: the researcher performs different analyses when the data is different.
Hence, our skepticism of the original age interaction turns out to be justified.
Since the researcher would not run the same test on new samples, the significant p-value is actually invalid and does not count as evidence. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>One of the authors, Li-An Zhou, was also an author on the first
paper on meritocratic promotion, Li and Zhou (2005). That paper used
an ordered probit model, so it is curious that they didn’t employ
the same model again here. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Su et al. (2012) claims that the results in Li and Zhou (2005)
don’t replicate, after fixing data errors. Shih et al. (2012) finds
that political connections, rather than economic growth, explain
promotion. Jia et al. (2015) finds no average effect, but does
report an interaction effect with political connections. Sheng
(2020) finds a meritocratic effect, but only for provincial
governors during the Jiang Zemin era (1990-2002). In my
dissertation, I replicate this paper using the data from Jia et al.
(2015); I find no effect. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Here are a few of the <a href="https://stat.columbia.edu/~gelman/research/published/ForkingPaths.pdf">researcher degrees of freedom</a> available when studying meritocratic promotion: promotion definitions; growth
definitions (annual vs. cumulative average vs. average GDP growth,
absolute vs. relative GDP growth [relative to predecessor vs.
relative to provincial average vs. relative to both], real vs.
nominal GDP, level vs. per capita GDP); regression models (LPM vs.
probit/logit vs. ordered probit/logit vs. AKM leader effects vs. MLE
with contest success function vs. proportional hazards model);
interactions (with age, political connections [hometown vs. college
vs. workplace], provinces of corrupt politicians, time period); data
construction (annual vs. spell-level), and so on. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Dropping 1% of the data kills false positives2021-01-26T20:00:00+00:00http://michaelwiebe.com/blog/2021/01/amip<p>How robust are false positives to dropping 1% of your sample? Turns out, not at all.</p>
<p>Rachael Meager and co-authors have a <a href="https://twitter.com/economeager/status/1338525095724261378">paper</a> with a new robustness metric based on dropping a small fraction of the sample.
It’s called the Approximate Maximum Influence Perturbation (AMIP).
Basically, their algorithm finds the observations that, when dropped, have the biggest <a href="https://en.wikipedia.org/wiki/Influential_observation">influence</a> on an estimate.
It calculates the smallest fraction required to change an estimate’s significance, sign, and both significance and sign.
In other words, if you have a significant positive result, it calculates the minimum fractions of data you need to drop in order to (1) kill significance, (2) get a negative result, and (3) get a significant negative result.
The intuition here is to check whether there are influential observations that are driving a result.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
And influence is related to the signal-to-noise ratio, where the signal is the true effect size and the noise is the relative variance of the residuals and the regressors.</p>
<p>In a <a href="https://michaelwiebe.com/blog/2021/01/phack">previous post</a>, I explored how p-hacked false positives can be robust to control variables.
In this post, I want to see how p-hacked results fare under this new robustness test.</p>
<h2 id="robustness-of-true-effects">Robustness of true effects</h2>
<p>First, let’s show that real effects are robust to dropping data.
I generate data according to:</p>
\[\tag{1} y_{i} = \beta X_{i} + \varepsilon_{i},\]
<p>where \(X\) and \(\varepsilon\) are each distributed \(N(0,1)\).
I then apply the AMIP algorithm.
I repeat this process for different values of \(\beta\), and the results are shown in Figure 1.</p>
<p><img src="https://michaelwiebe.com/assets/amip/true_b.png" alt="" width="100%" /></p>
<p>We see that as the effect size increases, the fraction of data that must be dropped to flip a condition also increases. When \(\beta=0.2\), we need to drop more than 5% of the data to kill significance. This makes sense, because a larger true effect size increases the signal and hence the signal-to-noise ratio.</p>
<h2 id="robustness-of-p-hacked-results">Robustness of p-hacked results</h2>
<p>Next let’s see how robust a p-hacked result is. Now I use data</p>
\[\tag{2} y_{i} = \sum_{k=1}^{K} \beta_{k} X_{k,i} + \gamma z_{i} + \varepsilon_{i}.\]
<p>We have \(K\) potential treatment variables, \(X_{1,i}\)
to \(X_{K,i}\), and a control variable \(z_{i}\). I draw
\(X_{k,i}\), \(z_{i}\), and \(\varepsilon_{i}\) from \(N(0,1)\).
I set \(\beta_{k}=0\) for all \(k\), so that \(X_{k}\) has no effect
on \(y\), and the true model is</p>
\[\tag{3} y_{i} = \gamma z_{i} + \varepsilon_{i}.\]
<p>I’m going to p-hack using the \(X_{k}\)’s, running \(K=20\) univariate regressions of \(y\) on \(X_{k}\) and selecting the one with the smallest p-value.
Then I run the AMIP algorithm to calculate the fraction of data needed to kill significance, etc.</p>
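<p>For a sense of how easy this kind of p-hacking is, here is a rough Python sketch of the selection step (my actual code is in R and linked in the footnotes; this is a separate illustration). The sample size, the value of \(\gamma\), the number of simulations, and the use of a normal critical value of 1.96 in place of the exact t critical value are all assumptions. With 20 independent tests at the 5% level, the chance that at least one is significant is \(1 - 0.95^{20} \approx 0.64\).</p>

```python
import random

CRIT = 1.96  # normal approximation to the 5% two-sided critical value


def analytic_rate(k=20, alpha=0.05):
    """Chance that at least one of k independent null tests is significant."""
    return 1 - (1 - alpha) ** k


def t_stat(xs, ys):
    """t-statistic for the slope in a univariate OLS regression with intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    resid_var = (syy - slope * sxy) / (n - 2)
    return slope / (resid_var / sxx) ** 0.5


def phack_false_positive_rate(n=100, k=20, gamma=0.5, sims=300, seed=0):
    """Share of simulations where the best of k null regressors is significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        # True model: y depends on z only; every X_k has zero effect.
        y = [gamma * rng.gauss(0, 1) + rng.gauss(0, 1) for _ in range(n)]
        # Smallest p-value is the same as largest absolute t-statistic.
        best = max(
            abs(t_stat([rng.gauss(0, 1) for _ in range(n)], y)) for _ in range(k)
        )
        hits += best > CRIT
    return hits / sims
```

<p>The simulated rate should land near the analytic value, slightly above it because 1.96 is a little too lenient for \(N=100\).</p>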
<p>In my <a href="https://michaelwiebe.com/blog/2021/01/phack">previous post</a> on p-hacking, we learned that when \(\gamma\) is small, the partial-\(R^{2}(z)\) is small, and controlling for \(z\) is not able to kill coincidental false positives.
To see whether dropping data is a better robustness check, I repeat the above process for different values of \(\gamma\).</p>
<p><img src="https://michaelwiebe.com/assets/amip/falsepos_g.png" alt="" width="100%" /></p>
<p>The results are in Figure 2.
First, notice that we lose significance after dropping a tiny fraction of the data: about 0.3%. For \(N=1000\), that means 3 observations are driving significance.</p>
<p>Second, we see that the fraction dropped doesn’t vary with \(\gamma\) at all.
This is good news: previously, we saw that control variables only kill false positives when they have high partial-\(R^{2}\).
But dropping influential observations is equally effective for any value of \(\gamma\).
So dropping data is an effective robustness check where control variables fail.</p>
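<p>To give some intuition for what the metric does, here is a deliberately simplified Python sketch. It is not the AMIP algorithm and not my R code: it handles only a univariate regression, ranks observations once by their first-order influence on the slope, and then refits exactly after dropping the top \(k\). The function names, the 1.96 cutoff, and the <code>max_frac</code> cap are assumptions.</p>

```python
def fit(xs, ys):
    """Univariate OLS: return slope, t-stat, and per-observation influence."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
    resid_var = sum(e * e for e in resid) / (n - 2)
    t = slope / (resid_var / sxx) ** 0.5
    # First-order effect of each observation on the slope:
    # (x_i - mean(x)) * residual_i / Sxx.
    influence = [(x - mx) * e / sxx for x, e in zip(xs, resid)]
    return slope, t, influence


def drops_to_lose_significance(xs, ys, crit=1.96, max_frac=0.1):
    """Smallest number of top-influence drops that makes |t| fall below crit.

    Returns 0 if the estimate is not significant to begin with, and None if
    significance survives dropping up to max_frac of the sample.
    """
    slope, t, influence = fit(xs, ys)
    if abs(t) < crit:
        return 0
    sign = 1 if slope > 0 else -1
    # Rank observations by how strongly they push the slope away from zero.
    order = sorted(range(len(xs)), key=lambda i: sign * influence[i], reverse=True)
    for k in range(1, int(max_frac * len(xs)) + 1):
        keep = sorted(order[k:])
        _, t_k, _ = fit([xs[i] for i in keep], [ys[i] for i in keep])
        if abs(t_k) < crit:
            return k
    return None
```

<p>On data where a single outlier creates the whole correlation, this returns 1; on data with a strong true effect, significance survives dropping a small fraction of the sample and the function returns <code>None</code>.</p>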
<p>Overall, dropping data looks like an effective robustness check against coincidental false positives.
Hopefully this metric becomes a widely used robustness check, and will help root out bad research.</p>
<hr />
<p>Update (Nov. 5, 2021): Ryan Giordano gives a formal explanation <a href="https://rgiordan.github.io/robustness/2021/09/17/amip_p_hacking.html">here</a>.</p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<p>See <a href="https://github.com/maswiebe/metrics/blob/main/amip.r">here</a> for R code.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>In the univariate case, influence = leverage \(\times\) residual. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Is randomization inference a robustness check? For what?2021-01-23T20:00:00+00:00http://michaelwiebe.com/blog/2021/01/randinf<p>I’ve seen a few papers that use <a href="https://egap.org/resource/10-things-to-know-about-randomization-inference/">randomization inference</a> as a robustness check. They permute their treatment variable many times, and estimate their model for each permutation, producing a null distribution of estimates. From this null distribution we can calculate a randomization inference (RI) p-value as the fraction of estimates that are more extreme than the original estimate. (This works because under the null hypothesis of no treatment effect, switching a unit from the treatment to the control group has no effect on the outcome.) These papers show that RI p-values are similar to their conventional p-values, and conclude that their results are robust.</p>
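<p>The permutation procedure just described can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: the test statistic is a difference in means rather than a regression coefficient, the p-value is two-sided, and the function names and the default of 2000 permutations are my own choices.</p>

```python
import random


def diff_in_means(outcome, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for y, d in zip(outcome, treated) if d]
    c = [y for y, d in zip(outcome, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)


def ri_p_value(outcome, treated, n_perm=2000, seed=0):
    """Share of permuted estimates at least as extreme as the original one."""
    rng = random.Random(seed)
    observed = abs(diff_in_means(outcome, treated))
    labels = list(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign treatment at random, outcomes held fixed
        if abs(diff_in_means(outcome, labels)) >= observed:
            extreme += 1
    return extreme / n_perm
```

<p>When treated units have much higher outcomes than controls, almost no permutation matches the original estimate and the RI p-value is near zero. When the two groups have identical outcomes, every permutation is at least as extreme and the RI p-value is 1.</p>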
<p>But robust to what, exactly?</p>
<p>Consider the case of using control variables as a robustness check. When adding controls to a regression, we’re showing that our result is not driven by possible confounders. If the coefficient loses significance, we conclude that the original effect was spurious. But if the coefficient is stable and remains significant, then we conclude that the effect is not driven by confounding, and we say that it is robust to controls (at least, the ones we included).</p>
<p>Returning to randomization inference, suppose our result is significant using conventional p-values (\(p<0.05\)), but not with randomization inference (\(p_{RI}>0.05\)). What’s happening here? <a href="https://sci-hub.st/https://academic.oup.com/qje/article/134/2/557/5195544">Young (2019)</a> says that conventional p-values can have ‘size distortions’ when the sample size is small and treatment effects are heterogeneous, resulting in concentrated <a href="https://en.wikipedia.org/wiki/Leverage_(statistics)">leverage</a>. This means that the <a href="https://en.wikipedia.org/wiki/Size_(statistics)">size</a>, AKA the false positive rate \(P(\)reject \(H_{0} \mid H_{0}) = P(p<\alpha \mid H_{0})\), is higher than the nominal significance level \(\alpha\).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> For instance, using \(\alpha =0.05\), we might have a false positive rate of \(0.1\). In this case, conventional p-values are invalid.</p>
<p>By comparison, RI has smaller size distortions. It performs better in settings of concentrated leverage, since it uses an exact test (with a known distribution for any sample size \(N\)), and hence doesn’t rely on convergence as \(N\) grows large. See Young (2019) for details. Upshot: we can think of RI as a robustness test for finite sample bias (in otherwise asymptotically correct variance estimates).</p>
<p>So in the case where we lose significance using RI (\(p<0.05\) and \(p_{RI}>0.05\)), we infer that the original result was driven by finite sample bias. In contrast, if \(p \approx p_{RI} <0.05\), then we conclude that the result is not driven by finite sample bias.</p>
<p>So RI is a useful robustness check for when we’re worried about finite sample bias.
However, this is not the justification I’ve seen when papers use RI as a robustness check.</p>
<h1 id="randomization-inference-in-gavrilova-et-al-2019">Randomization inference in <a href="https://sci-hub.st/https://onlinelibrary.wiley.com/doi/abs/10.1111/ecoj.12521">Gavrilova et al. (2019)</a></h1>
<p>This paper (published in Economic Journal) studies the effect of medical marijuana legalization on
crime, finding that legalization decreases crime in states that border
Mexico. The paper uses a triple-diff method, essentially doing a
diff-in-diff for the effect of legalization on crime, then adding an
interaction for being a border state.</p>
<p>Here’s how they describe their randomization inference exercise (p.24):</p>
<blockquote>
<p>We run an in-space placebo test to test whether the control states form
a valid counterfactual to the treatment states in the absence of
treatment. In this placebo-test, we randomly reassign both the treatment
and the border dummies to other states. We select at random four states
that represent the placebo-border states. We then treat three of them in
1996, 2007 and 2010 respectively, coinciding with the actual treatment
dates in California, Arizona and New Mexico. We also randomly reassign
the inland treatment dummies and estimate [Equation] (1) with the
placebo dummies rather than the actual dummies. [...]</p>
</blockquote>
<blockquote>
<p>If our treatment result is driven by strong heterogeneity in trends, the
placebo treatments will often find an effect of similar magnitude and
our baseline coefficient of -107.98 will be in the thick of the
distribution of placebo-coefficients. On the other hand, if we are
measuring an actual treatment effect, the baseline coefficient will be
in the far left tail of the distribution of placebo-coefficients.
[...]</p>
</blockquote>
<blockquote>
<p>Our baseline-treatment coefficient is in the bottom 3rd-percentile of
the distribution. This result is consistent with a p-value of about 0.03
using a one-sided t-test.</p>
</blockquote>
<p>At first glance, this reasoning seems plausible. I believed it when I
first read this paper. The only problem is that it’s wrong.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>To prove this, I run a simulation with differential pre-trends (i.e., a
violation of the common trends assumption). I simulate panel data for 50
states over 1995-2015 according to:</p>
\[\tag{1} y_{st} = \beta D_{st} + \gamma_{s} + \gamma_{t} + \lambda \times D_{s} \times (t-1995) + \varepsilon_{st}.\]
<p>Here \(D_{st}\) is a treatment dummy, equal to 1 in the years a state is
treated. Ten states are treated, with treatment years selected randomly
in 2000-2010; so this is a staggered rollout diff-in-diff. I draw state
and year fixed effects separately from \(N(0,1)\). To generate a false
positive, I set \(\beta=0\) and include a time trend only for the
treatment group: \(\lambda \times D_{s} \times (t-1995)\), where
\(\lambda\) is the value of the trend, and \(D_{s}\) is an ever-treated
indicator.</p>
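<p>Here is a minimal sketch of this data-generating process and the diff-in-diff estimate, in stdlib-only Python rather than the R code linked in the footnotes. The function names, the default noise scale <code>sigma=1</code>, and the choice to estimate the two-way fixed effects slope by double demeaning (which is valid only because the panel is balanced) are assumptions made for illustration.</p>

```python
import random

def simulate_panel(lam, beta=0.0, sigma=1.0, n_states=50, n_treated=10,
                   years=range(1995, 2016), rng=None):
    """Return (y, d): dicts keyed by (state, year) with the outcome and treatment dummy."""
    rng = rng or random.Random(0)
    years = list(years)
    state_fe = [rng.gauss(0, 1) for _ in range(n_states)]
    year_fe = {t: rng.gauss(0, 1) for t in years}
    # The first n_treated states are ever-treated; each gets a random start year.
    start = {s: rng.randint(2000, 2010) for s in range(n_treated)}
    y, d = {}, {}
    for s in range(n_states):
        ever = 1 if s in start else 0
        for t in years:
            d[s, t] = 1 if (ever and t >= start[s]) else 0
            y[s, t] = (beta * d[s, t] + state_fe[s] + year_fe[t]
                       + lam * ever * (t - 1995) + rng.gauss(0, sigma))
    return y, d

def twfe_estimate(y, d):
    """Two-way fixed effects slope via double demeaning (balanced panel only)."""
    states = sorted({s for s, _ in y})
    years = sorted({t for _, t in y})

    def demean(v):
        s_mean = {s: sum(v[s, t] for t in years) / len(years) for s in states}
        t_mean = {t: sum(v[s, t] for s in states) / len(states) for t in years}
        g_mean = sum(v.values()) / len(v)
        return {(s, t): v[s, t] - s_mean[s] - t_mean[t] + g_mean for (s, t) in v}

    yd, dd = demean(y), demean(d)
    return sum(dd[k] * yd[k] for k in dd) / sum(dd[k] ** 2 for k in dd)

if __name__ == "__main__":
    for lam in (0.0, 0.1):
        y, d = simulate_panel(lam=lam)
        print(f"lambda={lam}: estimated effect = {twfe_estimate(y, d):.3f} (true effect is 0)")
```

<p>With \(\lambda=0\) the estimate should hover around the true value of zero, while with \(\lambda>0\) the treated group's trend loads onto the treatment dummy. This sketch does not compute clustered standard errors, so it only illustrates the bias in the coefficient, not the rejection rates.</p>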
<p>Because the common trends assumption is not satisfied, diff-in-diff will
wrongly estimate a significant treatment effect. According to Gavrilova
et al., randomization inference will yield a large, nonsignificant
p-value, since the placebo treatments will also pick up the differential
trends and give similar estimates.</p>
<p>But this doesn’t happen. When I set a trend value of \(\lambda=0.1\), I
get a significant diff-in-diff estimate 100% of the time, using both
conventional and RI p-values. The actual coefficient is always in the
tail of the randomization distribution, and not "in the thick" of it.
The false positive is just as significant when using RI.</p>
<p>More generally, when varying the trend, I find that \(p \approx p_{RI}\).
Figure 1 shows average rejection rates and p-values for the diff-in-diff
estimate across 100 simulations, for different values of differential
trends. We see that, on average, conventional and RI p-values are almost
identical. As a result, the rejection rates are also similar.</p>
<p><img src="https://michaelwiebe.com/assets/randinf/trends.png" alt="" width="100%" /></p>
<p>From the discussion above, we expect \(p_{RI}\) to differ when leverage is
concentrated, due to small sample size and heterogeneous effects. Since
this is not the case here, RI and conventional p-values are similar.</p>
<p>So Gavrilova et al.’s small randomization inference p-value does not
prove that their result isn’t driven by differential pre-trends. False
positives driven by differential trends also have small RI p-values.
Randomization inference is a robustness check for finite sample bias,
and nothing more.</p>
<hr />
<h1 id="appendix-p-hacking-simulations">Appendix: p-hacking simulations</h1>
<p>Here I run simulations where I p-hack a significant result in different
setups, to see whether randomization inference can kill a false
positive. I calculate an RI p-value by reshuffling the treatment variable
and calculating the t-statistic. I repeat this process 1000 times, and
calculate \(p_{RI}\) as the fraction of randomized t-statistics that are
larger (in absolute value) than the original t-statistic. According to
Young (2019), using the t-statistic produces better performance than
using the coefficient.</p>
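<p>The reshuffling procedure can be sketched as follows. This is a stdlib-only Python illustration for the single-regressor OLS case, not the R code linked in the footnotes; the names <code>t_stat</code> and <code>ri_p_value</code> are hypothetical.</p>

```python
import math
import random

def t_stat(x, y):
    """t-statistic on the slope from an OLS regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return slope / se

def ri_p_value(x, y, draws=1000, rng=None):
    """Share of reshuffled |t| values that are larger than the observed |t|."""
    rng = rng or random.Random(0)
    t_obs = abs(t_stat(x, y))
    shuffled = list(x)
    exceed = 0
    for _ in range(draws):
        rng.shuffle(shuffled)          # break any link between treatment and outcome
        if abs(t_stat(shuffled, y)) > t_obs:
            exceed += 1
    return exceed / draws

if __name__ == "__main__":
    rng = random.Random(42)
    x = [rng.gauss(0, 1) for _ in range(200)]
    y = [rng.gauss(0, 1) for _ in range(200)]   # true effect of x is zero
    print("RI p-value under the null:", ri_p_value(x, y))
```

<p>Only the treatment vector is permuted; the outcome stays fixed, so each reshuffle gives one draw from the distribution of the t-statistic under the sharp null of no effect.</p>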
<h2 id="1-simple-ols">(1) Simple OLS</h2>
<h3 id="constant-effects-beta0">Constant effects: \(\beta=0\)</h3>
<p>Data-generating process (DGP) with \(\beta=0\):</p>
\[\tag{2} y_{i} = \sum_{k=1}^{K} \beta_{k} X_{k,i} + \varepsilon_{i} = \varepsilon_{i}\]
<p>I p-hack a significant result by regressing \(y\) on \(X_{k}\) for
\(k=1,\dots,K\) with \(K=20\), and selecting the \(X_{k}\) with the smallest p-value. I use
\(N=1000\) and a significance level of \(\alpha=0.05\).</p>
<p>Running 1000 simulations, I find:</p>
<ul>
<li>
<p>An average rejection rate of 5.1% for the \(K=20\) regressions.</p>
</li>
<li>
<p>653 estimates (65%) are significant.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
</li>
<li>
<p>Out of these 653, 638 (98%) are significant using RI.</p>
</li>
</ul>
<p>In this simple vanilla setup, RI and classical p-values are basically
identical. So randomization inference is completely ineffective at
killing false positives in a setting with large sample size and
homogeneous effects.</p>
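<p>A rough way to reproduce the p-hacking step of this setup is sketched below in stdlib-only Python. It is not the linked R code: the function names are hypothetical, the p-value uses a normal approximation to the t distribution, and the default sample size is smaller than the \(N=1000\) used above to keep the run time short.</p>

```python
import math
import random

def slope_t(x, y):
    """t-statistic on the slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    ssr = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    return slope / math.sqrt(ssr / (n - 2) / sxx)

def p_normal(t):
    """Two-sided p-value using the normal approximation (an assumption; fine for large n)."""
    return math.erfc(abs(t) / math.sqrt(2))

def phack_once(n, k, rng):
    """Regress pure-noise y on k unrelated regressors; return the smallest p-value."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best = 1.0
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]
        best = min(best, p_normal(slope_t(x, y)))
    return best

def share_significant(sims, n, k, alpha=0.05, seed=0):
    """Share of simulations in which the p-hacked (smallest) p-value is below alpha."""
    rng = random.Random(seed)
    hits = sum(phack_once(n, k, rng) < alpha for _ in range(sims))
    return hits / sims

if __name__ == "__main__":
    print("share with a p-hacked significant result:",
          share_significant(sims=300, n=100, k=20))
```

<p>The share should land in the neighbourhood of the roughly 65% reported above; passing the selected regressor to a reshuffling routine would then give the RI comparison.</p>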
<h3 id="heterogeneous-effects-beta_i-sim-n01">Heterogeneous effects: \(\beta_{i} \sim N(0,1)\)</h3>
<p>DGP with \(\beta_{k,i} \sim N(0,1)\):</p>
\[\tag{3} y_{i} = \sum_{k=1}^{K} \beta_{k,i} X_{k,i} + \varepsilon_{i}\]
<p>Again, I p-hack by cycling through the \(X_{k}\)’s and selecting the most
significant one.</p>
<p>From 1000 simulations, I find:</p>
<ul>
<li>
<p>An average rejection rate of 5.2%.</p>
</li>
<li>
<p>645 estimates (65%) are significant.</p>
</li>
<li>
<p>Out of these 645, 638 (99%) are significant using RI.</p>
</li>
</ul>
<p>Even with heterogeneous effects, \(N=1000\) is enough to avoid finite
sample bias, so RI p-values are no different from conventional ones.</p>
<h2 id="2-difference-in-differences">(2) Difference in differences</h2>
<p>Next I simulate panel data and estimate a diff-in-diff model as above,
but with no differential trends.</p>
<h3 id="constant-effects-beta0-1">Constant effects: \(\beta=0\)</h3>
<p>DGP:</p>
\[\tag{4} y_{st} = \beta D_{st} + \gamma_{s} + \gamma_{t} + \varepsilon_{st}\]
<p>I simulate panel data for 50 states over 1995-2015. 10 states are treated,
with treatment years selected randomly in 2000-2010; so this is a
staggered rollout diff-in-diff. I draw state and year fixed effects
separately from \(N(0,1)\). To generate a false positive, I set \(\beta=0\).</p>
<p>I p-hack a significant result by regressing \(y\) on \(K=20\) different treatment
assignments \(D_{k,st}\) in a two-way fixed effects model, and selecting
the regression with the smallest p-value. I cluster standard errors at
the state level.</p>
<p>From 1000 simulations, I get</p>
<ul>
<li>
<p>An average rejection rate of 7.9%.</p>
</li>
<li>
<p>815 estimates (82%) are significant.</p>
</li>
<li>
<p>Out of these 815, 653 (80%) are significant using RI.</p>
</li>
</ul>
<p>So now we are seeing a size distortion using conventional p-values: out
of the \(K=20\) regressions, 7.9% are significant, instead of the expected
5%. This appears to be driven by a small sample size and imbalanced
treatment: 10 out of 50 states are treated. When I redo this exercise
with \(N=500\) states and 250 treated, I get the expected 5% rejection
rate.</p>
<p>However, RI doesn’t seem to be much of an improvement, as the majority
of false positives remain significant using \(p_{RI}\).</p>
<h3 id="heterogeneous-effects-beta_s-sim-n01">Heterogeneous effects: \(\beta_{s} \sim N(0,1)\)</h3>
<p>DGP:</p>
\[\tag{5} y_{st} = \beta_{s} D_{st} + \gamma_{s} + \gamma_{t} + \varepsilon_{st}\]
<p>Now I repeat the same exercise, but with the 10 treated states having
treatment effects drawn from \(N(0,1)\).</p>
<p>From 1000 simulations, I get</p>
<ul>
<li>
<p>An average rejection rate of 7.7%.</p>
</li>
<li>
<p>790 estimates (79%) are significant.</p>
</li>
<li>
<p>Out of these 790, 687 (87%) are significant using RI.</p>
</li>
</ul>
<p>With heterogeneous effects as well as imbalanced treatment, RI performs
even worse at killing the false positive.</p>
<h2 id="what-does-it-take-to-get-p_ri-neq-p">What does it take to get \(p_{RI} \neq p\)?</h2>
<p>I can get the RI p-value to differ by 0.1 (on average) when using
\(N=20\), \(X \sim B(0.5)\), and \(\beta_{i} \sim\) lognormal\((0,2)\). So it
takes a very small sample, highly heterogeneous treatment effects, and a
binary treatment variable to generate a finite sample bias that is
mitigated by randomization inference. Here’s a graph showing how
\(|p-p_{RI}|\) varies with \(Var(\beta_{i})\):</p>
<p><img src="https://michaelwiebe.com/assets/randinf/p_diff_xbern.png" alt="" width="100%" /></p>
<p>So it is possible for RI p-values to diverge substantially from conventional p-values, but it requires a pretty extreme scenario.</p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<p>See <a href="https://github.com/maswiebe/metrics/blob/main/randinf.r">here</a> for R code.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>With a properly-sized test, \(P(\)reject \(H_{0} \mid H_{0}) = \alpha\). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Another paper that uses a RI strategy is <a href="https://sci-hub.st/https://link.springer.com/article/10.1007/s10887-015-9116-1">Yao and Zhang (2015)</a>.
They use RI on a three-way fixed-effects model, regressing GDP
growth on leader, city, and year FEs. They give a similar rationale
for randomization inference:</p>
<blockquote>
<p>Our second robustness check is a placebo test that permutes leaders’
tenures. If our estimates of leader effects only picked up
heteroskedastic shocks, we would have no reason to believe that
these shocks would form a consistent pattern that follows the cycle
of leaders’ tenures. For that, we randomly permute each leader’s
tenures across years within the same city and re-estimate Eq. (1).
[...] We find that the F-statistic from the true data is either
Nos. 1 or 2 among the F-statistics from any round of permutation.
This result gives us more confidence in our baseline results.</p>
</blockquote>
<p><a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>As expected, since when \(\alpha=0.05\), P(at least one significant)
= \(1 -\)P(none significant) = \(1 - (1-0.05)^{20}\) = 0.64. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
How to p-hack a robust result2021-01-16T20:00:00+00:00http://michaelwiebe.com/blog/2021/01/phack<p>Economists want to show that our results are robust, like in Table 1 below: Column 1 contains the baseline model, with no covariates, and Column 2 controls for \(z\). Because the coefficient on \(X\) is stable and significant across columns, we say that our result is robust.</p>
<p><img src="https://michaelwiebe.com/assets/p-hack_robust/table1_small.png" alt="" width="90%" /></p>
<p>The twist: I p-hacked this result, using data where the true effect of
\(X\) is zero.</p>
<p>In this post, I show that it can be easy to p-hack a robust result like this.
Here’s the basic idea: first, p-hack a significant result by running regressions with many different treatment variables, where the true treatment effects are all zero. For 20 regressions, we expect to get one false positive: a result with \(p<0.05\). Then, using this significant treatment variable, run a second regression including a control variable, to see whether the result is robust to controls.</p>
<p>It turns out that the key to p-hacking robust results is to use control variables that have a low partial-\(R^{2}\). These variables don’t have much influence on our main coefficient when excluded from the regression, and also have little influence when included. In contrast, controls with high partial-\(R^{2}\) are more likely to kill a false positive.
Lesson: high partial-\(R^{2}\) controls are an effective robustness check against false positives.</p>
<h1 id="setup">Setup</h1>
<p>Let’s see how this works. Consider data for \(i=1, ..., N\) observations
generated according to</p>
\[\tag{1} y_{i} = \sum_{k=1}^{K} \beta_{k} X_{k,i} + \gamma z_{i} + \varepsilon_{i}.\]
<p>We have \(K\) potential treatment variables, \(X_{1,i}\)
to \(X_{K,i}\), and a control variable \(z_{i}\). I draw
\(X_{k,i} \sim N(0,1)\), \(z_{i} \sim N(0,1)\), and
\(\varepsilon_{i} \sim N(0,1)\), so that \(X_{k,i}\), \(z_{i}\), and
\(\varepsilon_{i}\) are all independent, but could be correlated in the
sample. I set \(\beta_{k}=0\) for all \(k\), so that \(X_{k}\) has no effect
on \(y\), and the true model is</p>
\[\tag{2} y_{i} = \gamma z_{i} + \varepsilon_{i}.\]
<p>I’m going to p-hack using the \(X_{k}\)’s, running \(K\) regressions and
selecting the \(k^{*}\) with the smallest p-value. I p-hack the baseline
regression of \(y\) on \(X_{k}\), by running \(K\) regressions of the form</p>
\[\tag{3} y_{i} = \alpha_{1,k} + \beta_{1,k}X_{k,i} + \nu_{i}.\]
<p>I use the ‘1’ subscript to indicate that this is the baseline model in
Column 1. Out of these \(K\) regressions, I select the \(k^{*}\) with the
smallest p-value on \(\hat{\beta}_{1,k}\). That is, I select the regression</p>
\[\tag{4} y_{i} = \alpha_{1,k^{*}} + \beta_{1,k^{*}}X_{k^{*},i} + \nu_{i}.\]
<p>When \(K\geq 20\), we expect \(\hat{\beta}_{1,k^{*}}\) to have \(p<0.05\),
since with a 5% significance level (i.e., false positive rate), the
average number of significant results is \(20\times0.05 = 1\). This is our
p-hacked false positive.</p>
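<p>The arithmetic behind this can be checked directly. The sketch below (hypothetical function names) assumes the \(K\) tests are independent, which holds only approximately here because all regressions share the same \(y\).</p>

```python
def expected_false_positives(k, alpha=0.05):
    """Expected number of significant results among k tests when every null is true."""
    return k * alpha

def prob_at_least_one(k, alpha=0.05):
    """Probability that at least one of k independent tests is significant under the null."""
    return 1 - (1 - alpha) ** k

if __name__ == "__main__":
    for k in (1, 5, 20, 50):
        print(k, expected_false_positives(k), round(prob_at_least_one(k), 3))
```

<p>So with \(K=20\), one false positive is expected on average, but a given batch of 20 regressions contains at least one only about 64% of the time.</p>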
<p>To get a robust sequence of regressions, I need my full model including
\(z\) to also have a significant coefficient on \(X_{k^{*},i}\). To test
this, I run my Column 2 regression:</p>
\[\tag{5} y_{i} = \alpha_{2,k^{*}} + \beta_{2,k^{*}}X_{k^{*},i} + \gamma z_{i} + \varepsilon_{i}.\]
<p>Given that we p-hacked a significant \(\hat{\beta}_{1,k^{*}}\), will
\(\hat{\beta}_{2,k^{*}}\) also be significant?</p>
<h1 id="homogeneous-beta0">Homogeneous \(\beta=0\)</h1>
<p>First, I show a case where p-hacked results are not robust. I use the
data-generating process from above with \(\beta=0\).</p>
<p>When regressing \(y\) on \(X_{k}\) in the p-hacking step, we have</p>
\[\tag{6} y_{i} = \alpha_{1,k} + \beta_{1,k} X_{k,i} + \nu_{i},\]
<p>where</p>
\[\tag{7} \begin{align}
\nu_{i} &= \sum_{j \neq k}^{K} \beta_{j} X_{j,i} + \gamma z_{i} + \varepsilon_{i} \\
&= \gamma z_{i} + \varepsilon_{i}.
\end{align}\]
<p>We estimate the slope coefficient as</p>
\[\tag{8} \hat{\beta}_{1,k} = \frac{\widehat{Cov}(X_{k},y)}{\widehat{Var}(X_{k})} = \frac{\gamma \widehat{Cov}(X_{k},z) + \widehat{Cov}(X_{k},\varepsilon)}{\widehat{Var}(X_{k})}.\]
<p>Since \(\beta=0\), we should only find a significant \(\hat{\beta}_{1,k}\)
due to a correlation between \(X_{k}\) and the components of the error
term \(\nu_{i}\): (1) \(\gamma \widehat{Cov}(X_{k},z)\), and (2) \(\widehat{Cov}(X_{k},\varepsilon)\).</p>
<p>When \(\gamma \widehat{Cov}(X_{k},z)\) is the primary driver of
\(\hat{\beta}_{1,k}\), controlling for \(z\) in Column 2 will kill the false
positive.</p>
<p>Turning to the full regression in Column 2, we get</p>
\[\begin{align} \tag{9}
\hat{\beta}_{2,k} &= \frac{\widehat{Cov}(\hat{u},y)}{\widehat{Var}(\hat{u})} = \frac{\widehat{Cov}((X_{k} - \hat{\lambda}_{1} z),\varepsilon)}{\widehat{Var}(\hat{u})} \\
&= \frac{\widehat{Cov}(X_{k},\varepsilon) - \hat{\lambda}_{1} \widehat{Cov}(z,\varepsilon)}{\widehat{Var}(\hat{u})}.
\end{align}\]
<p>This is from the two-step Frisch-Waugh-Lovell method, where we first
regress \(X_{k}\) on \(z\) (\(X_{k} = \lambda_{0} + \lambda_{1} z + u\)) and
take the residual
\(\hat{u} = X_{k} - \hat{\lambda}_{0} - \hat{\lambda}_{1} z\). Then we
regress \(y\) on \(\hat{u}\), using the variation in \(X_{k}\) that’s not due
to \(z\), and the resulting slope coefficient is \(\hat{\beta}_{2,k}\).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
We can see that controlling for \(z\) literally removes the
\(\gamma \widehat{Cov}(X_{k},z)\) term from our estimate.</p>
<p>Hence, to p-hack robust results, we want \(\hat{\beta}_{1,k}\) to be
driven by \(\widehat{Cov}(X_{k},\varepsilon)\), since that term is also in
\(\hat{\beta}_{2,k}\). If we have a significant result that’s not driven
by \(z\), then controlling for \(z\) won’t affect our significance.</p>
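<p>To make the Frisch-Waugh-Lovell step concrete, here is a small stdlib-only Python check (hypothetical function names) that the two-step residual regression gives the same coefficient on \(X_{k}\) as the closed-form two-regressor OLS formula.</p>

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (dividing by n; the scaling cancels in every ratio below)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def fwl_slope(x, y, z):
    """Step 1: regress x on z and keep the residual u. Step 2: regress y on u."""
    lam1 = cov(x, z) / cov(z, z)
    lam0 = mean(x) - lam1 * mean(z)
    u = [xi - lam0 - lam1 * zi for xi, zi in zip(x, z)]
    return cov(u, y) / cov(u, u)

def direct_slope(x, y, z):
    """Coefficient on x from regressing y on x, z, and a constant (closed form)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)

if __name__ == "__main__":
    rng = random.Random(0)
    n = 500
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [zi + rng.gauss(0, 1) for zi in z]      # gamma = 1, beta = 0
    print(fwl_slope(x, y, z), direct_slope(x, y, z))
```

<p>The two numbers agree up to floating-point error, which is why the Column 2 coefficient can be read as a regression of \(y\) on the part of \(X_{k}\) that is not explained by \(z\).</p>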
<h2 id="simulations">Simulations</h2>
<p>Setting \(K=20, N=1000\), and \(\gamma=1\), I perform \(1000\) replications of
the above procedure: I run 20 regressions, select the most significant
\(X_{k^{*}}\) and record the p-value on \(\hat{\beta}_{1,k^{*}}\), then add
\(z\) to the regression and record the p-value on \(\hat{\beta}_{2,k^{*}}\).
As expected when using a \(5\%\) significance level, I find that out of
the \(K\) regressions in the p-hacking step, the average share of
significant results is \(0.05\) (about one per set of 20). I find that \(\hat{\beta}_{1,k^{*}}\) is
significant in 663 simulations (=66%). But only 245 simulations (=25%)
have both a significant \(\hat{\beta}_{1,k^{*}}\) and a significant
\(\hat{\beta}_{2,k^{*}}\), meaning that only 37% (=245/663) of p-hacked
Column 1 results have a significant Column 2. So in the \(\beta=0\) case,
we infer that \(\widehat{Cov}(X_{k},\varepsilon)\) is small relative to
\(\gamma \widehat{Cov}(X_{k},z)\). With these parameters, it’s not easy to
p-hack robust results.</p>
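<p>The replication loop can be sketched as below. This is stdlib-only Python with hypothetical function names, not the linked R code; p-values use a normal approximation, Column 2 is computed by residualizing on \(z\) (Frisch-Waugh-Lovell), and the defaults use a smaller \(N\) and fewer replications than the post to keep it fast.</p>

```python
import math
import random

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def residualize(v, z):
    """Residuals from regressing v on z and a constant."""
    vc, zc = center(v), center(z)
    b = sum(a * c for a, c in zip(vc, zc)) / sum(c * c for c in zc)
    return [a - b * c for a, c in zip(vc, zc)]

def p_value(x, y, n_params):
    """Two-sided p-value for the slope of y on x (both already centered or residualized)."""
    n = len(x)
    sxx = sum(a * a for a in x)
    slope = sum(a * b for a, b in zip(x, y)) / sxx
    ssr = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(ssr / (n - n_params) / sxx)
    return math.erfc(abs(slope / se) / math.sqrt(2))   # normal approximation

def one_replication(n, k, gamma, rng):
    """Return (p-value in Column 1 for the p-hacked X, p-value in Column 2 after adding z)."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    y = [gamma * zi + rng.gauss(0, 1) for zi in z]     # true effect of every X is zero
    yc = center(y)
    best_p, best_x = 2.0, None
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]
        p = p_value(center(x), yc, n_params=2)
        if p < best_p:
            best_p, best_x = p, x
    p2 = p_value(residualize(best_x, z), residualize(y, z), n_params=3)
    return best_p, p2

def robust_share(sims, n, k, gamma, alpha=0.05, seed=0):
    """Among significant Column 1 results, the share still significant in Column 2."""
    rng = random.Random(seed)
    sig1 = both = 0
    for _ in range(sims):
        p1, p2 = one_replication(n, k, gamma, rng)
        if p1 < alpha:
            sig1 += 1
            both += p2 < alpha
    return both / sig1 if sig1 else float("nan")

if __name__ == "__main__":
    print("share of p-hacked results that survive controlling for z:",
          robust_share(sims=200, n=150, k=20, gamma=1.0))
```

<p>With \(\gamma=1\) the surviving share should be well below one half, in line with the 37% reported above; setting <code>gamma=0.0</code> makes \(z\) irrelevant, and nearly all false positives then survive.</p>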
<p><img src="https://michaelwiebe.com/assets/p-hack_robust/b0_shares_signif.png" alt="" width="90%" /></p>
<p>Figure 1 repeats this process for a range of \(\gamma\)’s. I plot the shares of
\(\gamma \widehat{Cov}(X_{k},z)\) and \(\widehat{Cov}(X_{k},\varepsilon)\)
in \(\hat{\beta}_{1,k}\).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> We see that when \(\gamma=0\),
\(\gamma \widehat{Cov}(X_{k},z)\) has 0 weight, but its share increases
quickly. Closely correlated with this share is the fraction of
significant results losing significance after controlling for \(z\).
Specifically, this is the fraction of simulations with a nonsignificant
\(\hat{\beta}_{2,k}\), out of the simulations with a significant
\(\hat{\beta}_{1,k}\). And even more tightly correlated with
\(\gamma \widehat{Cov}(X_{k},z)\) is the partial \(R^{2}\) of \(z\).<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
Intuitively, as \(\gamma\) increases, the additional improvement in model
fit from adding \(z\) also increases, which by definition increases
\(R^{2}(z)\). Hence, \(R^{2}(z)\) turns out to be a useful proxy for the
share of \(\gamma \widehat{Cov}(X_{k,i},z_{i})\), which we can’t calculate
in practice. Lesson: when partial-\(R^{2}(z)\) is large, controlling for
\(z\) is an effective robustness check for false positives. This is
because a large \(\gamma \widehat{Cov}(X_{k},z)\) implies both (1) a large
\(R^{2}(z)\); and (2) that \(z\) is more likely to be the source of the
false positive, and hence controlling for \(z\) will kill it. So now we
have a new justification for including control variables, apart from
addressing confounders: to rule out false positives driven by
coincidental sample correlations.</p>
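<p>For reference, the partial \(R^{2}\) of \(z\) can be computed as the proportional drop in the sum of squared residuals when \(z\) is added to the baseline regression. The sketch below is stdlib-only Python with hypothetical function names; the full-model residuals are obtained by residualizing on \(z\) first.</p>

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def residualize(v, z):
    """Residuals from regressing v on z and a constant."""
    vc, zc = center(v), center(z)
    b = sum(a * c for a, c in zip(vc, zc)) / sum(c * c for c in zc)
    return [a - b * c for a, c in zip(vc, zc)]

def ssr_after(y, x):
    """Sum of squared residuals from regressing y on x (inputs already centered)."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return sum((b - slope * a) ** 2 for a, b in zip(x, y))

def partial_r2(y, x, z):
    """Proportional reduction in SSR from adding z to a regression of y on x."""
    ssr_base = ssr_after(center(y), center(x))                    # y on x
    ssr_full = ssr_after(residualize(y, z), residualize(x, z))    # y on x and z
    return (ssr_base - ssr_full) / ssr_base

if __name__ == "__main__":
    rng = random.Random(0)
    n = 2000
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]
    for gamma in (0.0, 0.5, 1.0, 2.0):
        y = [gamma * zi + rng.gauss(0, 1) for zi in z]
        print(gamma, round(partial_r2(y, x, z), 3))
```

<p>Under this data-generating process the partial \(R^{2}\) is roughly \(\gamma^{2}/(1+\gamma^{2})\), so it rises from about 0 at \(\gamma=0\) to about 0.5 at \(\gamma=1\).</p>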
<h1 id="heterogeneous-beta_i-sim-n01">Heterogeneous \(\beta_{i} \sim N(0,1)\)</h1>
<p>However, you might think that \(\beta=0\) is not a realistic assumption.
As Gelman says: “anything that plausibly could have an effect will not
have an effect that is exactly zero.” So let’s consider the case of
heterogeneous \(\beta_{i}\), where each individual \(i\) has their own
effect drawn from \(N(0,1)\). For large \(N\), the average effect of \(X\) on
\(y\) will be 0, but this effect will vary by individual. This is a more
plausible assumption than \(\beta\) being uniformly 0 for everyone. And as
we’ll see, this also helps for p-hacking, by increasing the variance of
the error term.</p>
<p>Here we have data generated according to</p>
\[\tag{10} y_{i} = \sum_{k=1}^{K} \beta_{k,i} X_{k,i} + \gamma z_{i} + \varepsilon_{i},\]
<p>where \(\beta_{k,i} \sim N(0,1)\).</p>
<p>Then, when regressing \(y\) on \(X_{k}\), we have</p>
\[\tag{11} y_{i} = \alpha_{1,k} + \delta_{1,k} X_{k,i} + v_{i},\]
<p>where</p>
\[\tag{12} v_{i} = -\delta_{1,k} X_{k,i} + \beta_{k,i} X_{k,i} + \sum_{j \neq k}^{K} \beta_{j,i} X_{j,i} + \gamma z_{i} + \varepsilon_{i}.\]
<p>When effects are heterogeneous (i.e., we have \(\beta_{k,i}\) varying with
\(i\)), a regression model with a constant slope \(\delta_{1,k}\) is
misspecified. To emphasize this, I include \(-\delta_{1,k} X_{k,i}\) in
the error term.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
<p>The estimated slope coefficient is</p>
\[\tag{13} \begin{align}
\hat{\delta}_{1,k} &= \frac{\widehat{Cov}(X_{k,i},y_{i})}{\widehat{Var}(X_{k,i})} \\
&= \frac{\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i}) + \gamma \widehat{Cov}(X_{k,i},z_{i}) + \widehat{Cov}(X_{k,i},\varepsilon_{i})}{\widehat{Var}(X_{k,i})}
\end{align}\]
<p>From Aronow and Samii (2015), we know that the slope coefficient
converges to a weighted average of the \(\beta_{k,i}\)’s:</p>
\[\tag{14} \hat{\delta}_{1,k} \rightarrow \frac{E[w_{i} \beta_{k,i}]}{E[w_{i}]},\]
<p>where \(w_{i}\) are the regression weights: the squared residuals from regressing
\(X_{k}\) on the other controls. In this case, as we’re using a univariate
regression, the residuals are simply demeaned \(X_{k}\) (when regressing
\(X\) on a constant, the fitted value is \(\bar{X}\)), so \(w_{i} = (X_{k,i} - \bar{X}_{k})^{2}\).</p>
<p>Because \(\beta_{k,i} \sim N(0,1)\), we have \(E[w_{i} \beta_{k,i}]=0\) and
hence \(\hat{\delta}_{1,k}\) converges to 0. So any statistically
significant \(\hat{\delta}_{1,k}\) that we estimate will be a false
positive.</p>
<p>There are three terms that make up \(\hat{\delta}_{1,k}\) and could drive
a false positive: (1) \(\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i})\), (2) \(\gamma \widehat{Cov}(X_{k},z)\), and (3) \(\widehat{Cov}(X_{k},\varepsilon)\).</p>
<p>Now we have a new source of false positives, case (1), due to
heterogeneity in \(\beta_{k,i}\). Note that controlling for \(z\) will only
affect one out of three possible drivers, so now we should expect our
false positives to be more robust to control variables, compared to when
\(\beta=0\). To see this, note that when controlling for \(z\) in the full
regression, we have</p>
\[\tag{15} \begin{align}
\hat{\delta}_{2,k} &= \frac{\widehat{Cov}(\hat{u}_{i},y_{i})}{\widehat{Var}(\hat{u}_{i})} \\
&= \frac{\sum_{j=1}^{K} \widehat{Cov}(X_{k,i} - \hat{\lambda}_{1} z_{i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(X_{k,i} - \hat{\lambda}_{1} z_{i},\varepsilon_{i})}{\widehat{Var}(\hat{u}_{i})} \\
&= \frac{\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(X_{k,i},\varepsilon_{i})}{\widehat{Var}(\hat{u}_{i})} \\
&- \hat{\lambda}_{1} \frac{\left[ \sum_{j=1}^{K} \widehat{Cov}(z_{i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(z_{i},\varepsilon_{i}) \right] }{\widehat{Var}(\hat{u}_{i})}
\end{align}\]
<p>Here \(\hat{u}\) is the residual from a regression of
\(X_{k}\) on \(z\): \(X_{k} = \lambda_{0} + \lambda_{1} z + u\). We obtain
\(\hat{\delta}_{2,k}\) by regressing \(y\) on \(\hat{u}\), via FWL, and using
the variation in \(X_{k}\) that’s not due to \(z\).</p>
<p>Comparing \(\hat{\delta}_{1,k}\) to \(\hat{\delta}_{2,k}\), we see that
\(\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i}) + \widehat{Cov}(X_{k,i},\varepsilon_{i})\)
shows up in both estimates. Hence, if our p-hacking selects for a
\(\hat{\delta}_{1,k}\) with a large value of these terms, we’re also
selecting for the majority of the components of \(\hat{\delta}_{2,k}\). In
contrast to the \(\beta=0\) case, now we should expect
\(\gamma \widehat{Cov}(X_{k,i},z_{i})\) to be dominated, and significance
in Column 1 should carry over to Column 2.</p>
<h2 id="simulations-1">Simulations</h2>
<p>I repeat the same procedure as before, running \(K=20\) regressions of \(y\)
on \(X_{k}\), taking the \(X_{k}\) with the smallest p-value,
\(X_{k^{*}}\), and then running another regression that adds \(z\).
Again, I use \(\gamma=1\) and perform \(1000\) replications. Here I use
robust standard errors to address heteroskedasticity.</p>
<p>I find that \(\hat{\delta}_{1,k^{*}}\) is significant in 650 simulations
(=65%). But this time, 569 simulations (=57%) have both a significant
\(\hat{\delta}_{1,k^{*}}\) and a significant \(\hat{\delta}_{2,k^{*}}\). So
88% (=569/650) of p-hacked Column 1 estimates also have a significant
Column 2. Compare this to 37% in the \(\beta=0\) case. That’s what I call
p-hacking a robust result! We infer that
\(\gamma \widehat{Cov}(X_{k,i},z_{i})\) is too small relative to the other
components for its presence or absence to affect our estimates very
much.</p>
<p>To illustrate how \(\hat{\delta}_{1,k}\) is determined, I plot the shares
of its three constituent terms while varying \(\gamma\).<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<p><img src="https://michaelwiebe.com/assets/p-hack_robust/bhet_shares.png" alt="" width="90%" /></p>
<p>As shown in Figure 2, when \(\gamma\) is small, most of the weight in
\(\hat{\delta}_{1,k}\) is from
\(\sum_{j=1}^{K} \widehat{Cov}(X_{k,i},\beta_{j,i}X_{j,i})\), indicating
that its \(K\) terms provide ample opportunity for correlations with
\(X_{k^{*},i}\). But as \(\gamma\) increases, this share falls, while the
share of \(\gamma \widehat{Cov}(X_{k,i},z_{i})\) rises linearly. The share
of \(\widehat{Cov}(X_{k,i},\varepsilon_{i})\) is small and decreases
slightly. Looking at robustness, we see that the fraction of significant
results losing significance rises much more slowly than in the \(\beta=0\)
case. And we again see a tight link between partial-\(R^{2}(z)\) and the
share of \(z\) in \(\hat{\delta}_{1,k}\).<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>Overall, we can see why controlling for \(z\) is less effective with
heterogeneous effects: \(\hat{\delta}_{1,k}\) is mostly <em>not</em> determined
by \(\gamma \widehat{Cov}(X_{k,i},z_{i})\), so removing it (by controlling
for \(z\)) has little effect. In other words, when variables have low
partial-\(R^{2}\), controlling for them won’t affect false positives.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In general, economists think about robustness in terms of addressing
potential confounders. I haven’t seen any discussion of robustness to
false positives based on coincidental sample correlations. This is
possibly because it seems hopeless: we always have a 5% false positive
rate, after all. But as I’ve shown, adding high partial-\(R^{2}\) controls
is an effective robustness check against p-hacked false positives.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>
So we have a new weapon to combat false positives: checking whether a
result remains significant as high partial-\(R^{2}\) controls are added to
the model.</p>
<h2 id="footnotes">Footnotes</h2>
<p>See <a href="https://github.com/maswiebe/metrics/blob/main/p-hack.r">here</a> for R code. Click <a href="https://michaelwiebe.com/assets/p-hack_robust/p-hack.pdf">here</a> for a pdf version of this post.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>\(\widehat{Cov}(\hat{u},y) = \widehat{Cov}(\hat{u},\gamma z + \varepsilon) = \gamma \widehat{Cov}(\hat{u},z) + \widehat{Cov}(\hat{u},\varepsilon) = 0 + \widehat{Cov}(\hat{u},\varepsilon)\),
since the residual \(\hat{u}\) is orthogonal to \(z\). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Note that these terms can be negative, so this is not strictly a
share in \([0,1]\). When the terms in the denominator almost cancel
out to 0, we get extreme values. Hence, for each \(\gamma\), I take
the median share across all simulations, which is well-behaved. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>\(R^{2}(z) = \frac{\sum \hat{u}_{i}^{2} - \sum \hat{v}_{i}^{2}}{\sum \hat{u}_{i}^{2}}\),
where \(\hat{u}_{i}\) is the residual from the baseline model, and
\(\hat{v}_{i}\) is the residual from the full regression (where we
control for \(z\)). In other words, partial \(R^{2}(z)\) is the
proportional reduction in the sum of squared residuals from adding
\(z\) to the model. See also the <a href="https://www.rdocumentation.org/packages/asbio/versions/1.6-7/topics/partial.R2">R function</a> and <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination#Coefficient_of_partial_determination">Wikipedia</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>We could write
\(\beta_{k,i} = \bar{\beta}_{k} + (\beta_{k,i}-\bar{\beta}_{k}) := b_{k} + b_{k,i}\),
and then have \(y_{i} = \alpha_{1,k} + b_{k} X_{k,i} + v_{i}\), with
\(v_{i} = b_{k,i}X_{k,i} + \sum_{j \neq k}^{K} \beta_{j,i} X_{j,i} + \gamma z_{i} + \varepsilon_{i}\).
However, \(\hat{b}_{k}\) does not generally converge to
\(b_{k} = \bar{\beta}_{k}\), as I discuss below. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Similar results hold when varying \(Var(\beta_{i})\) or
\(Var(\varepsilon)\). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Note that the overall \(R^{2}\) in Column 1 is irrelevant. For
\(\alpha=0.05\), we will always have a false positive rate of 5% when
the null hypothesis is true. Controlling for \(z\) is effective when
\(\gamma \widehat{Cov}(X_{k,i},z_{i})\) has a large share in
\(\hat{\delta}_{1,k}\). And a large share also means that \(R^{2}(z)\)
is large. This is true whether the overall \(R^{2}\) is 0.01 or 0.99,
since partial \(R^{2}\) is defined in relative terms, as the decrease
in the sum of squared residuals relative to a baseline model. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Note that this holds regardless of which regression is p-hacked.
Here, I’ve p-hacked the baseline regression. But the results are
actually identical when you work backwards, p-hacking the full
regression and then excluding \(z\). <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Why we shouldn't take p-values literally2020-11-04T23:20:00+00:00http://michaelwiebe.com/blog/2020/11/pvalues<p>PSA: if you read a paper claiming p<0.01, you shouldn’t automatically take that p-value literally.</p>
<p>By definition, a p-value is the probability of getting a result at least as extreme as the one in your sample, assuming the null hypothesis H0 is true. In other words, assuming H0 is true, if you collected many more samples and ran the same testing procedure on each, you’d expect a fraction p of the results to be at least as extreme as the original result. Notice: this requires running the exact same test on every sample.</p>
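<p>To make the definition concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration rather than something taken from a real paper: the test is a two-sided z-test of a zero mean with known unit variance, the original result is a hypothetical z-statistic of 2.2, and the names <code>z_stat</code>, <code>two_sided_p</code>, and <code>share_at_least_as_extreme</code> are made up. It draws many new samples with H0 true, runs the same test on each, and compares the share of results at least as extreme as the original with the original p-value.</p>

```python
import random
from statistics import NormalDist, fmean


def z_stat(sample):
    """z statistic for H0: mean = 0, assuming a known variance of 1."""
    return fmean(sample) * len(sample) ** 0.5


def two_sided_p(z):
    """Two-sided p-value of a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_at_least_as_extreme(z_original, n=30, n_sims=10_000, seed=0):
    """Share of new samples, drawn with H0 true, whose z statistic is
    at least as extreme as the original one. The same test is run on
    every sample."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        if abs(z_stat(sample)) >= abs(z_original):
            hits += 1
    return hits / n_sims


z_original = 2.2  # hypothetical original result
p_original = two_sided_p(z_original)
share = share_at_least_as_extreme(z_original)
print(f"original p-value: {p_original:.4f}")
print(f"simulated share at least as extreme: {share:.4f}")
```

<p>Up to simulation noise, the two numbers should agree. The agreement depends on <code>z_stat</code> being applied identically to every sample.</p>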
<p>This is equivalent to preregistering your test. If, instead, you first looked at the data before designing your test (to check for correlations, try different dependent variables, or even engage in outright p-hacking), you would violate the definition. In this data-dependent testing procedure, for every new sample you collect, you look at the data and then choose a test. That is, for every new sample you would potentially run a different test, which contradicts the requirement to run the same test.</p>
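<p>A rough sketch of what data-dependent testing does to the false positive rate, under assumptions of my own choosing: each sample has 10 candidate dependent variables, all pure noise and independent of each other, and each is tested with a two-sided z-test with known unit variance. The function names are hypothetical. The "fixed" researcher always tests the first outcome; the "forked" researcher looks at all 10 and reports the smallest p-value.</p>

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, known variance of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_outcomes=10, n=20, n_sims=4_000,
                         alpha=0.05, seed=0):
    """Return (fixed_rate, forked_rate) when H0 is true for every outcome.

    fixed:  the test is chosen in advance (always the first outcome).
    forked: the test is chosen after seeing the data (smallest p-value).
    """
    rng = random.Random(seed)
    fixed_hits = 0
    forked_hits = 0
    for _ in range(n_sims):
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n)])
            for _ in range(n_outcomes)
        ]
        fixed_hits += int(p_values[0] < alpha)
        forked_hits += int(min(p_values) < alpha)
    return fixed_hits / n_sims, forked_hits / n_sims


fixed_rate, forked_rate = false_positive_rates()
print(f"prespecified test:   share with p < 0.05 = {fixed_rate:.3f}")
print(f"data-dependent test: share with p < 0.05 = {forked_rate:.3f}")
```

<p>With 10 independent outcomes, the data-dependent rate should land near 1 - 0.95^10, or about 0.40, while the prespecified rate stays near 0.05. The reported "p&lt;0.05" in the forked case is not the probability the definition refers to.</p>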
<p>Hence, the burden of proof lies with the researcher to convince us that, given a new sample, they would run the same test. Preregistration is the easiest way to do this: it guarantees that the researcher chose their test before looking at the data. But if the hypothesis is not preregistered, and looks like the <a href="https://stat.columbia.edu/~gelman/research/published/ForkingPaths.pdf">garden of forking paths</a>, then we should feel free to reject their p-value as invalid.</p>
<p>For example, suppose a paper reports a null average effect and a poorly motivated p<0.01 interaction effect. This looks suspiciously like they started out expecting an average effect, didn’t find one, scrambled to find something, anything, with three stars, and then made up a rationalization for why that interaction was what they were planning all along. But obviously this is data-dependent testing. We have little reason to believe that the researcher would run the same test on other samples. Hence, we also have little reason to take the p-value seriously.</p>
<p>Lesson: consumers of non-preregistered research need to do more than just look at the p-value. We also have to make a judgment call about whether that p-value is well-defined.</p>