
A Response That Isn’t


Chad Wellmon, Andrew Piper, and Yuancheng Zhu

 

The post by Jordan Brower and Scott Ganz is less a response than an attempt to invalidate by suggestion. Debate over the implications of specific measurements or over problems in data collection is essential to scholarly inquiry. But their post fails to deliver empirical evidence for their main argument: that our descriptions of gender bias and the concentration of institutional prestige in leading humanities journals should be met with deep doubts. Their strategy of argumentation is to instill doubt via suspicion rather than to achieve clarity about the issues at stake. They do so by proposing strict disciplinary hierarchies and methodological fault lines as a means of invalidating empirical evidence.

Yet as we will show, their claims are based on a misrepresentation of the essay’s underlying arguments; unqualified overstatements of the invalidity of one measure used in the essay; and the use of anecdotal evidence to disqualify the study’s data. Under the guise of empirical validity, their post conceals its own interpretive agenda and plays the very game of institutional prestige that our article seeks to understand and bring to light.

We welcome and have already received pointed criticisms and incisive engagements from others. We will continue to incorporate these insights as we move forward with our work. We agree with Brower and Ganz that multiple disciplinary perspectives are warranted to fully understand our object of study. For this reason we have invited Yuancheng Zhu, who holds a PhD in statistics and is now a research fellow at the Wharton School of the University of Pennsylvania, to review our findings and offer feedback.

With respect to the particular claims Brower and Ganz make, we will show:

  1. they address only a portion of––and only two of seven total tables and figures in––an article whose findings they wish to refute;
  2. their proposed heterogeneity measure is neither mathematically more sound nor empirically sufficient to invalidate the measure we chose to prioritize;
  3. their identification of actual errors in the data set does not invalidate the statistical significance of our findings;
  4. their anecdotal reasoning is ultimately deployed to defend a notion of “quality” as an explanation of extreme institutional inequality, a defense for which they present no evidence.

1. Who Gets to Police Disciplinary Boundaries?

Brower and Ganz argue that our essay belongs to the social sciences and, therefore, that neither the humanities nor the field to which it actually aspires to belong, cultural analytics, has a legitimate claim to the types of argument, evidence, and knowledge that we draw upon. Such boundary keeping is one of the institutional norms we hoped to put into question in our essay, because it is a central strategy for establishing and maintaining epistemic authority.

But Brower and Ganz’s boundary policing is self-serving. Although they identify the entire essay as “social science,” they only discuss sections that account for roughly 35 percent of our original article and only two of the seven figures and tables presented as evidence. Our essay sought to address a complex problem, and so we brought together multiple ways of understanding it, from historical and conceptual analysis to contemporary data, in order to better understand institutional diversity and gender bias. Brower and Ganz ignored the majority of our essay and yet sought to invalidate it in its entirety.

2. Claiming that HHI Is “Right” Is Wrong.

Brower and Ganz focus on two different methods of measuring inequality as discussed in our essay, and they suggest that our choice of method undermines our entire argument. In the process, they imply that we did not ourselves use two different measures or discuss HHI (or address other issues such as gender bias). They also omit seven other possible measures we could have used. In other words, they present a single statistical measure as a direct representation of reality rather than as one method among several for modeling a challenging concept.

If we view the publication status of each year as a probability distribution over institutions, then coming up with a single score is simply an attempt to summarize a multidimensional object with one number. Doing so inevitably loses information, no matter how one chooses to do it. Just as the mean, median, or mode summarizes the central position of a sample or a distribution, the type-token score or the HH index summarizes “heterogeneity” from a particular perspective. Brower and Ganz call the use of the type-token ratio a “serious problem,” but in most circumstances one would not call using the mean rather than the median to summarize data a serious problem.

If there is not a single appropriate score to use, which one should we choose? The first question is what assumptions we are trying to model. The type-token ratio we used assumes that the ratio of institutions to articles is a good representation of inequality. The small number of institutions represented across articles suggests that there is a lack of diversity in the institutional landscape of publishing. The HH index looks at the market share of each actor (here, institutions), so that the more articles that an institution commands, the more concentrated the “industry” is thought to be. Because the HH index is typically used to measure financial competitiveness, it is based on the assumption that simply increasing the number of actors in the field decreases the inequality among institutional representation––that is, that more companies means more competitiveness. But as we argue in our piece, this is not an assumption we wanted to make.

Here is a demonstration of the problem drawn from the email correspondence from Ganz that we received prior to publication:

For example, imagine in year 1, there are 10 articles distributed equally among five institutions. Your heterogeneity metric would provide a score of 5/10 = 0.5.

Then in year 2, there are 18 articles distributed equally among six institutions. We would want this to be a more heterogeneous population (because inequality has remained the same, but the total number of institutions has increased). However, according to your metrics, this would indicate less heterogeneity (6/18 = 0.33).

In our case, we do not actually want the second example to suggest greater heterogeneity. In effect the number of articles has increased by 80 percent, but the number of institutions by only 20 percent. In our view heterogeneity has decreased in this scenario, not increased. A larger number of institutions (the actors in the model) is not, for us, an inherent good. It is the ratio of institutions to articles that matters most to us.
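To make the disagreement between the two measures concrete, here is a minimal sketch (our own illustration, not code from the original analysis) that computes both scores for the two scenarios in Ganz’s example. The function and variable names are ours.

```python
# A minimal sketch comparing the type-token ratio and the Herfindahl-Hirschman
# Index (HHI) on the two toy scenarios from Ganz's email. Illustration only.

from collections import Counter

def type_token_ratio(institutions):
    """Distinct institutions ("types") divided by number of articles ("tokens")."""
    return len(set(institutions)) / len(institutions)

def hhi(institutions):
    """Sum of squared shares; higher values mean a more concentrated field."""
    counts = Counter(institutions)
    total = len(institutions)
    return sum((n / total) ** 2 for n in counts.values())

# Year 1: 10 articles spread equally over 5 institutions.
year1 = [f"inst_{i}" for i in range(5)] * 2
# Year 2: 18 articles spread equally over 6 institutions.
year2 = [f"inst_{i}" for i in range(6)] * 3

print(type_token_ratio(year1), type_token_ratio(year2))  # 0.5 vs 0.33: less heterogeneous
print(hhi(year1), hhi(year2))                            # 0.2 vs 0.17: less concentrated
```

The two measures pull in opposite directions on exactly this kind of case: the type-token ratio falls from year 1 to year 2, while the HHI also falls, which under its logic means the field has become more competitive.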

The second way to answer the question is to understand the extent to which each measure would (or would not) represent the underlying distributions of the data in different ways. Assuming that the number of articles for each journal is relatively similar each year, the type-token score and the HH index actually belong to the same class of metrics, the Rényi entropy. The HH index is equivalent to the entropy with alpha equal to 2 (ignoring the log and the constant), and the type-token score corresponds to alpha equal to 0 (it is the log of the number of “types,” assuming that the number of tokens is relatively constant). To put it more mathematically, the HH index corresponds to the L2 norm of the probability vector, and the type-token score corresponds to the L0 norm. Given that the L1 norm of the probability vector is 1 (probabilities sum to 1), the HH index and the type-token score tend to be negatively correlated. There is, then, not much of a difference between the options. Another special case is alpha equal to 1, which gives the usual (Shannon) entropy.
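For readers who want the formulas behind this, the Rényi entropy of a probability vector p = (p_1, …, p_k) and the three special cases mentioned above can be written as follows:

```latex
H_\alpha(p) \;=\; \frac{1}{1-\alpha}\,\log\!\Big(\sum_{i} p_i^{\alpha}\Big),
\qquad
\begin{aligned}
H_0(p) &= \log \#\{\,i : p_i > 0\,\} && \text{(log of the number of ``types'')},\\
H_1(p) &= -\textstyle\sum_i p_i \log p_i && \text{(Shannon entropy, the limit as } \alpha \to 1\text{)},\\
H_2(p) &= -\log \textstyle\sum_i p_i^2 \;=\; -\log(\mathrm{HHI}).
\end{aligned}
```

Here p_i is an institution’s share of a year’s articles, so the type-token score tracks H_0 (up to the roughly constant number of articles) and the HH index tracks H_2.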

A big assumption here is that the number of articles each year stays relatively constant. It is also debated and debatable which of the two (the TT score or the HH index) is more sensitive to sample size. If we look at the article distributions for each journal, the assumption of a constant number of articles is in this case a fair one to make. Once a journal begins publishing, its number of articles per year stays relatively stable in scale. It is indeed the case that sample size will affect both metrics, just as sample entropy is affected by sample size. One could eliminate the effect of sample size by randomly downsampling each year to the same number of articles (or perhaps aggregating neighboring years and then downsampling).
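A minimal sketch of that downsampling idea, using our own hypothetical data structure (a dictionary mapping each year to the list of institution labels for that year’s articles), might look like this; the `metric` argument could be either of the functions sketched above.

```python
# Sketch of per-year downsampling to control for sample size. Illustration only;
# the structure of `articles_by_year` is an assumption, not our actual pipeline.

import random

def downsampled_trend(articles_by_year, metric, n=30, reps=200, seed=0):
    """articles_by_year: dict mapping year -> list of institution labels.
    Returns year -> average of `metric` over `reps` random subsamples of size n,
    so that differences in yearly article counts do not drive the trend."""
    rng = random.Random(seed)
    trend = {}
    for year, insts in sorted(articles_by_year.items()):
        if len(insts) < n:        # skip years with too few articles to sample
            continue
        scores = [metric(rng.sample(insts, n)) for _ in range(reps)]
        trend[year] = sum(scores) / len(scores)
    return trend
```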

If the two metrics are similar, then why do they appear to tell different stories? In fact, upon further review they appear to be telling the same story. In figure 1, we see the two scores plotted for each journal. The first row shows the type-token scores for each of the four journals, red for institutions and blue for PhDs. The second row shows 1/HHI, the effective number of institutions. In none of the plots do we see the dramatic decrease of heterogeneity in the early years shown in figure 4 of the original essay or the consistently strong increase of heterogeneity that Ganz and Brower argue for. The first row and the second row agree with each other in terms of the general trend most of the time. This is because in our figure 4 and in Brower and Ganz’s replication, the four journals are aggregated. When two journals (Critical Inquiry and Representations) come into play in the late seventies and the eighties, the scores are dragged down because, on average, those two journals are less diverse. Hence, the two metrics do give us the same trend once the journals are disaggregated.

So when we pull apart the four journals, what story do they tell? If we run a linear regression model on each of the journals individually, since 1990 there has been either no change or a decline in heterogeneity for both measures (with one notable exception: PMLA for author institutions, which has increased). In other words, either nothing changes about our original argument, or things actually look worse from this perspective.
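The per-journal trend test is simply a linear regression of each journal’s yearly heterogeneity score on the year, restricted to 1990 onward. A sketch under assumed column names (not our actual pipeline):

```python
# Sketch of the post-1990 trend test for a single journal. The DataFrame layout
# (columns 'journal', 'year', 'tt_score') is a hypothetical assumption.

from scipy import stats

def post_1990_trend(df, journal, score_col="tt_score"):
    """df: a pandas DataFrame with one row per journal-year.
    Returns the fitted slope and p-value of score ~ year for 1990 onward."""
    sub = df[(df["journal"] == journal) & (df["year"] >= 1990)]
    fit = stats.linregress(sub["year"], sub[score_col])
    # A non-positive slope, or a large p-value, indicates no increase in heterogeneity.
    return fit.slope, fit.pvalue
```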

We were grateful to Brower and Ganz when they first shared their thinking about HHI, and we tried to acknowledge that gratitude in our essay, even while disagreeing with their assumptions. Understanding different models and different kinds of evidence is, we’d suggest, a central value of scholarship, whether in the social sciences or in the humanities. That is why we discussed the two measures together. But to suggest that the marginal differences between the scores invalidate an entire study is wrong. It is also not accurate to imply that we made this graph a centerpiece of our essay—“its most provocative finding,” in their words.

Consider how we frame our discussion of the time-based findings in our essay. We point out the competing ways of seeing this early trend and emphasize that post-1990 levels of inequality have remained unchanged. Here is our text:

Using a different measure such as the Herfindahl-Hirschman Index (HHI) discussed in note 37 above suggests a different trajectory for the pre-1990 data. According to this measure, prior to 1990 there was a greater amount of homogeneity in both PhD and authorial institutions, but no significant change since then. In other words, while there is some uncertainty surrounding the picture prior to 1990, an uncertainty that is in part related to the changing number of journal articles in our data, since 1990 there has been no significant change to the institutional concentrations at either the authorial or PhD level. It is safe to say that in the last quarter century this problem has not improved.

In other words, based on what we know, it is safe to say that the problem has remained unchanged for the past quarter century, though one could argue that in some instances it has gotten worse. If you turned the post-1990 data into a Gini coefficient, the degree of institutional inequality for PhD training would be 0.82, compared to a Gini of roughly 0.45 for U.S. income inequality. But for Brower and Ganz, this recent consistency is overshadowed by the earlier improvement that they detect. To insist that institutional diversity is improving is, at best, to miss the proverbial forest for the trees. At worst, it’s misleading. Their argument is something like: we know there has been no change to the extremely high levels of concentration for the past twenty-five years, but if you just add in the twenty years before that, then things have been getting better.
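For readers who want to check such a figure, the Gini coefficient can be computed directly from per-institution article counts. This is the standard formula, not our original code, and the counts in the example are hypothetical.

```python
# Standard Gini coefficient from a vector of per-institution article counts.
# Illustration only; the example counts below are made up, not our data.

import numpy as np

def gini(counts):
    """Gini coefficient of a non-negative vector of counts (0 = perfect equality)."""
    x = np.sort(np.asarray(counts, dtype=float))   # ascending order
    n = len(x)
    # G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n, with i running from 1 to n.
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1) / n

print(round(gini([40, 12, 5, 2, 1, 1, 1, 1, 1, 1]), 3))  # hypothetical counts
```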

Their second example, about the relative heterogeneity between journals, reflects a similar pattern: a legitimate concern about a potential effect in the data is blown up into a dramatic rebuttal not supported by the empirical results.

In their one example of PhD heterogeneity, they show that a random sample of articles always has PMLA with more diversity than Representations, and yet our measure shows that Representations exhibits more diversity than PMLA. What is going on here?

It appears Representations is being unfairly promoted in our model because it publishes so many fewer articles than PMLA (PMLA has more than twice as many articles as Representations). But notice how they choose to focus on the two most disparate journals to make their point. What about the rest of the categories?

Interestingly, when it comes to institutional diversity, the only difference that their proposed measure makes is to shift the relative ranking of Representations. What is troubling here is the fact that they chose not to show this similarity when they replicated our findings, which are shown here:

            Author                      PhD
        Ours        HHI             Ours        HHI
1       PMLA        PMLA            NLH         NLH
2       NLH         NLH             Rep         PMLA
3       Rep         CI              PMLA        CI
4       CI          Rep             CI          Rep

In other words, our essay overestimates one journal’s relative diversity. We agree that their example is valid, important, interesting, and worth using. But as an argument for invalidation it fails. How could their measure invalidate our broader argument when it reproduces all but one of our findings?

Given both measures’ strong correlation with article output, we would argue that the best recourse is to randomly sample from the pools to control for sample size rather than rely on yearly data. In this way we avoid both the trap of type-token ratios that are overly sensitive to sample size and the HHI assumption that more institutions are an inherent good. Doing so for 1,000 random samples (of 100 articles per sample), we replicate the rankings produced by the HHI score (ditto for a Gini coefficient). So Brower and Ganz are correct in arguing that we overrepresented Representations’ diversity, which should be the lowest of all the journals in both categories. We are happy to revise this in our initial essay. But to suggest, as they do, that this invalidates the study is a gross oversimplification.
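A sketch of that sampling procedure, with hypothetical variable names (each journal’s pool is simply the list of institution labels, one per article), is given below; it assumes every pool contains at least `n` articles.

```python
# Sketch of the size-controlled journal comparison: draw equal-sized samples
# from each journal's pool and rank journals by their average score.
# Illustration only; data structures and names are assumptions.

import random
from statistics import mean

def ranked_by_sampled_score(pools, metric, n=100, reps=1000, seed=0):
    """pools: dict mapping journal -> list of institution labels (one per article).
    Returns the journals sorted by their average sampled score, highest first.
    For a diversity measure (type-token ratio) high means more diverse; for a
    concentration measure (HHI) high means more concentrated, so flip the sort."""
    rng = random.Random(seed)
    avg = {
        journal: mean(metric(rng.sample(insts, n)) for _ in range(reps))
        for journal, insts in pools.items()
    }
    return sorted(avg, key=avg.get, reverse=True)
```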

3. When Is Error a Problem?

Brower and Ganz’s final major point is this: “We are concerned that the data, in its current state, is sufficiently error-laden to call into question any claims the authors wish to make.” This is indeed a major cause for concern. But Brower and Ganz provide little evidence for their sweeping claim.

Brower and Ganz are correct to point out errors in our data set, a data set that we made public months ago precisely in the hope that colleagues would help us improve it. This is indeed a nascent field, and we do not have the same long-standing infrastructures for data collection that the social sciences do. We’re learning, and we are grateful to have generous people reading our work in advance and helping contribute to the collective effort of improving data for public consumption.

As with the above discussion about HHI, the real question is, what is the effect of these errors in our data set? Is the data sufficiently “error-laden” to call into question any of the findings, as they assert? That’s a big claim, one that Brower and Ganz could have tested but chose not to.

We can address this issue by testing what effect random errors might have on our findings. This we can do in two ways. We can either remove an increasing number of articles to see what effect taking erroneous articles out of the data set might have, or we can randomly reassign labels according to a kind of worst-case-scenario logic. In the case of gender, this means flipping a gender label from its current state to its opposite. What if you changed an increasing number of people’s gender labels—how would that affect estimates of gender bias? In the case of institutional diversity, we can relabel articles with a random and completely erroneous university name (“University of WqX30i0Z”) to simulate what would happen if we mistakenly entered data that could not contribute to increased concentration (since we would choose a new fictitious university with every error). How many errors in the data set would be necessary before assumptions about inequality, gender bias, or change over time need to be reconsidered?
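Here is a minimal sketch of the two label-corruption procedures, with hypothetical field names; it is meant to illustrate the logic of the test rather than reproduce our exact code.

```python
# Sketch of the error-injection test: corrupt a growing fraction of records
# and recompute the metrics of interest. Illustration only.

import random

def flip_gender_labels(genders, frac, seed=0):
    """Flip a fraction `frac` of gender labels ("F" <-> "M") chosen at random."""
    rng = random.Random(seed)
    flipped = list(genders)
    for i in rng.sample(range(len(flipped)), int(frac * len(flipped))):
        flipped[i] = "F" if flipped[i] == "M" else "M"
    return flipped

def scramble_institutions(institutions, frac, seed=0):
    """Replace a fraction `frac` of institution labels with unique made-up
    universities, so the injected errors can never add to any real
    institution's share of articles."""
    rng = random.Random(seed)
    scrambled = list(institutions)
    for k, i in enumerate(rng.sample(range(len(scrambled)), int(frac * len(scrambled)))):
        scrambled[i] = f"University of Nowhere {k}"
    return scrambled
```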

Figure 2 shows the impact that these two types of errors have on three of our primary metrics. As we can see, removing even 50 percent of the articles from our data set has no impact on any of our measures. The results for gender bias and overall concentration are more sensitive to random errors. But here too it takes errors in more than 10 percent of articles (over 500 mistakes) before you see any appreciable shift (before the Gini drops below 0.8 for PhDs and 0.7 for authors). Gender equality is only achieved when you flip 49 percent of all authors to the opposite gender. And in no case does the problem look like it has been improving since 1990.

But what if those errors are more systematic—in other words, what if the errors they identify are not random but have a particular quality about them (for example, if everyone wrongly included had actually gone to Harvard)? So let’s take a look. Here are the errors they identify:

  • 100 mislabeled titles
  • twenty-three letters that should not be considered publications
  • one omitted article that was published but not included because it was over our page filter limit
  • eight articles that appear in duplicate and one in triplicate
  • one mislabeled gender (sorry Lindsay Waters)

First, consider those 100 mislabeled titles. We were not counting titles, but rather institutional affiliations. While they do matter for the record (and we have corrected them; the corrected titles will appear in the revised version of our publicly available data set), they have little bearing on our findings.

In terms of duplicates, all but one duplicate occurred because authors have multiple institutional affiliations. We have clarified this by adding article IDs and a long document explaining all instances of duplicates, which will be included with the revised data.

So what about those letters? Actually, the problem is worse than Brower and Ganz point out. We inadvertently included a number of texts below the six-page filter we had set as our definition of an article. We are thankful that Brower and Ganz have helped identify this error. After a review of our dataset, we found 251 contributions that did not meet our article threshold. These were extremely short documents (one or two pages), such as forums, roundtables, and letters that should not have been included.

So, do these errors call into question our findings? How do they impact the overall results?

Here is a list of the major findings before and after we cleaned our dataset:

 

                                         Before      After

Gini coefficient
    PhD institution                      0.816       0.816
    Author institution                   0.746       0.743

Diversity over time (since 1990)
    # cases of decrease                  3           3
    # cases of no change                 5           5
    # cases of increase                  0           0

Journal Diversity Ranking                PMLA        PMLA
                                         NLH         NLH
                                         CI          CI
                                         Rep         Rep

Gender Bias (% Women)
    4 Journal Yearly Mean                30.4%       30.7%
    4 Journal Yearly Mean Since 2010     39.4%       39.5%

Finally, they say we have failed to adequately define our problem, once again invalidating the whole undertaking: “Wellmon and Piper fail to adequately answer the logically prior question that undergirds their study: what is a publication?”

What is a publication, indeed? And why and how did printed publication come to be the arbiter of scholarly legitimacy and authority in the modern research university? We think these are important and “logically prior” questions as well, and that is why we devoted the first 3,262 words of our essay to considering them. This hardly exhausts what is a complex conceptual problem, but to suggest that we didn’t consider it is disingenuous.

So let’s start by granting Brower and Ganz their legitimate concern. Confronted with the historical and conceptual difficulty of defining a publication, we made a heuristic choice. For the purposes of our study, we defined an article as a published text of six pages or more in length. It would be interesting to focus on a narrower definition of “publication,” as a “research article” of a specified length that undergoes a particular type of review process across time and publications. But that in no way reflects the vast bulk of “publications” in these journals. Imposing norms that might be better codified in other fields, Brower and Ganz’s desired definition overlooks the very real inconsistencies that surround publication and peer review practices in the humanities generally and in these journals’ histories in particular. As with their insistence on a single measure, they ask for a single immutable definition of a publication for a historical reality that is far more varied than their definition accounts for. Their insistence on definitional clarity is historically anachronistic and disciplinarily incongruous. It is precisely this absence of consensus and self-knowledge within humanities scholarship––and the consequences of such non-knowledge––that our piece aims to bring to light.

Clearly more work can be done here. Subsetting our data by other parameters and testing the extent to which this affects our findings would indeed be helpful and insightful. And we welcome more collaboration to continue to remove errors from the dataset. In fact, after the publication of our essay, Jonathan Goodwin kindly noted anomalies in our PhD program size numbers, which, when adjusted, change the correlation between program size and article output from 0.358 to 0.541.

4. Is Quality Measurable?

In sum, we readily concede that the authors raise legitimate concerns about the quality and meaning of different measures and how they might, or might not, tell different stories about our data. This is why we discuss them in our piece in the first place. We also appreciate that they have drawn our attention to errors in the dataset. We would be surprised if there were none. The point of statistical inference is to make estimates of validity given assumptions about error.

What we do not concede is that any of these issues makes the problem of institutional inequality and gender disparity in elite humanities publishing disappear. None of the issues Ganz and Brower raise invalidates or even undermines the basic findings surrounding our sample of contemporary publishing––that scholarly publishing in these four prestige humanities journals is massively concentrated in the hands of a few elite institutions, that most of the journals do not have gender parity in publishing, and that the situation has not improved in the past quarter century.

There are many ways to think about what to do about this problem. And here the authors are on even shakier evidentiary ground. We make no claims in our piece about what the causes of this admittedly complex problem might be. “Where’s the test for quality?” they ask. This is precisely something we did not test because the data in its current form does not allow for such inferences. In this first essay, which is part of a longer-term project, we simply want readers to be aware of the historical context of academic publication in the humanities and introduce them (and ourselves) to its current state of affairs for this limited sample.

Ganz and Brower, by contrast, assume, in their response at least, that quality––a concept for which they provide no definition and no measure––is the cause of the institutional disparity we found. They suggest that blind peer review is the most effective guarantor of this nebulous concept called “quality.” They provide no evidence for their claims. But there is strong counterevidence that peer review is not, in fact, as robust a control mechanism as the authors wish to insinuate. For a brief taste of how complex and relatively recent peer review is, we would recommend Melinda Baldwin’s Making “Nature”: The History of a Scientific Journal as well as studies of other fields such as Rebecca M. Blank’s “The Effects of Double-Blind versus Single-Blind Reviewing: Experimental Evidence from The American Economic Review” or Amber E. Budden’s “Double-Blind Review Favours Increased Representation of Female Authors.”[1]

These are complicated issues with deep institutional and epistemic consequences. It is neither analytically productive nor logically coherent to conclude, as Ganz and Brower do, that because high prestige institutions are disproportionately represented in high prestige publications, high prestige institutions produce higher quality scholarship. It is precisely this kind of circular logic that we hope to question before asserting that the status quo is the best state of affairs.

In our essay we simply argue that whatever filter we in the humanities are using to adjudicate publication systems (call it patronage, call it quality, call it various versions of blind peer review, call it “Harvard and Yale PhDs are just smarter”) has been remarkably effective at maintaining both gender and institutional inequality. This is what we have found. We would welcome a debate about the causes and the competing goods that various filtering systems must inevitably balance. This is precisely the type of debate our article hoped to provoke. But Brower and Ganz sought to invalidate our arguments and findings by anecdote and quantitative obfuscation. And the effect, intended or not, is an argument for the status quo.

[1] See Melinda Baldwin, Making “Nature”: The History of a Scientific Journal (Chicago, 2015); Rebecca M. Blank, “The Effects of Double-Blind versus Single-Blind Reviewing: Experimental Evidence from The American Economic Review,” American Economic Review 81 (Dec. 1991): 1041–67; and Amber E. Budden et al., “Double-Blind Review Favours Increased Representation of Female Authors,” Trends in Ecology and Evolution 23 (Jan. 2008): 4–6.

Chad Wellmon is associate professor of German studies at the University of Virginia. He is the author, most recently, of Organizing Enlightenment: Information Overload and the Invention of the Modern Research University and coeditor of Rise of the Research University: A Sourcebook. He can be reached at mcw9d@virginia.edu. Andrew Piper is professor and William Dawson Scholar of Languages, Literatures, and Cultures at McGill University. He is the director of .txtLAB, a digital humanities laboratory, and author of Book Was There: Reading in Electronic Times. Yuancheng Zhu holds a PhD in statistics and is a research fellow at the Wharton School of the University of Pennsylvania.