Psychology’s favourite tool for measuring implicit bias is still mired in controversy

A new review aims to move on from past controversies surrounding the implicit association test, but experts can’t even agree on what they’re arguing about

By guest blogger Jesse Singal

It has been a long and bumpy road for the implicit association test (IAT), the reaction-time-based psychological instrument that its co-creators, Mahzarin Banaji and Anthony Greenwald, along with others in their orbit, claimed measures test-takers’ levels of unconscious social bias and their propensity to act in a biased and discriminatory manner, be that via racism, sexism, ageism, or some other category, depending on the context. The test’s advocates claimed this was a revelatory development, not least because the IAT supposedly measures aspects of an individual’s bias beyond what that individual is consciously aware of.

As I explained in a lengthy feature published on New York Magazine’s website last year, many doubts have emerged about these claims. They range from the question of what the IAT is really measuring (as in, can a reaction-time difference measured in milliseconds really be considered, on its face, evidence of real-world-relevant bias?), to the algorithms used to generate scores, to, perhaps most importantly given that the IAT has become a mainstay of a wide variety of diversity training and educational programmes, whether the test really does predict real-world behaviour.

On that last key point, there is surprising agreement. In 2015 Greenwald, Banaji, and their coauthor Brian Nosek stated that the psychometric issues associated with various IATs “render them problematic to use to classify persons as likely to engage in discrimination”. Indeed, these days IAT evangelist and critic alike mostly agree that the test is too noisy to usefully and accurately gauge people’s likelihood of engaging in discrimination, a finding supported by a series of meta-analyses showing unimpressive correlations between IAT scores and behavioural outcomes (mostly in labs). Race IAT scores appear to account for only about 1 per cent of the variance in measured behavioural outcomes, reports an important meta-analysis available in preprint, co-authored by Nosek. (That meta-analysis also looked at IAT-based interventions, finding that while implicit bias as measured by the IAT “is malleable… changing implicit bias does not necessarily lead to changes in explicit bias or behavior.”)

So where does this leave the IAT? In a new paper in Current Directions in Psychological Science called “The IAT Is Dead, Long Live the IAT: Context-Sensitive Measures of Implicit Attitudes Are Indispensable to Social and Political Psychology”, John Jost, a social psychologist at New York University and a leading IAT researcher, seeks to draw a clear line between the “dead” diagnostic version of the IAT, and what he sees as the test’s real-world version – a sensitive, context-specific measure that shouldn’t be used for diagnostic purposes, but which has potential in various research and educational contexts.

Does this represent a constructive manifesto for the future of this controversial psychological tool? Unfortunately, I don’t think it does – rather, it contains many confusions, false claims, and strawman arguments (as well as a misrepresentation of my own work). Perhaps most frustrating, Jost joins a lengthening line of IAT researchers who, when faced with the fact that the IAT appears to have been overhyped for a long time by its creators, its most enthusiastic proponents, and journalists, respond with an endless variety of counterclaims that don’t quite address the core issue itself, or which pretend those initial claims were never made in the first place.

Take this section, in which, referencing a number of papers, Jost writes that,

It is often claimed that the IAT simply measures familiarity with or awareness of cultural stereotypes rather than personal animus… [T]he question of whether implicit attitudes reflect personal preferences as opposed to social and cultural processes is ill-posed. (We also know from decades of research on the “mere-exposure effect” that familiarity breeds liking, so there is no reason to assume that familiarity and attitudinal evaluation should be unrelated.)

This is such a confusing paragraph that it’s hard to know where to start. Jost seems to be arguing that the “mere exposure effect”, the general principle that people like familiar things and people more than unfamiliar things and people, should be applied to the critique that the IAT might be measuring familiarity with stereotypes rather than endorsement of them. By his (apparent) logic, because of the mere exposure effect, people who are more familiar with stereotypes are more likely to endorse them, so it doesn’t matter if the IAT is really “just” measuring familiarity – idea-familiarity and idea-endorsement are inextricably bound.

Of course, if social psychologists really believed this, they would seek to halt educational programmes that teach children about ugly racial stereotypes out of a fear that the children will, on average, become more racist as a result of this exposure. Luckily, in the real world, the evidence suggests this is a rather radical overstretching of the idea of the mere exposure effect. To take one of countless examples, in the US, Democrats are exceedingly aware of the fact that some Republicans view Mexican migrants as disproportionately criminal – that does not make them more likely to endorse that belief themselves. Mere exposure is powerful in some contexts, but not this one.

So the question of whether the IAT measures something that can be fairly called animus, in the sense of being a preference (in this case, an unconscious one) for one group over another, rather than familiarity with stereotypes, is anything but “ill-posed”. For a long time, people have been told that their test score reflects the former – that they have “implicit bias”. Outside of Jost’s confusing paragraph, no one anywhere would suggest that the definition of “implicit bias” is the same as the definition of “awareness of certain stereotypes.” If the test is claiming to measure one thing when it is really measuring the other, or a mix of the two (I’ve always thought it’s more likely the test is measuring a complicated mix of stuff than that it’s measuring A or B or C, full-stop), of course that’s an important issue to resolve.

Jost’s paper also includes at least one rather misleading claim. In the third paragraph of his article, he writes that “In an especially absurd comparison, the IAT was likened to measuring height with a stack of melting ice cubes.” The citation points to my NY Mag article. In fact, Jost mentions height measurement three times in the paper, and in the abstract he critiques “false analogies between the IAT and measures… [like] physical height”.

But I never “likened” the IAT to any measure of height! Rather, in a section of my article explaining to my lay readers the term test-retest reliability, I used ice cubes as an example to illustrate the concept: “A tape measure has high test-retest reliability because if you measure someone’s height, wait two weeks, and measure it again, you’ll get very similar results,” I wrote. “The measurement procedure of grabbing an ice cube from your freezer and seeing how many ice cubes tall your friend is would have much lower test-retest reliability, because different ice cubes might be of different sizes; it’s easier to make errors when counting how many ice cubes tall your friend is; and so forth.” It’s true that later in the article I note that the available evidence suggests the race IAT has low test-retest reliability, but that’s of course not the same as directly comparing — likening — the IAT to a measure of height.
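For readers who want to see the tape-measure and ice-cube contrast in numbers, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than anything taken from my article or from the IAT literature: the simulated heights, the two noise levels, and the function names `pearson` and `measure_twice` are all hypothetical. It treats test-retest reliability as the correlation between two measurement sessions of the same people, which is one common way of estimating it.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def measure_twice(true_values, noise_sd, rng):
    """Simulate two sessions: each reading is the true value plus random error."""
    first = [v + rng.gauss(0, noise_sd) for v in true_values]
    second = [v + rng.gauss(0, noise_sd) for v in true_values]
    return first, second


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed population: 2,000 people with heights around 170 cm (SD 10 cm).
heights = [rng.gauss(170, 10) for _ in range(2000)]

# Assumed error sizes: a tape measure is off by about half a centimetre,
# while the ice-cube method is off by about 15 cm (hypothetical values).
tape_r = pearson(*measure_twice(heights, 0.5, rng))
ice_r = pearson(*measure_twice(heights, 15, rng))

print(f"tape measure test-retest r: {tape_r:.2f}")
print(f"ice cubes test-retest r:    {ice_r:.2f}")
```

Under these assumed noise levels, the tape measure's two sessions should correlate almost perfectly, while the ice-cube method's should correlate far more weakly, even though both are measuring exactly the same underlying heights. That is all low test-retest reliability means: the instrument, not the person, is changing between sessions.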

As I mentioned, though, the biggest worry with this new review paper is that Jost is rewriting history a bit. Nowhere is this clearer than when he argues that “As a ‘bona fide’ pipeline used to quantify levels of ‘unconscious racism’ as a fixed property of the individual—or as a diagnostic tool to classify people as ‘having’ racism or sexism (like they might ‘have’ clinical depression)—the IAT is dead. I do not know if any researchers of implicit bias actually conceived of the IAT in these ways, but critics continue to assert that this is our conception (Bartlett, 2017; Mitchell & Tetlock, 2017; Singal, 2017). It certainly is not mine.”

But of course evangelists of the IAT have been treating it as a diagnostic tool. Here’s Banaji and Greenwald in Blindspot: Hidden Biases of Good People:

[T]he automatic White preference expressed on the Race IAT is now established as signaling discriminatory behavior. It predicts discriminatory behavior even among research participants who earnestly (and, we believe, honestly) espouse egalitarian beliefs. That last statement may sound like a self-contradiction, but it’s an empirical truth. Among research participants who describe themselves as racially egalitarian, the Race IAT has been shown, reliably and repeatedly, to predict discriminatory behavior that was observed in the research.

Here’s Greenwald, in an email to Quartz’s Olivia Goldhill last year: “The IAT can be used to select people who would be less likely than others to engage in discriminatory behavior.”

Now, to be clear, there have been mixed messages on this front. The quote to Quartz was a serious flip-flop, as I noted at the time, when stood up against the 2015 paper in which Greenwald and Banaji acknowledged the test’s weaknesses. But either way, it’s almost impossible to square Jost’s claim that he isn’t aware of anyone who has claimed the IAT measures “‘unconscious racism’ as a fixed property of the individual[…] or as a diagnostic tool to classify people as ‘having’ racism or sexism” with the fact that two of his coauthors have claimed that the IAT can predict individuals’ likelihood of engaging in discriminatory behavior.

Being charitable, I suppose there’s wiggle room with the word “fixed”. But in the initial hyping of the IAT, its evangelists definitely promoted the idea that what it was measuring was, if not “fixed,” at the very least stable – you will not find much circa-2005 coverage of the IAT in which the test’s proponents highlight IAT results as fluid and easily manipulated. In fact, in the 1998 University of Washington press release covering the IAT’s unveiling, the author notes that while “Banaji and Greenwald admitted being surprised and troubled by their own test results, they believe the test ultimately can have a positive effect despite its initial negative impact. The same test that reveals these roots of prejudice has the potential to let people learn more about and perhaps overcome these disturbing inclinations.” Why would this be an issue if the test wasn’t being promoted as measuring something stable that an individual could only “perhaps overcome”? In 2005, Banaji went further, telling the Washington Post she was “deeply embarrassed” at her test result – but why would she be unless she considered the IAT to be measuring something stable about herself? There’s a whole subgenre of anecdotes in which individuals reveal their discomfort at their IAT results. That doesn’t jibe with the idea that it has historically been viewed as a noisy, extremely context-dependent measure.

The goalposts are ever-shifting. Or maybe the better term is motte and bailey: the fallacy of advancing a bold claim you can’t justify with evidence (the bailey) and then, when called on to justify it, retreating to a much less controversial position (the motte). Bailey: The IAT predicts individuals’ levels of unconscious bias (or racism, if you will), and therefore their future behavior. Then, after the convincing methodological critiques and underwhelming meta-analyses roll in, motte: We never said the IAT predicts individuals’ levels of unconscious bias/racism. Rather, as Jost puts it, “[I]f the IAT measures a particular attitude at a given time in a specific social context, there is nothing inherently problematic (or unethical) about providing people with feedback concerning that attitude measurement, recognising that it is, after all, only one measurement at one point in time.”

Setting aside that this clearly wasn’t how the test was presented to the millions of people who have taken it on the Harvard Implicit website and in diversity trainings and in other settings, it’s a confusing claim: What is the “social context” of taking an IAT? It’s hard, in fact, to come up with a more decontextualized social-psychological experimental experience. You’re sitting at a computer, pecking at keys, not engaging in any interaction with another human being. Now, often IAT results are correlated with how people behave in certain (mostly quite canned) social contexts, but the vast bulk of the IAT research does not measure people’s “attitude at a given time in a specific social context.” Or that hasn’t been the claim, at least.

The point is this: You can’t have it both ways. You can’t portray a test as revolutionary, mention it in the same breath as the discovery of the telescope, suggest it offers new racial insights on the order of Michelle Alexander’s or Ta-Nehisi Coates’ (as claimed in an email from one of Banaji’s students that she forwarded to the Quartz reporter Goldhill), and claim it does this and that and the other thing – and then pretend you never said those things. It’s a worst-case example of science communication, and on a vitally important subject, no less. This whole debate has been quite demoralising.

Post written by Jesse Singal (@JesseSingal) for the BPS Research Digest. Jesse is a contributing writer at New York Magazine. He is working on a book for Farrar, Straus and Giroux about why shoddy behavioral-science claims sometimes go viral.
