It’s getting increasingly difficult for replication-crisis sceptics to explain away failed replications

The Many Labs 2 project managed to successfully replicate only half of 28 previously published significant effects

By guest blogger Jesse Singal

Replicating a study isn’t easy. Just knowing how the original was conducted isn’t enough. Just having access to a sample of experimental participants isn’t enough. As psychological researchers have known for a long time, all sorts of subtle cues can affect how individuals respond in experimental settings. A failure to replicate, then, doesn’t always mean that the effect being studied isn’t there – it can simply mean the new study was conducted a bit differently.

Many Labs 2, a project of the Center for Open Science at the University of Virginia, embarked on one of the most ambitious replication efforts in psychology yet – and did so in a way designed to address these sorts of critiques, which have in some cases hampered past efforts. The resultant paper, a preprint of which can be viewed here, is lead-authored by Richard A. Klein of the Université Grenoble Alpes. Klein and his very, very large team – it takes almost four pages of the preprint just to list all the contributors – “conducted preregistered replications of 28 classic and contemporary published findings with protocols that were peer-reviewed in advance to examine variation in effect magnitudes across sample and setting.”

Many Labs 2 included 79 samples of participants tested in-person and 46 samples tested online, and of these, 39 were from the U.S. and 86 came from a variety of other countries. Among the previously published findings the researchers tried to replicate were: one in which study participants who read about structure in nature said they were more likely to pursue their goals than those who read about randomness in nature (as if exposure to structure, even if unrelated, somehow motivates us); a famous example of a “framing effect” by the behavioural-economics pioneers Amos Tversky and Danny Kahneman, in which respondents were more willing to drive further to get a discount on a cheap item than to get an equal-sized discount on an expensive item (which doesn’t make sense, “rationally” speaking, when the discounts are the same size); and another famous finding, this one from a team led by the leading social psychologist Jonathan Haidt, that showed that, among survey respondents, “Items that emphasised concerns of harm or fairness… were deemed more relevant for moral judgment by the political left than right.”

Overall, 15 of the 28 attempted replications “worked,” in the sense of delivering the same finding in the same direction at a statistically significant level (p < .05). When the threshold for significance was tightened by a couple of orders of magnitude to p < .0001 – a stricter standard some researchers have advocated as a countermeasure against questionable research practices like p-hacking – that number dropped to 14. So in total, only about half the studies replicated, and on average, the effect sizes were significantly smaller than those originally reported. This is not encouraging. (The study about nature-stories and goals failed to replicate, but the Tversky/Kahneman and Haidt ones did replicate, albeit with smaller effect sizes than in the originals.)

Many Labs 2 was designed to address some of the problems, perceived and real, with previous replication efforts. In August 2015, for example, the Open Science Collaboration published the “Reproducibility Project”, which was able to successfully replicate only about 40 percent of 100 then-recent papers. Afterwards, a team – including the famous social psychologist Dan Gilbert (a replication-crisis sceptic) and the leading quantitative social scientist Gary King – argued that there were statistical errors in that effort, and also that in some cases the replicators hadn’t followed the same procedures, or used the same sorts of samples, as the original experimenters. (The whole thing got rather tangled and included responses to responses to responses – you can read my writeup for New York magazine here.)

The question of whether replications are “close enough” to the original is especially important. When replications differ enough from the original studies, this introduces legitimate methodological concerns into the equation and gives those whose work fails to replicate an “out” – “You weren’t really replicating my study – you changed too much stuff.” But researchers in Many Labs 2 were more careful to follow the original studies quite closely. As a result, they argue, “variability in observed effect sizes was more attributable to the effect being studied than the sample or setting in which it was studied.” Specifically, “task order, administration in lab versus online, and exploratory WEIRD versus less WEIRD culture comparisons” – that is, whether the experiment participants were western, educated, and from industrialised rich democratic countries (a major concern in psych research, where it’s often much easier to find WEIRD study participants than non-WEIRD ones) – all failed to account for much of the variation in effect strength observed in the studies. This partially undercuts the idea that these sorts of differences might have mattered or accounted for failures to replicate, and it also shows that psychology can be conducted in a robust way such that the findings – whether positive or negative – are not unduly influenced by the circumstances.

Moreover, explained Brian Nosek, head of the Center for Open Science and a coauthor on the paper, on Twitter, this time around researchers “minimised boring reasons for failure. First, using original materials & Registered Reports all 28 replications met expert reviewed quality control standards. Failure to replicate not easily dismissed as replication incompetence.” He also pointed out that the “Replication median sample size (n = 7157) was 64x original median sample size (n = 112). If there was an effect to detect, even a much smaller one, we would detect it. Ultimate estimates have very high precision.”
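
For readers who want to see why a 64-fold increase in sample size matters so much, here is a rough sketch in Python. It is an illustration, not the paper's analysis: it assumes a simple two-group comparison with the sample split evenly, uses a normal approximation for statistical power, and picks a small standardised effect of d = 0.2 purely as an example. The function name `approx_power` is made up for this sketch.

```python
import math
from statistics import NormalDist


def approx_power(d, n_total, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test.

    Assumptions (not from the paper): equal group sizes, a standardised
    mean difference d, and a normal approximation to the test statistic.
    """
    std_normal = NormalDist()
    n_per_group = n_total / 2
    standard_error = math.sqrt(2 / n_per_group)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d / standard_error
    # Probability of landing beyond either critical value when the true effect is d
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


original_n = 112       # median original sample size quoted by Nosek
replication_n = 7157   # median replication sample size quoted by Nosek
example_d = 0.2        # hypothetical small effect, chosen only for illustration

size_ratio = replication_n / original_n
precision_gain = math.sqrt(size_ratio)  # standard errors shrink with sqrt(n)

print(f"Sample size ratio: about {size_ratio:.0f}x")
print(f"Standard error shrinks by a factor of about {precision_gain:.1f}")
print(f"Approx. power at n = {original_n}: {approx_power(example_d, original_n):.2f}")
print(f"Approx. power at n = {replication_n}: {approx_power(example_d, replication_n):.2f}")
```

Under these assumptions, a study of the original median size would usually miss a small effect of this kind, while one of the replication median size would almost never miss it. That is the logic behind Nosek's claim that a null result from Many Labs 2 is hard to blame on an underpowered design.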

Of course, a single failed replication – even a big, robust one – should not cause us to confidently rule out an effect any more than a single positive result should cause us to believe it is definitely real. But this big, impressive effort advances the replication conversation in two important ways: It adds to the pile of evidence that there is a replication crisis, and it offers a useful, replicable (sorry) set of guidelines for how to conduct rigorous replications that actually measure what experts are interested in, rather than accidentally sweeping up other stuff.

Many Labs 2: Investigating Variation in Replicability Across Sample and Setting

Post written by Jesse Singal (@JesseSingal) for the BPS Research Digest. Jesse is a contributing writer at New York Magazine. He is working on a book about why shoddy behavioral-science claims sometimes go viral for Farrar, Straus and Giroux.
