In a recent paper titled "Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015", a group of researchers set out to replicate a series of experiments published in prestigious academic journals, employing larger samples and a range of statistical techniques to test their reliability. This Social Science Replication Project (SSRP) defined how the conclusions to be re-tested were selected (clear quantitative results, a defined sample, random choice where a study offered multiple testable results). It also prescribed that, when the first round of tests (Stage 1) failed to replicate the original results, researchers would run a second round (Stage 2) with a larger sample to look for at least partial confirmation. The chosen experiments cover different fields, including social behaviour, memory, and economic choices.
The replication project confirms about 50% of the expected overall result. The "sign" of the result – i.e. the positive or negative direction of the effect attributed to a certain cause – is confirmed in 62% of the cases (between 52% and 67%, depending on the statistical technique used). Moreover, when a result is confirmed, the "intensity" of the effect amounts to 71-75% of the original estimate; when it is disproved, the effect is close to nil (i.e. no causality). Interestingly enough, in one case Stage 2 was mistakenly run on an experiment already confirmed by Stage 1, and it turned out that Stage 2 rejected the conclusions reached by Stage 1.
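To make these two metrics concrete, here is a minimal sketch with hypothetical numbers (not SSRP data): "sign replication" asks whether the replicated effect points in the same direction as the original, and "relative effect size" expresses the replicated effect as a fraction of the original estimate.

```python
# A minimal sketch, using hypothetical numbers rather than SSRP data, of the two
# metrics discussed above: sign replication and relative effect size.

def sign_replicated(original_effect: float, replication_effect: float) -> bool:
    """True if the replication points in the same direction as the original."""
    return (original_effect > 0) == (replication_effect > 0)

def relative_effect_size(original_effect: float, replication_effect: float) -> float:
    """The replicated effect expressed as a fraction of the original effect."""
    return replication_effect / original_effect

# Hypothetical example: an original study reports an effect of 0.40,
# and the replication finds 0.30 in the same direction.
orig, repl = 0.40, 0.30
print(sign_replicated(orig, repl))       # True
print(relative_effect_size(orig, repl))  # 0.75, i.e. 75% of the original
```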
Moreover, peers were asked about the expected replicability of the experiments. It seems that the scientific community had fairly accurate ideas about the outcomes (the expected confirmation rate was 61-63%).
At first blush, these results provide plenty of food for thought. Yet the SSRP tested only 21 experiments. Such small numbers are apparently common in this kind of research; research time and effort are certainly expensive. However, one cannot evaluate the quality of scientific research by considering only 21 studies. In truth, the authors are aware of this weakness and acknowledge that "chance" may have played a role, and that mistakes could have occurred in replicating the experimental protocols (although the authors of the original studies validated the procedures). That said, the problem of the reliability of the original experiments and of the replication project remains. More generally, one wonders about the quality of empirical academic research.
In the past, other projects looked into the reliability of empirical research in social sciences. Here is what they found:
– Reproducibility Project: Psychology (RPP) re-tested 100 experiments, with a rate of replication of the direction of causality equal to 36%;
– Experimental Economics Replication Project (EERP) re-tested 18 experiments, with a rate of replication of the sign equal to 61%;
– Many Labs 1 re-tested 13 studies, with a rate of replication of the sign equal to 77%;
– Many Labs 2 re-tested 28 studies, with a rate of replication of the sign equal to 50%;
– Many Labs 3 re-tested 10 studies, with a rate of replication of the sign equal to 30%.
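As a rough, back-of-envelope aggregation (my own illustration, not the method used by the SSRP authors), pooling these projects together with the SSRP's 21 studies gives a combined sign-replication rate of roughly 46%, which is the arithmetic behind the observation that follows. The sketch ignores overlap between projects and differences in how each one defined a successful replication.

```python
# Back-of-envelope pooling of the replication projects listed above plus the SSRP.
# This simple weighted average is only an illustration; it ignores overlap between
# projects and differences in how each defined a successful replication.

projects = {
    "RPP":         (100, 0.36),
    "EERP":        (18,  0.61),
    "Many Labs 1": (13,  0.77),
    "Many Labs 2": (28,  0.50),
    "Many Labs 3": (10,  0.30),
    "SSRP":        (21,  0.62),
}

total_studies    = sum(n for n, _ in projects.values())
total_replicated = sum(n * rate for n, rate in projects.values())

print(f"Pooled studies: {total_studies}")                                 # 190
print(f"Pooled replication rate: {total_replicated / total_studies:.0%}") # ~46%
```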
In brief, these numbers show that in the social sciences at least half of the conclusions regarding the "direction" and the "intensity" of causality are wrong. The SSRP researchers conclude that "[t]his finding bolsters evidence that the existing literature contains exaggerated effect sizes because of pervasive low-powered research coupled with bias selecting for significant results for publications". They speculate that, in aggregate, the experimental literature is replicable in somewhere between 35% and 75% of cases. Ouch!
If these conclusions are right, it is manifest that "impact analyses" grounded in current empirical knowledge (economics included) are dubious, to say the least. Granted, some results can be more certain (or plausible) than others. But if the expected confirmation of "scientific truths" lies somewhere between 0% and 75%, policies based on the scientific literature are in fact an act of faith.
A problem of "time consistency" is also present. During the SSRP exercise, one of the authors of the original studies tried to replicate five different experiments but failed. In a similar vein, one of the original authors explained the lack of successful replication of his experiment by stating that an "increasing familiarity with economic game paradigms among […] samples may have decreased the replicability of their result". In other words, the very running of the experiment destroys the phenomenon, because people "learn". This means that "historical" results (statistics is a "historical" science, in Carl Menger's definition) can tell us something that characterises a specific time and context, but that is useless in other periods and other situations. In the long run (whatever that means) there are "no constants, only variables", as Ludwig von Mises warned. Maybe this helps explain why opposing economic theories persist despite "the development of science", and why "economic scientists" are so divided (and continue to make wrong forecasts).
Finally, a positive remark. As related above, peers seemed to have fairly accurate expectations of how the SSRP would turn out. As in the story of the Plymouth ox, there exists some kind of (informed) "common sense" that allows people (and scholars) to separate the wheat from the chaff: you might come up with the wrong prediction about an effect, but you can at least guess whether an effect is to be expected at all. It is not clear whether the foundations of this ability are scientific and formal or based on experience ("dispersed and informal", to recall the Austrian approach). Yet we can conclude that you need a number of peers who freely contribute – or coordinate – to take the right direction, and that a guide (planner and organiser) placed above the peers – or instead of them – is not the best solution.