
How to fix the replication crisis in psychology

Updated: May 11, 2023

What if almost half the things we thought we knew about human psychology turn out to be wrong?

The field of academic psychology has been going through a “replication crisis” for the last decade or so. Many things we thought we knew about human psychology from past research turn out not to be supported by the data when we test for those effects again. One attempt to replicate 100 studies that were published in prominent psychology journals in 2008 found that only about 40 of the 100 original findings replicated. Other replication efforts in psychology have found similar results.

This seems like pretty bad news. After all, research findings in psychology are used in ways that have a profound impact on people’s lives. What if roughly half of the things we’ve been basing those decisions on just aren’t true? How can you know what research to trust if a lot of it doesn’t hold up in subsequent testing? We’re putting forward a new approach designed to improve this situation.

Image generated using DALL•E

We’re excited to announce that Clearer Thinking has launched a new project, Transparent Replications, which aims to address this problem by replicating a substantial fraction of papers coming out in top psychology journals. This project shifts incentives for researchers and journals by publicly rating papers on their transparency, replicability, and clarity. We run replications of studies shortly after they are published and then release our ratings and complete replication reports on our website.

Right now replication attempts by independent teams are rare – from 2000-2010 it seems that only about 1.5% of published papers in psychology were replication efforts, and fewer than half of those were by independent teams. Our project makes replication attempts on newly published papers far more likely. For researchers, the increased likelihood that an independent team will run a replication of their study incentivizes the use of research designs that are more likely to successfully replicate. Additionally, our transparency and clarity ratings incentivize the use of good open science practices, and encourage writing that is clear about what the study results do and do not show. Finally, we aim to celebrate and spread the high quality studies that we evaluate.

If you'd like to hear when our new replications are released (to find out if top new psychology studies hold up in our evaluations), sign up for the Transparent Replications newsletter here. There are even ways that you can get involved in the Transparent Replications project!

In the rest of this article, we'll discuss more about what the replication crisis means, share some of the ways that you can use the work we’re doing at Transparent Replications, and end with a few reasons for optimism.

Why is replication important?

Let’s start with some basics for a moment, because they’re important. Why does it matter if scientific findings replicate? What does it mean if they don’t replicate?

The scientific method is an amazing process - one of the most incredible that humans have ever invented, and a big reason why is that science, if we do it right, is self-correcting. The process of coming up with an idea, testing it, sharing the methods and results, engaging with critical feedback, revising our hypotheses, and testing again is how science produces knowledge.

Two important conditions need to be met for this self-correcting process to work in practice:

  1. Researchers need to provide enough information about how they test hypotheses that someone else could run the same test and see if they get consistent results.

  2. Other researchers have to actually carry out those independent tests to find out if the results are confirmed or if they find something different.

Finding inconsistent results does not mean that science itself has failed; when results are not consistent, we have the opportunity to dig in more deeply to figure out what the errors or other causes are and what the anomalies mean. That is how much of scientific progress is actually made.

So, if the scientific method is supposed to work so well, how have so many findings that can’t be replicated made their way into textbooks, formed the foundation for later research, and become parts of our common understanding of human psychology? The answer to this question comes down to the differences between the way science works in theory and the way it is often conducted in practice. Science’s self-correction process isn’t working effectively enough, so false findings are able to become prominent and stick around for a long time.

In practice, even when the information needed to replicate scientific work was available, researchers were not prioritizing replicating published findings because it was not valued in the same way as new research, and was difficult to get published. Scientists have limited time and limited funding; running replication studies was not what they were being encouraged to do with those limited resources. Getting funding for research, getting results published in prominent journals, and advancing in a scientific career required putting forward your own hypotheses. Hypotheses that were novel and results that revealed something unexpected were especially prized. These incentives facing scientists are a big part of the reason that we are where we are now.

Why do results not replicate?

The lack of replication attempts is only part of the story. There is a possible world in which, when researchers did begin to attempt replications of past research, they found that most results did in fact replicate. Sadly, this is not that world. So why is it that so many actual study results have failed to replicate?

There are many reasons why a study might generate a result that doesn’t replicate. Here are a few common ones:

  • All research relying on statistical analyses has a chance of generating a “false positive” result - a result that seems to show a real effect, but is actually a fluke caused by the noise in that particular sample of data. This is, to some extent, an unavoidable aspect of the scientific process.

  • The “file-drawer problem” occurs when researchers test a hypothesis, and find nothing interesting (i.e. null results), so those results get shoved into a file drawer because journals aren’t interested in publishing them. This produces a selection bias on which results are published. A hypothesis may have been tested many times with null results, but because those results aren’t interesting enough to publish, they don’t make it into the literature. When a false positive result occurs on the same hypothesis, that result is likely to be published and appear in the literature as the only test of that hypothesis because the null results are stuck in file drawers so people aren’t aware of them.

  • Studies with small sample sizes can also contribute to results failing to replicate. If a sample size is small, the study may not have enough statistical power to reliably detect real effects. This results in a larger share of the significant results these underpowered studies do report being false positives.

  • On top of that, researchers need to find interesting results in order to get published, and getting published is what it takes to win funding and have a successful career. This means that a lot of researchers are searching for statistically significant results (results where p < 0.05) and some of them are running large numbers of tests (or making minor variations in the tests they run) to find ones where the significance threshold was met. If you think about it, it’s pretty easy to figure out what the problem with this is: the more tests you conduct, the more likely you are to (eventually) get a false positive result. This practice is now known as “p-hacking,” and it has become among the most infamous of a number of questionable research practices (QRPs) that have come under scrutiny in the last decade. It even landed a segment on Last Week Tonight with John Oliver!

  • Another common QRP is called “HARKing,” which stands for Hypothesizing After the Results are Known. If you run a bunch of tests on your data and find a few results that meet the standard for statistical significance, you can then come up with a story to explain what those results mean, and present that story as the hypothesis that you were testing for in your study.
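To make the point about sample size and false positives concrete, here is a minimal sketch of the arithmetic. It is an illustration under stated assumptions, not a description of any particular literature: the function name `share_false_positives` is ours, and the 25% figure for how many tested hypotheses are really true is a made-up input chosen only to show the pattern.

```python
def share_false_positives(prior_true: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of statistically significant results that are false positives.

    prior_true: assumed share of tested hypotheses that are really true
                (an illustrative input, not an estimate from the article).
    power:      chance that a study detects an effect that is really there.
    alpha:      significance threshold, i.e. the false positive rate per test.
    """
    true_positives = prior_true * power            # real effects that reach significance
    false_positives = (1.0 - prior_true) * alpha   # null effects that reach significance by chance
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Same assumed share of true hypotheses, three different levels of power.
    for power in (0.8, 0.5, 0.2):
        share = share_false_positives(prior_true=0.25, power=power)
        print(f"power={power:.1f}: {share:.0%} of significant results are false positives")
```

Under these assumptions, lowering power from 0.8 to 0.2 raises the share of significant results that are false positives from roughly 16% to roughly 43%, even though the significance threshold never changes.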

It’s important to note that these QRPs are usually not fraud. Actual fraud, like making up data entirely without even conducting a study, does happen from time to time, but it is fortunately rare. In cases of the much more common QRPs, researchers are often fooling themselves into thinking they have really found something (or they may have some doubts about whether their research would replicate, but choose not to look further).

There are many, many decisions that are made along the way in any research project, and if you’re hoping to get a certain result, it’s easy to make these decisions in a way that biases the results without realizing you’re doing it. These “researcher degrees of freedom” can easily lead even very good scientists down what statistician Andrew Gelman calls a “garden of forking paths.”
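A rough calculation shows why running many tests, or trying many analysis variations, matters so much. The sketch below assumes that no real effect exists and that every test is independent of the others. That is a simplification: variations on the same analysis are usually correlated, so the real numbers will differ. The function name `chance_of_false_positive` is ours.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` tests comes out significant
    when no real effect exists.

    Assumes the tests are independent, which is a simplification.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Chance that every test correctly stays non-significant is (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for num_tests in (1, 5, 10, 20):
        chance = chance_of_false_positive(num_tests)
        print(f"{num_tests:>2} tests: {chance:.0%} chance of at least one false positive")
```

With a 0.05 threshold and these assumptions, the chance of at least one false positive is 5% for a single test, about 23% for 5 tests, about 40% for 10, and about 64% for 20. It passes 50% at 14 tests.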

How do we fix this replication crisis?

At Transparent Replications we do several things to try to tackle this thorny problem.

1. We focus on replicating recently published studies

There are two main reasons we focus on replicating recently published studies:

First, we’re trying to shift the incentives of journals and researchers. If researchers know that there is a reasonable likelihood that a study of theirs could be selected for a replication shortly after publication in a prominent journal, that encourages the publication of work that is replicable.

Second, we want to make information about the replicability of results available quickly so that people can adjust their confidence in study results right away. This should benefit studies that replicate by increasing the confidence people have in those results. We also hope it will reduce the problem of non-replicable findings gaining traction and becoming the foundation for further research, justification for policy decisions, or the basis for widely-publicized advice.

2. We publish our reports for free on our website

Our replication reports are all freely available on our website because we want to make this information easily accessible to anyone who is curious about evaluating research findings.

3. We rate studies on transparency, replicability, and clarity

We rate studies on their transparency, replicability, and clarity. We chose those criteria because we think they are the most important metrics for improving the trustworthiness, reliability, and usefulness of scientific research in psychology and human behavior. By highlighting those ratings, we hope to shift the incentives of researchers and journals to prioritize those criteria more highly.


We ran a replication of study 2 from this paper, which assessed how people’s beliefs about the causes of variation in individuals’ financial well-being related to their support for different policy goals governments might try to achieve with economic programs.

Here is how our ratings work:

  • Transparency - The transparency criteria assess how well the paper follows open science standards, including pre-registering hypotheses and tests, and making research materials available to the public so that it’s possible to reproduce results or run a replication study. Pre-registration reduces p-hacking and HARKing because researchers publicly report what their hypotheses are and what tests they are planning to conduct before they have collected data, which means it's harder for them to go on a fishing expedition for results and generate a story to explain those results after the fact. Rewarding research that follows these best practices with a high rating is one way we shift incentives and encourage practices that make research more reliable.

  • Replicability - The replicability rating reports what fraction of the primary hypotheses from the original study matched what we found when we re-ran the study.

  • Clarity - The clarity rating rewards studies for clear presentation of research findings. Too often, papers are difficult to interpret in ways that make it easy for the implications of the research to be misunderstood, or to think that a study proved something that it did not. This can become even more of a problem when research findings are being reported in the press. Writing clearly about what a study did, what the results are, and how those results are connected to the hypotheses and the conclusions is extremely important. We highlight when it is being done well, and give people a heads up when something might be easy to misinterpret.

4. We prioritize making our reports easy to understand

In addition to giving clear ratings at the top of our reports, we want the information about what a study is testing and how the test is being done to be really easy for any interested reader to quickly understand. Sometimes papers describe studies in ways that are hard to grasp, which makes it difficult to interpret the results. Making research more understandable makes it both more accessible and easier to evaluate. All of our reports have a Study Diagram near the top so that you can get an overview of what the study is about at a glance.

Here’s an example of a Study Diagram from our third replication report:

[Image: Study Diagram from our third replication report]

5. We make all the details available

We practice what we preach; we make all of our study materials, data, and analysis code available on our website. We preregister our replication studies, which means that we state clearly and publicly how we are evaluating the studies we replicate and what results would count as a successful replication. We also communicate with the original research teams to seek their feedback and ensure that our materials capture what they did in their original study as accurately as possible. Although the early sections of our reports are focused on making the results accessible quickly and easily, the later sections of the reports offer all of the necessary technical detail for a deeper dive into the methods and results.

And now for some more good news

The good news is, Transparent Replications isn’t alone in working to make social science research better and more reliable. There is a growing movement supporting “open science.” For example, the Center for Open Science has set standards for transparency that are catching on with journals, researchers, and funders. The transparency ratings that we use build on best practice concepts in open science that also inform these standards. If you’re interested in learning more about how our project interacts with the broader psychological research and open science communities, check out the discussion with project founder, Spencer Greenberg, on the Two Psychologists, Four Beers podcast. You can also read about the project in Sigal Samuel’s new Vox article, “Lots of bad science still gets published. Here’s how we can change that.”

We hope you’ll also take a look at our reports, and that you find them useful. If you do, the easiest way to stay in touch with the project is to sign up for email updates. You’ll be the first to know about what we’re working on, including when a new replication report (about a recent psychology study in a top journal) comes out or when we post a study on a prediction market. (Don’t worry – we won’t email more than a few times a month!)

This is a big project, and getting more people involved who value transparent, replicable, and clear social science research makes it much more likely to be successful. We are looking for people to run replication studies as part of our Replication Scholars program. If you have experience running experiments in the social sciences, and are looking for a deeper way to get involved in this work, fill out our quick interest form and we’ll talk!

We also welcome your feedback. This is a new project, and we know there are lessons to learn and things we will improve along the way. Your input can help us improve faster.

We're excited about playing our part in developing more robust, reliable psychology knowledge built on a solid foundation of transparent, replicable, and clear research!


Colin Rust

This is great!

One detail I'm a little uncomfortable with: the discrete nature of the results (just "0", "+" or "-"). One issue of course is there isn't really much difference between p=0.04 and p=0.06, but the former would count as replication and the latter not. Another is that we don't necessarily just care about statistical significance: the effect size can be important too and so very different, but apparently statistically significant effect sizes maybe shouldn't count as fully replicating.

It would be nice to see p-values and ideally also effect sizes in a summary table. Looking at your first writeup, you do highlight p-values for those tests that failed to replicate (but still had the same sign). …

Anders Kuvaas Herting

I second this. Regarding effect size, you could e.g. preregister a "smallest effect size of interest" based on the original finding and what you would consider practically significant.
