How to Tell Strong Evidence from Weak Evidence

Travis M.
3 hours ago

Short of time? Read the key takeaways.

❓ Not all evidence carries equal weight. When people argue about topics like vegetarian diets, they mix different kinds of evidence (such as survey data, randomized controlled trials, mechanistic reasoning) without realizing that these different types of evidence vary dramatically in reliability and shouldn't be treated as equivalent.

📊 Hierarchies of Evidence rank types of studies by their typical reliability. Systematic reviews and meta-analyses sit at the top, followed by randomized controlled trials, observational studies, animal studies, mechanistic reasoning, and finally anecdotes, which provide the weakest evidence despite being widely cited.

🎲 Randomization is what makes randomized controlled trials uniquely powerful. By randomly assigning participants to groups, they ensure that traits (both measured and unmeasured) average out, allowing researchers to confidently conclude that the intervention being measured, not some outside factor, caused observed differences.

⚠️ Confounders can make weak studies dangerously misleading. Beta-carotene appeared to prevent lung cancer in observational studies, but later randomized trials showed it actually increased cancer rates; healthier lifestyles among vegetable eaters had created a false association.

🔍 Even top-tier evidence has pitfalls worth checking. Meta-analyses can include low-quality studies, combine incomparable research, or reflect publication bias toward positive results, so asking critical questions about their methodology remains essential regardless of their position in the hierarchy.

Imagine four people arguing about whether a vegetarian diet is healthy or not.

One person points to a personal story: “My friend became a vegetarian and their health improved drastically afterward.” Another counters by citing a survey showing no benefit from vegetarianism. The third person mentions that several experts recommend it. Then the fourth brings up a large randomized controlled trial where some people were randomized to be vegetarian for a month.

This kind of argument happens constantly, about all sorts of topics. People bring up different kinds of studies, expert opinions, and anecdotes about “what worked for me,” without necessarily understanding the differences or knowing how much weight to give to each.

How much should each type of evidence count?

Most people don't realize that there are well-established, codified ways to decide how strong evidence is. We’ve previously written about one that we call “The Question of Evidence,” and in this article, we turn our attention to another: the “Hierarchy of Evidence.”

The Hierarchy of Evidence is a valuable heuristic tool that (if used correctly) can improve your scientific thinking. This article explains what the Hierarchy of Evidence is and guides you through the nuances that people often miss when thinking about it, including its weaknesses. Once you’ve finished reading, you’ll be able to say why some studies are typically better than others, and you’ll know some key things to look out for when evaluating the strength of scientific evidence.

What is the Hierarchy of Evidence?

To understand the hierarchy of evidence, let's start with a story of scientists getting things horribly wrong.

Certain vegetables naturally contain lots of a pigment called ‘beta-carotene’. A few decades ago, lots of observational studies from around the world found a strong association between consuming higher amounts of beta-carotene and having lower rates of lung cancer. There was even a plausible-sounding explanation for why this would be the case (antioxidants neutralizing free radicals that damage DNA). So, in the 1990s, many researchers believed it had been established that those vegetables protected against lung cancer.

But when randomized controlled trials, a different type of study, were later conducted, they found that taking beta-carotene supplements significantly increased rates of lung cancer (and even other cancers). The results were so concerning that one large trial was stopped 21 months early because of “substantial evidence of possible harm.” These findings were also later supported by meta-analyses.

So, what went wrong? The best explanation seems to be that people who consume lots of beta-carotene from vegetables also do other things that are genuinely preventative of certain cancers (e.g., exercise more, smoke less, have healthier diets). So, beta-carotene was correlated with but didn’t cause lower cancer rates, and when people without those other lifestyle factors took beta-carotene supplements, it caused them to be worse off, on average. Higher beta-carotene intake was indicative of being the sort of person who gets less lung cancer, but it wasn't the beta-carotene that was preventing the cancer (in fact, it likely was doing the opposite).

This story illustrates an important fact: if you’re trying to answer a scientific question, observational studies (that simply make measurements without randomizing anything) typically provide you with less robust evidence about the causes of things than randomized controlled trials (which randomize which people do what). In general, some kinds of evidence are typically more robust and less susceptible to bias than others.

That’s the idea behind the Hierarchy of Evidence; it’s an attempt to codify the tendencies of certain kinds of evidence to be more reliable than others. At the top, you have the most (typically) reliable forms of evidence, and at the bottom, you have the least. Here’s a diagram depicting a fairly representative example of the most common, valid kinds of Hierarchies of Evidence you’re likely to see. By the end of this article, you'll understand this hierarchy:

We go into much greater depth down below, but here's the hierarchy in a nutshell:

The Short Explanation

Systematic reviews and meta-analyses are at the very top (typically providing the best evidence) because they synthesize many studies in a way that counteracts problems that individual studies can have.

Randomized controlled trials (sometimes referred to as “randomized experiments”) come next because this type of study can reliably isolate what causes what (through the power of randomization, which we'll discuss later).

After that, we have a variety of types of observational studies. These can discover what's associated with or predictive of what, but struggle to tell us what causes what. For instance, they might discover that eating broccoli is associated with lower mortality, but that doesn't prove that eating broccoli causes a reduction in mortality.

Next in the hierarchy, we have animal studies and mechanistic reasoning. An example of mechanistic reasoning is “we believe X causes Y, which causes Z, and therefore we think this treatment will work.” Both of these approaches take speculative leaps and, therefore, are typically fairly weak evidence. Animal studies (when applied to humans) assume that what's true in that type of animal will be true in humans, which often fails to pan out. And mechanistic reasoning assumes that we understand the mechanisms properly, but proposals for mechanisms often turn out to be wrong.

Finally, we have anecdotes and personal stories. These typically provide very weak evidence for a wide variety of reasons (too few data points, selection bias in what's reported, biases in memory, misattribution, placebo effects, the possibility of lying or exaggeration, and so on). Yet people rely on this lowest tier a great deal.

It’s worth noting that, although people talk about the Hierarchy of Evidence, the truth is that many have been proposed. One researcher did his PhD on the differences between different Hierarchies of Evidence and has kept a list of the different proposals he’s encountered. To date, he has identified a shocking 195 different versions. Despite all this variation, there are things that different versions tend to have in common, because they genuinely reflect typical differences in evidence strength, and so are still valuable. We’re going to use the one in the diagram above as our guide in this article.

It’s easy to see why the concept of a Hierarchy of Evidence could be extremely useful: it simplifies the process of determining whether some particular piece of evidence is more reliable than some other piece, thereby making all sorts of scientific questions easier to answer – from which medical treatment is best, to deciding which controversial theory to believe.

But of course, any type of study can be useless, no matter where it sits in the hierarchy. If it's fraudulent (e.g., uses made-up data) or very incompetently executed (e.g., far too few participants are used so the effect is unlikely to be detectable), it's likely to be of no value. Hierarchies of evidence thus only apply to well-executed versions of studies. They tell us how reliable well-executed studies of different types are relative to each other.

Evidence hierarchies are typically used in science, but other disciplines have their own versions (such as this one for making legislative decisions). For the remainder of this article, we’ll:

Use the diagram above as a guide to give a longer explanation of each of the most common kinds of evidence that you’ll see in evidence hierarchies, and explain their strengths and weaknesses. Feel free to skip any or all of these, if you like.
Wrap up with some advice on how to apply these insights in your day-to-day life when you encounter scientific claims.

The Long Explanation

Each of the explanations below can be read independently of the others. Feel free to read only the ones you’re most interested in (or none at all, of course). Or, if you like, just skip to the concluding section about how to apply these insights to your life.

Level 1: Systematic Reviews & Meta-Analyses

Definitions:

A systematic review is a comprehensive assessment of the scientific literature on a topic. It involves systematically searching research databases to find relevant publications, assessing their quality, and synthesizing their results in order to draw the most well-supported conclusions possible from the literature as a whole.

A meta-analysis is a statistical method sometimes used within a systematic review, whereby the results of different studies (often randomized controlled trials) are combined to provide more precise and reliable estimates than any one study is able to.

Strengths:

There are good reasons why systematic reviews and meta-analyses are typically the highest point on evidence hierarchies. For example:

They engage with the total (or at least a large) amount of scientific evidence about a question (rather than just one study). So, although an individual study might get an incorrect result due to chance or study-specific biases, meta-analyses reduce this effect by averaging across many studies.

By grouping studies together, meta-analyses are able to detect smaller effects than a single trial.

They can help prevent cherry-picking. Because systematic reviews and meta-analyses aim to include all relevant evidence, they provide some defense against the natural human tendency to have unconscious biases in favor of studies that support our preferred conclusions.

By zooming out and looking at the totality of available evidence, they can reveal patterns or inconsistencies in the literature. This can help to generalize important findings or direct future research.
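To make the averaging idea concrete, here is a minimal Python sketch of one common way to pool results: fixed-effect, inverse-variance weighting. The five effect estimates and standard errors are hypothetical numbers chosen for illustration, not data from any real study, and real meta-analyses often use more elaborate models (random-effects models, for example).

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of per-study effect estimates."""
    # More precise studies (smaller standard error) get more weight.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical effect estimates and standard errors from five small studies.
estimates = [0.30, 0.10, 0.25, 0.05, 0.20]
std_errors = [0.20, 0.20, 0.20, 0.20, 0.20]

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)

# A z-score above 1.96 corresponds to the conventional 5% significance threshold.
for est, se in zip(estimates, std_errors):
    print(f"single study: estimate={est:.2f}, z={est / se:.2f}")
print(f"pooled: estimate={pooled:.2f}, se={pooled_se:.3f}, z={pooled / pooled_se:.2f}")
```

In this made-up example, no single study clears the conventional significance threshold on its own, but the pooled estimate does, because combining the studies shrinks the standard error.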

Weaknesses:

The most important weaknesses of systematic reviews and meta-analyses are things that can often be accounted for and prevented if researchers are careful. As such, we’re addressing these weaknesses by giving you some questions you can ask (when you’re looking at a systematic review and/or meta-analysis) to help you judge whether they’re a problem:

“Are there low-quality studies included?” This question is important because if you put garbage into a systematic review or meta-analysis, you’ll get garbage out. I.e., if the studies being synthesized are mostly bad, then the results of the review or meta-analysis will be too. Good meta-analyses and systematic reviews account for this by excluding low-quality studies before doing their analysis.

“Does it make sense to compare these studies?” Even if the individual studies are of good quality, it might not make sense to compare them. For instance, a meta-analysis about “nudges” has been criticized for including studies measuring the effects of things like (a) writing “Dish of the day” next to a vegetarian option in a cafeteria, and (b) sending reminder texts for people to go to sleep at a certain time. It’s not clear how meaningful it is to (for example) take the average of such different interventions and call that the “effect of nudges.”

“Does the review/analysis address the possibility of publication bias?” Because positive results are often considered more publishable than negative results, academic literature can end up with a bias toward positive findings, which has been shown to lead to overestimations of effect sizes. Again, this is the sort of thing that good meta-analyses and systematic reviews will account for.
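The way publication bias inflates effect sizes can be illustrated with a small simulation. This is only a sketch under loud assumptions: the true effect (0.10), the standard error of each study (0.20), and the rule that only positive, statistically significant results get published are all hypothetical simplifications, not a model of any real literature.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

TRUE_EFFECT = 0.10  # hypothetical true effect size
STD_ERROR = 0.20    # hypothetical standard error of each small study

# Each study's estimate is the true effect plus sampling noise.
all_estimates = [rng.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(5_000)]

# Simplifying assumption: only positive, significant results (z > 1.96) are published.
published = [e for e in all_estimates if e / STD_ERROR > 1.96]

mean_all = mean(all_estimates)
mean_published = mean(published)

print(f"studies run: {len(all_estimates)}, studies published: {len(published)}")
print(f"average estimate across all studies:       {mean_all:.2f}")
print(f"average estimate across published studies: {mean_published:.2f}")
```

Under these assumptions, every published estimate has to exceed roughly 0.39 just to pass the filter, so the published average ends up several times larger than the true effect even though the studies themselves were run honestly.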

Level 2: Randomized Controlled Trials

Definition:

Randomized controlled trials (often referred to as just “RCTs”) are used to compare the results of two or more courses of action (referred to as “interventions”). For example, they can be used to study the effect of receiving a new medical treatment versus receiving the best currently existing treatment. To understand this better, let’s break down the term ‘randomized controlled trial’:

Randomized: Participants in this kind of study are randomly assigned to the different courses of action being compared. An RCT is ‘blind’ if the participants don’t know which group they’re in. It’s ‘double-blind’ if the experimenters also don’t know.

Controlled: The intervention being tested is being compared against what’s known as a ‘control’ group (or control condition). The control might be a placebo, the best or standard current treatment, no treatment, or something else. Having a control group helps researchers determine whether the results of the intervention being tested are likely to be caused by the intervention itself or other factors.

This kind of study is known as the “gold standard” of scientific research. To see why, let’s look at its strengths.

Strengths:

The most important strength is that randomization imbues these studies with the superpower to tell (as long as enough data was collected) whether one thing caused another! That’s because assigning participants to random groups means that the (measured and unmeasured) traits of participants tend to average out, so that the groups are (on average) very similar in every important respect except which intervention they receive. For instance, if you randomize who gets treatment A and who gets treatment B, on average, every attribute of the two groups will be the same (even if there are small fluctuations due to chance). This means that if the outcomes differ for the different groups, you can often conclude, to a high degree of confidence, that it was the treatment that caused the effects rather than other random differences between the groups.
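A short simulation can show how random assignment tends to balance traits across groups, including ones nobody measured. This is a sketch with invented participants: the `age` range and the `hidden_risk` trait are hypothetical stand-ins, not taken from any real trial.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical participants: age is measured; hidden_risk stands in for
# any trait the researchers never recorded.
participants = [
    {"age": rng.uniform(20, 80), "hidden_risk": rng.gauss(0, 1)}
    for _ in range(10_000)
]

# Random assignment: shuffle the list, then split it in half.
rng.shuffle(participants)
treatment = participants[:5_000]
control = participants[5_000:]

age_gap = mean(p["age"] for p in treatment) - mean(p["age"] for p in control)
risk_gap = mean(p["hidden_risk"] for p in treatment) - mean(p["hidden_risk"] for p in control)

print(f"difference in mean age between groups:         {age_gap:+.2f} years")
print(f"difference in mean hidden risk between groups: {risk_gap:+.3f}")
```

Both gaps come out close to zero relative to the spread of each trait, even though the assignment procedure never looked at either one. That is the sense in which measured and unmeasured traits "average out."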

RCTs have another nice feature that's more subtle: because researchers have to decide what the interventions are before data is collected, they have less room (compared to some other study types) to (consciously or not) change aspects of the study design in response to patterns they see in the data. This is another reason RCTs tend to be more reliable than other types of evidence.

Weaknesses:

Weaknesses of this type of study include:

They tend to be very expensive to conduct, especially if you want large sample sizes.

Sometimes they can’t be conducted (e.g., because randomization is impossible or immoral, such as randomizing some parents to be harsher with their children in order to see the impacts of parental harshness on development).

They tend to have very strict inclusion criteria, which means that people like children and pregnant women often can’t participate. This, in turn, means that the groups participating in the study might not be representative of the wider population.

Levels 3 and 4: Cohort Studies

Definitions:

Cohort studies select a group of people (a “cohort”) based on some exposure that they have in common (such as whether they smoke) and look to see what happens or happened to them over time, to investigate whether there's a relationship between the common exposure and specific outcomes (such as whether they develop lung cancer).

A retrospective cohort study is one that does this using existing data (e.g., census data or medical records), whereas a prospective cohort study is one in which the people you’re studying haven’t already developed the outcomes you’re looking at – so, instead of looking at existing data, you track the participants over time and see what outcomes they develop.

Strengths:

Cohort studies are particularly good for studying rare outcomes or outcomes that develop a long time after exposure.

Retrospective cohort studies have the added benefit of being cheap and efficient to conduct (since they use data that has already been collected), which also means they can often have very large sample sizes. However, they are weaker overall than prospective cohort studies because they are more prone to some biases and give researchers less freedom to correct for them.

Weaknesses:

The main weakness of both types of cohort study (as well as all other kinds of observational studies) comes from potential “confounders.” A confounder is a variable that influences both the outcome(s) you’re measuring and the potential cause(s) that you’re investigating, creating a misleading impression about the relationship between them. Confounders can easily give the false impression that there is (or isn’t) a causal relationship between the things you’re looking at.

For example, imagine that a study finds that people who carry lighters with them are more likely to develop lung cancer. It would be a mistake to conclude that carrying a lighter itself directly causes lung cancer. The real explanation is that smokers are both (a) more likely to carry lighters and (b) more likely to develop lung cancer. If the study didn’t also measure whether participants smoked, the researchers would not be able to control for that, so smoking would act as a confounder in that study, creating the false impression that there is a causal relationship between carrying a lighter and lung cancer, even though carrying a lighter itself is not causing the cancer. The problem is that smoking (the confounder) influences both whether someone carries a lighter and whether they get lung cancer, thereby distorting the relationship between lighters and lung cancer.
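The lighter example can be sketched as a simulation. All of the probabilities below are hypothetical numbers picked for illustration. The key assumption built into the code is that cancer risk depends only on smoking and never on carrying a lighter.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def simulate_person():
    """Return (smoker, lighter, cancer) using hypothetical probabilities."""
    smoker = rng.random() < 0.20
    lighter = rng.random() < (0.80 if smoker else 0.05)  # smokers carry lighters more often
    cancer = rng.random() < (0.15 if smoker else 0.01)   # depends on smoking only
    return smoker, lighter, cancer

people = [simulate_person() for _ in range(200_000)]

def cancer_rate(rows):
    rows = list(rows)
    return sum(cancer for _, _, cancer in rows) / len(rows)

# Crude comparison that ignores smoking (what a study without smoking data would see).
crude_lighter = cancer_rate(p for p in people if p[1])
crude_no_lighter = cancer_rate(p for p in people if not p[1])

# Comparison within each smoking group (controlling for the confounder).
smokers_lighter = cancer_rate(p for p in people if p[0] and p[1])
smokers_no_lighter = cancer_rate(p for p in people if p[0] and not p[1])
nonsmokers_lighter = cancer_rate(p for p in people if not p[0] and p[1])
nonsmokers_no_lighter = cancer_rate(p for p in people if not p[0] and not p[1])

print(f"crude:       lighter {crude_lighter:.3f} vs no lighter {crude_no_lighter:.3f}")
print(f"smokers:     lighter {smokers_lighter:.3f} vs no lighter {smokers_no_lighter:.3f}")
print(f"non-smokers: lighter {nonsmokers_lighter:.3f} vs no lighter {nonsmokers_no_lighter:.3f}")
```

In the crude comparison, lighter carriers appear to have a much higher cancer rate. Once smokers are compared only with smokers, and non-smokers only with non-smokers, the apparent effect of carrying a lighter essentially disappears, which is what controlling for a measured confounder is meant to achieve.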

When observational studies attempt to measure whether one thing caused another, there's always the possibility of some unanticipated confounder out there, which would undermine the conclusions of the study. RCTs largely solve this problem through randomization. Because participants are assigned to groups randomly, potential confounders (such as age, income, whether they smoke, personality traits, etc.) should be distributed similarly across the groups. This means that, unlike in observational studies, these factors are not systematically linked to which treatment participants receive. As a result, differences in outcomes between the groups can much more plausibly be attributed to the treatment itself rather than to confounders.

Level 5: Case-Control Studies

Definitions:

Case-control studies take a group of people based on some outcome they share (e.g., they each developed lung cancer) and look backward to see whether, in that group, there is a relationship between that outcome and some common trait or exposure.

You can think of case-control studies as kind of like the reverse of cohort studies; where cohort studies select participants based on a common exposure and look to see whether they have similar outcomes, case-control studies select participants based on shared outcomes and look to see whether they have similar exposures.

Case-control studies are used to identify possible risk factors or causes of the outcome being studied, but they can’t reliably establish causation by themselves. For that, randomized controlled trials are needed.

Strengths:

Typically, these studies are relatively cheap and easy to conduct.
Since these studies can examine lots of different traits or exposures, they can find multiple relationships with the outcome of interest. They’re a good starting place for identifying possible causes or risk factors for the outcome being studied.
Because participants are selected based on having a particular outcome, you need relatively few data points to study even very rare outcomes.

Weaknesses:

Since the groups in case-control studies are selected based on having a particular outcome, this kind of study can’t be used to measure how common that outcome is.
Often, these studies are conducted by asking participants questions about their exposures. This opens these studies up to problems from people misremembering, lying, exaggerating, and so on.
As with cohort studies, there is no way to eliminate the possibility that there’s a confounder out there, which could undermine causal claims from the study.

Level 6: Cross-Sectional Studies

Definition:

Cross-sectional studies, such as surveys, take a snapshot (or cross-section) of a population at one moment in time, by selecting a sample of that population and collecting data about how they are at that moment. For example, such a study might take a snapshot of the US population by asking 1000 US residents questions about their health and their work.

Strengths:

Typically, these studies are relatively cheap and easy to conduct.
They can easily be used to examine the relationships between many things at once.
These studies are useful for estimating the prevalence of things within a population, if they use a representative sample.

Weaknesses:

As with all observational studies, cross-sectional studies cannot establish causation and cannot rule out the possibility of unaccounted-for confounders.
They can't measure changes in populations over time.

Level 7: Case Reports, Case Series

Definition:

Case reports and case series are summaries of a small handful of individual cases, documenting interesting or noteworthy features. Case reports tend to cover just an individual case, whereas case series would typically cover a handful of cases.

Strengths:

Case reports and case series are particularly useful for recording brand new phenomena (e.g., new diseases, new side-effects of a medication) and are very helpful in education, where they can act as richly detailed individual examples of more general phenomena (e.g., doctors use them to practice diagnosis).

Weaknesses:

Their small sample size and lack of a comparison group mean that it is not possible to confidently generalize from their findings.
They have a high risk of bias, due to things like lack of randomization, lack of blinding, bias in who is selected, and so on.

Level 8: Animal Studies, Mechanistic Reasoning

Definitions:

Animal studies test the effects of an intervention or exposure on non-human animals. They often use randomized controlled trial methodology, but they provide weak evidence about what would happen in humans because humans are simply very different from other animals, and so animal studies often fail to generalize. However, there are some cases where animal studies are more reliable as sources of information about humans (e.g., once it has been well proven that a certain mechanism in that type of animal is the same as in humans).

Mechanistic reasoning involves offering a speculative hypothesis about how something works, derived from our current scientific understanding of related mechanisms. For example, consider the statement: “Randomized controlled trials have shown that sleep deprivation increases inflammation, and that inflammation is linked to depression. So there’s strong evidence that improving sleep will reduce depression.” This might look like it’s a claim from the second-highest tier of the evidence hierarchy (it invokes randomized controlled trials), but it’s actually engaging in mechanistic reasoning about the results of strong evidence. So, it’s actually much less robust than it might look.

Strengths:

Animal studies: Researchers more strictly control things like non-human animals’ environments, diets, genetics, and so on, which makes it easier to study some interventions and effects.
Animal studies: Animal studies often use randomized controlled study methodologies, so they can provide strong evidence of causation (within that type of animal).
Mechanistic reasoning: Offers explanations of why things occur, rather than just pointing out that relationships do occur.
Mechanistic reasoning: It may not require collecting any new data.

Weaknesses:

Animal studies: The main limitation to the usefulness of animal studies is simply that non-human animals have very different biology from humans, and so effects seen in non-human animals often don’t occur in humans.
Animal studies: Non-human animals also cannot consent to being tested on, and so there are ethical considerations that make animal studies controversial.
Mechanistic reasoning: Mechanistic reasoning is typically untested and speculative, and proposed mechanisms very often turn out to be incorrect. It can also appear much more robust than it actually is.

How to apply this in practice

It’s quite common to see people in the self-help industry making claims about what’s good or bad for you, based on very low-quality evidence. Entire product lines, health fads, and grandiose advertising claims have been based on a small handful of animal studies, for instance. And you don’t have to look far to find podcasters proclaiming that some diet or activity is good for you because a single study found that it reduced the rates of strokes in rats with heart disease (or something equally ungeneralizable). Hierarchies of evidence can help you by giving you useful heuristics to use when you’re looking at the evidence for claims like those. Typically, if a claim is based on a recent systematic review and meta-analysis, then you can be more confident in its veracity. But if a claim is based on a few animal studies, some mechanistic reasoning, or a few case reports, you can be much less confident.

However, as we have already said, any kind of study can be conducted poorly (e.g., using fraudulent data or a badly-designed methodology). But as long as you are dealing with well-conducted studies, the hierarchy of evidence can usefully guide how much weight to give different forms of evidence.

If you enjoyed this article, you might enjoy trying the free quiz we made in partnership with 80,000 Hours, called “Guess Which Experiments Replicate”. It challenges you to guess which social science experiments (published in top journals) replicated successfully and which did not - just from reading a brief description of their results. Think you’re up to the challenge?
