
How to Read an Academic Paper (and Tell Whether it’s Bullshit)




How good is your bullshit detector? Like anything, scientific literature can vary in quality – but sub-par research can often have a veneer of credibility. What’s more, poorly designed studies with unreliable results are more widespread than many people realize. 


Since the early 2010s, the scientific community has been grappling with what has been termed the replication crisis. A surprisingly large proportion of scientific study results in some fields were found to be hard or impossible to reproduce, meaning that a second research team using identical methods will fail to yield the same outcome. 


Concerns about the replicability of research date back to the late 1960s, but were thrown into the limelight when a now infamous 2011 study claimed to have found evidence of extrasensory perception (or a ‘sixth sense’). Researchers were quick to point out its methodological flaws, but were troubled to discover that many of them were common practice in psychological research. In 2015, a large-scale effort to replicate 100 psychological studies found that just 39% could be successfully replicated (though, if we take into account that some of those findings were probably false negatives, and others may not have been identical to the original study, the true figure may be more like 60%). 


These revelations are very concerning, but should not provoke a blanket suspicion of all science. Science has, without a doubt, been one of the most effective tools that humanity has invented for figuring out the truth. Unfortunately, it doesn't always function as it should and, even when it does, false positives are inevitable.


A better approach than discarding science as a whole is to get better at detecting signs of poor research that can help you become a more critical consumer of scientific literature. We’ve identified four questions you can ask about an empirical academic paper to help you determine its reliability. 



1️⃣ How big was the effect? 


When you see a scientific result, it's useful to ask yourself: "How big was the effect?"


‘Effect size’ is a way of describing how big a statistical result is, such as the strength of a connection between two things or the size of a difference between groups. A study’s effect size is important for helping you determine whether its results have real-world relevance and how trustworthy they are. 


Real-World Relevance

Sometimes, a study identifies a real effect, but it is not large enough to serve any practical purpose. 


An example of a phenomenon with a large effect size is the relationship between gender and height. The average height difference between men and women is relatively large, fairly consistent across populations and readily observable without doing any complicated statistical analysis. By contrast, a hair loss treatment that prevents hair loss in only 5% of people who take it would be considered to have a small effect size (and would be useless for most real purposes).


Scientists will often report effect sizes using a standardized measure, to make them more easily comparable across studies that differ in design. The most common is called 'Cohen’s d'. For instance, if one group is given a treatment for anxiety and another is given a placebo, Cohen's d would be the difference between the two groups' average anxiety scores divided by the variability of those scores (i.e., the standard deviation). In simple terms, it tells us how much the treatment effect stands out compared to the usual spread in scores.
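To make this concrete, here is a minimal sketch of the calculation in Python; the anxiety scores below are invented purely for illustration:

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    # Pooled variance, weighting each group by its degrees of freedom.
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented anxiety scores (lower = less anxious), purely for illustration.
placebo   = [16, 13, 18, 15, 19, 14, 17, 12]
treatment = [12, 15, 11, 18, 10, 16, 13, 17]

print(f"Cohen's d = {cohens_d(placebo, treatment):.2f}")  # about 0.56 here: a medium-sized effect
```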


What counts as a "large effect" can vary based on context, but the following rules of thumb are often used (the short sketch after this list shows how these d values map onto correlations): 


  • d = 0.2: Small effect (corresponding to a correlation of r=0.1). For example, the effect of a study method that raises average test scores by a few percentage points. 

  • d = 0.5: Medium effect (corresponding to a correlation of around r=0.25). For example, the relationship between screen time and sleep quality. 

  • d = 0.8: Large effect (corresponding to a correlation of around r=0.40). For example, the relationship between height and weight.
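The correlations in parentheses follow from a standard approximate conversion between d and the correlation coefficient r (assuming two equal-sized groups):

```python
import math

def d_to_r(d):
    """Approximate conversion from Cohen's d to a correlation r (two equal-sized groups)."""
    return d / math.sqrt(d ** 2 + 4)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}  ->  r ≈ {d_to_r(d):.2f}")
# d = 0.2 -> r ≈ 0.10, d = 0.5 -> r ≈ 0.24, d = 0.8 -> r ≈ 0.37 (close to the rule-of-thumb values above)
```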


A study’s effect size is usually included in its results section or abstract, typically reported as something like "d = 0.45" alongside the main finding.


Trustworthiness

So what can you learn about a study’s reliability from its effect size? First, the biggest effect-size-related red flag is the absence of one. There are some exceptions to this rule – effect size reporting was less common before the 2000s, for example. But for at least the past couple of decades, it has been considered standard practice and is even mandated by many journals. If a study reports a phenomenon as "statistically significant" but the effect size is very small, omitting that number may be an attempt to inflate the finding's importance.


You might think that the larger the effect size, the better. This is true in that larger effect sizes are often more relevant and useful. But there are good reasons to be wary of both very small and very large effect sizes. If an effect size is small enough, this may be an indication that it is due to a fluke or a mistake in the data collection. It may also have very little real-world usefulness. If a treatment only causes a slight improvement in symptoms or only works on a very small percentage of people, it may not be worth taking. Similarly, if a psychological phenomenon is proven to have some effect, but that effect is so small that it barely accounts for any real human behavior, it may simply not be interesting.


On the other hand, excessively large effect sizes may strain credulity. If a study claims to have discovered an extremely strong but hitherto undetected effect, this may arouse suspicion. While treatments or effects may be truly incredible in some rare cases, if something sounds too good to be true, there's a pretty good chance that it is.


So, if you see a study claim that a treatment works or that an effect provides an explanation for some phenomenon, check the effect size. Is it big enough to actually care about? Or is it so large as to be implausible?



2️⃣ How many participants were studied? 


In order to collect accurate and useful data, a study will need to use a sufficiently large "sample size". For studies on humans, the sample size is typically just the number of study participants that the research was conducted on. The smaller the effect size a study is trying to detect, the larger the sample size required. You could detect that men are (on average) taller than women with just 15 men and 15 women (as long as they are drawn at random from the population). But to tell that a potential cancer treatment works – where the recovery rate goes from 50% without the treatment to 55% with the treatment – you would need more than 1500 people to get the treatment and 1500 others to be in the control group. 
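To give a sense of where a figure like '1,500 per group' comes from, here is a minimal power-analysis sketch using the statsmodels library, assuming a conventional two-sided test at a 5% significance level with 80% power (different targets would change the answer):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Effect size (Cohen's h) for a recovery rate of 55% with treatment vs. 50% without.
h = proportion_effectsize(0.55, 0.50)

# Participants needed per group for a two-sided test at alpha = 0.05 with 80% power.
n_per_group = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80, ratio=1.0)
print(f"h = {h:.3f}, participants needed per group ≈ {n_per_group:.0f}")  # roughly 1,550-1,600 per group
```

Because a 5-percentage-point difference in recovery rates is such a small effect, the required sample is large; a bigger assumed effect would shrink it dramatically.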


How can you know if the study you’re reading has used a sufficiently large sample to detect the effect size it reports? This is tricky – some researchers have attempted to provide recommendations for samples that should be used for different-sized effects. But this is not an exact science, and the water is muddied further by the fact that, before they have run an experiment, researchers won’t know precisely what effect size they are looking for. They might make an educated guess based on similar past studies, but the potential for error is clear. In some areas, too-small samples may be systematically undermining the reliability of research. 


We’ve put together some representative numbers for sample sizes you might use in a couple of different research scenarios:


  1. A survey


The table below provides some guidelines for the sample sizes you’d need to conduct a survey estimating what percentage of the population believes something. Assume that participants are simply asked if they agree (1) or disagree (0). When conducting such surveys, what you really care about is the percentage of people who agree in an entire population (everyone in the US, for example), but what you’re actually measuring is the percentage who agree in some subset of those people. This means there is a "margin of error," which indicates how far apart these two figures are likely to be. The more people surveyed, the smaller the margin of error. The margin of error defines a range above and below your sample percentage, and we can be 95% confident that the true population percentage falls somewhere within that range.


For example, suppose a survey finds that 60% of respondents agree with a statement, and the margin of error is ±5%. This means you can have high confidence that the true percentage for the full population is between 55% and 65%, because 95% of the time, the true result (for the whole population) will lie within the margin of error of the measured result (for the sample).


The smaller you want your margin of error to be, the bigger the sample you’ll need.

Margin of error (95% confidence interval)  |  Number of participants (sample size)
±20%  |  25
±10%  |  100
±5%  |  400
±2%  |  2,500
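The sample sizes in the table above come from the standard formula for the margin of error on a proportion at 95% confidence (the margin is widest when the true proportion is near 50%, so that worst case is assumed here); a minimal sketch:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a sample proportion (worst case, p = 0.5)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (25, 100, 400, 2500):
    print(f"n = {n:>5}  ->  margin of error ≈ ±{margin_of_error(n) * 100:.0f}%")
# n = 25 -> ±20%, n = 100 -> ±10%, n = 400 -> ±5%, n = 2500 -> ±2%
```

Note how halving the margin of error requires roughly quadrupling the sample.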

  2. A comparative study


Here are some approximate sample sizes you’d need when comparing two groups (for example, a group who took a drug and a group who took a placebo):


Effect size  |  Total participants required  |  Participants required per group (intervention vs. control)  |  Equivalent to a malnutrition treatment that causes teenagers to end up taller by...
Large (d=0.8)  |  50  |  25  |  2 inches
Medium (d=0.5)  |  125  |  65  |  1.4 inches
Small (d=0.2)  |  800  |  400  |  0.6 inches
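These per-group figures are roughly what a standard power calculation gives for a two-sided, two-sample comparison at a 5% significance level and 80% power. Here is a hedged sketch using the statsmodels library (its outputs differ slightly from the rounded values in the table):

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    # Participants needed per group for a two-sided t-test at alpha = 0.05 with 80% power.
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"{label:>6} effect (d = {d}): about {math.ceil(n)} per group")
# large ≈ 26, medium ≈ 64, small ≈ 394 per group
```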


A well-known example highlighting the importance of sample size is a 2010 study which claimed that ‘power-posing’ increases confidence and risk tolerance, as well as causing hormonal changes (elevated testosterone and reduced cortisol). This study had just 42 participants. A later study used a substantially larger sample of 200 participants, and found that power-posing did lead to increases in subjective feelings of power (we ran our own study that found the same thing on an even larger sample!), but no elevated risk tolerance or hormonal changes. Two years later, an extensive literature review of power-posing studies reached similar conclusions – that it does increase self-reported feelings of power, but that the other effects are probably not real. The power-posing saga is an important case study in the misleading conclusions that can result from insufficient sample sizes.


If a study makes a claim but only includes a small number of study participants, be skeptical. A small number of participants can be useful for pilot studies and qualitative research, but it's rare that firm conclusions can be drawn from small sample sizes in empirical research.



3️⃣ How small is the p-value (i.e., how unlikely is a result this extreme if there is no actual effect)?


The p-value of a research result gives a sense of how easily it could have arisen by random chance. All else equal, the lower the p-value, the less likely it is that the result is a fluke. To be exact, the p-value is the probability (from 0 to 1) that you'd get a result at least this extreme if, in fact, there was no real effect at all. P-values are one of the most commonly misunderstood concepts in science. 
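One way to internalize this definition is to simulate it. In the sketch below (invented data with no real effect at all), roughly 5% of experiments still come out 'significant' at p < 0.05, which is exactly what the definition implies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, false_positives = 10_000, 0

for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution, so any difference is pure chance.
    group_a = rng.normal(loc=0, scale=1, size=30)
    group_b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' results despite no real effect: {false_positives / n_experiments:.1%}")  # ≈ 5%
```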


In most fields, a p-value of 0.05 (or 5%) is considered the threshold for "statistical significance" – that is, the point at which the result is considered unlikely enough to have arisen by chance alone for the study to be publishable. As with effect size, authors should clearly state the p-value of their study (and if you can’t find one, this should be considered a red flag!). It will usually be reported as something like "p = .03" or "p < .05", alongside the test statistic.



The lower a p-value is, the better. A very low p-value means that a study’s results would be very surprising if they were due to chance. For instance, when we looked at psychology studies that other research teams had attempted to replicate, we found that: 


  • When the original study's p-value was 0.01 or lower, about 72% of the papers replicated.

  • When p was greater than 0.01, only 48% replicated.


Exact numbers will likely differ by field, but when all else is equal, the lower the p-value, the greater the likelihood that a result will replicate if another research team runs the study again.


Conversely, it’s useful to keep an eye out for p-values at or just below 0.05. Since this is generally the threshold for publication, authors will sometimes engage in questionable research practices to artificially lower their p-value so that they meet it. This is known as p-hacking, and research suggests that it is quite prevalent. One 2016 study discovered a suspicious over-representation of p-values at or near 0.05 in psychology research, one plausible explanation for which is frequent p-hacking: from the 1990s to 2013, the percentage of papers reporting p-values falling (suspiciously) just below 0.05 increased dramatically, whereas the percentage falling just above the threshold increased far less. 
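A crude way to see why p-hacking works is to simulate it: if researchers measure many unrelated outcomes and report only the best-looking one, the chance of landing below p = 0.05 is far higher than 5%, even when there is no real effect anywhere. A hypothetical sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, hacked_hits = 2_000, 0

for _ in range(n_studies):
    treatment = rng.normal(size=(20, 30))  # 20 unrelated outcome measures, 30 people per group
    control = rng.normal(size=(20, 30))    # no real effect on any of them
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
    if p_values.min() < 0.05:              # report only the "best" outcome
        hacked_hits += 1

print(f"Studies with at least one p < 0.05: {hacked_hits / n_studies:.0%}")  # about 64%
```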




It's important to keep in mind that the p-value is not the probability that a result is false (this is one of the most common mistakes people make when thinking about p-values). If a result has a p-value of 0.01, this doesn’t mean there is a 99% chance that it is real. It simply means that there is only a 1% chance of observing a result at least this extreme due to randomness, if there is truly no real effect. So, a low p-value suggests the result is unlikely to be due to chance alone – but it does not confirm that the result is definitely real or meaningful.


So if you see a study claiming to have found an effect, check the p-value. If there is no p-value at all, that's a bad sign (unless a less common alternative was used, such as Bayesian analysis or confidence intervals). If the p-value is just below 0.05, this warrants some skepticism, as p-hacking may have occurred. And the smaller the p-value, the less likely it is (all else equal) that the result is due to random chance, and the more likely it is to replicate if another group of researchers repeats the study. 





4️⃣ Does the study design allow you to reach this conclusion? 


How a study is designed affects the conclusions that can be drawn from it. One important distinction is between experimental and observational studies. In observational studies, researchers collect data without intervening in an environment. In experimental studies, by contrast, researchers change something (typically by randomizing some elements) and observe what happens as a result.  For example, in an observational study, researchers might track the coffee consumption habits of office workers and their productivity levels, while in an experimental study, researchers would randomly assign participants to drink either normal or decaf coffee and then measure their productivity under controlled conditions.


In the sciences, it's typical to claim that only randomized experimental studies (or 'randomized controlled trials') can tell you whether one thing CAUSES another thing. (Philosophical questions about causation notwithstanding.) This is because researchers need to randomize individual variables in order to isolate a causal effect.


To say that X causes Y (for example, that a treatment causes a condition to improve), it's not enough to simply observe a correlation or association between them. The reason is that a correlation could occur due to:


  • X causing Y (the treatment actually works)

  • Y causing X (people who are likely to get better are more likely to start the treatment for some reason)

  • A third variable causing both X and Y (those who are wealthier are both more likely to try the treatment AND more likely to improve, but the improvement is not caused by the treatment; see the simulation sketch after this list) 

  • A cyclical relationship between X and Y (those who take the treatment are more likely to improve, and this improvement causes them to increase their dosage of the treatment)
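To make the third possibility concrete, here is a hypothetical simulation (wealth, treatment uptake and improvement are all invented variables) in which a confounder produces a solid correlation even though the treatment has no effect at all:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical confounder: wealth drives both treatment uptake and improvement,
# while the treatment itself does nothing.
wealth = rng.normal(size=n)
took_treatment = (wealth + rng.normal(size=n) > 0).astype(float)
improvement = wealth + rng.normal(size=n)  # note: no treatment term at all

r = np.corrcoef(took_treatment, improvement)[0, 1]
print(f"Correlation between treatment and improvement: {r:.2f}")  # ≈ 0.4, despite zero causal effect
```

Randomizing who gets the treatment would break the link between wealth and treatment uptake, and the spurious correlation would disappear.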


There are many real-world cases of causal conclusions being incorrectly drawn from observational data. For example, scientists in the 1980s and 1990s observed through studies such as the Nurses' Health Study (which tracked health outcomes in over 120,000 female nurses across several decades, beginning in 1976) that women who took hormone replacement therapy (HRT) had lower rates of heart disease compared to those who didn't. In an influential 1985 study, researchers found that postmenopausal women taking estrogen had significantly lower rates of heart disease, which (along with other studies in a similar vein) led to widespread adoption of HRT for cardiovascular protection. But in 2002, a randomized trial by the Women's Health Initiative not only contested the cardiovascular benefits of HRT but actually found it increased the risk of heart disease. Some researchers have suggested that women who chose to take HRT tended to be wealthier and to have better access to healthcare – factors with well-documented effects on heart disease – rather than the HRT itself being protective. In other words, it may be the case that higher socioeconomic status both causes women to be more likely to take HRT AND improves health, creating an apparent association between HRT and health.


This isn't to say that all benefits of HRT have been conclusively disproved. HRT remains an important treatment for menopausal symptoms, and ongoing research continues to investigate its effects on various health outcomes. The point is that the observational studies hypothesizing cardiovascular benefits of HRT (such as the 1985 Nurses' Health Study analysis referenced earlier) were not designed in a way that could provide much evidence for this theory. Yet the idea that HRT protected against heart disease influenced public discourse and prescribing practices.


It isn’t always the case that such misconceptions arise because the authors of a study deliberately misrepresent their findings. They may simply make claims that are not appropriately caveated, and those claims may then be misinterpreted by readers or sensationalized by the media. But even if study authors are careful not to make concrete causal claims, and only present hypotheses as suggested (not confirmed) by the data they have collected, such nuance rarely survives the dissemination of research into the public sphere.


So, if you see a study cited as evidence that X causes Y, check what kind of study it was. If it wasn't a randomized controlled trial (or a similar design that randomized a variable of interest – or one of a handful of clever tricks that make randomization unnecessary), then it is unlikely to have demonstrated that X causes Y. Remind yourself that it could be that Y causes X, or that some third thing causes both X and Y.



But what about…


So far, we’ve explained four factors that you can use to help determine the reliability of a scientific study – effect size, sample size, p-value and experimental design. There are a few others that people often look to, whose importance we think may sometimes be overstated.


…citation count?


It’s tempting to look at the citation count of a paper (or of its authors) to see how trustworthy the study is likely to be. This isn’t a totally meaningless signal – a high citation count means that the work has been engaged with by a large number of other academics. But of course, not all engagement is good engagement! Particularly infamous papers may even have high citation counts because they have been frequently cited by authors pointing out their flaws. Very sensational papers may also receive more citations – but the most sensational results are not necessarily the most accurate (and may actually be less likely to be true).



…journal reputation?


Journal reputation is another useful but commonly over-weighted signal. There is an underbelly of low-tier journals with very low credibility. If a study is published in one of these, that probably does tell you that it's less likely to be reliable research.  


There is, however, less of a meaningful difference between being published in a mid-tier or a top-tier journal. Some research has suggested that journal rank is a bad predictor of quality. One particularly large study collected data from over 45,000 papers and found that journal rank and citation count did not reliably correlate with replicability and evidential value. Higher-tier journals may be more likely to publish highly novel or surprising results because those are often what get the most attention – but novelty and surprise can both be indicators that a study is less likely to replicate.



…pre-registration status?


Authors can pre-register a study by publicly documenting their research plan before conducting it. Whether a study has been pre-registered can be a very helpful indicator of accuracy because it holds researchers accountable to their original plan. If they make any changes along the way, they need to provide a justification, which helps reduce the possibility of p-hacking or other practices that might lead to misleading results.


On the other hand, the vast majority of papers don't have pre-registration, so this signal can't be used in most cases. Additionally, pre-registered research plans are sometimes vague, and there are well-documented ways of exploiting loopholes in the process.


 

There is no silver bullet for determining whether a study’s results are reliable. Science is incredibly powerful as a process for understanding the world. Rather than dismissing it, we should learn how to identify what's good science and what's bad science.


So, the next time you hear about the results of a study, consider asking these questions:


  1. How big was the effect? Was it large enough to care about, or too large to be believable?

  2. Did it include enough study participants to draw reliable conclusions?

  3. Was the p-value small enough to be confident the effect was not just the result of random chance – and was it suspiciously close to p=0.05?

  4. If the claim of the study is that X causes Y, was it an experiment where they randomized X, or did they merely show that X and Y are associated? 


These questions are powerful tools in your ‘academic bullshit detector.’
