Fundamental Concepts of Statistics, Explained Through Disaster
- Spencer Greenberg & Markus Over

We are surrounded by data. More and more of our world is getting measured, captured, tracked, and analyzed. Ideally, this growing body of data helps us make interesting and useful discoveries, but that requires knowing how to interpret it. Unfortunately, most people have very little familiarity with data science, which limits their capacity for understanding. This article gives you some of the tools to become an exception.
We're going to break down some of the most important basic concepts of data science and statistics - concepts worth understanding whether you're a seasoned data wiz or you've never touched a spreadsheet in your life (and don't plan to). We'll explain basic (yet essential) ideas such as mean, mode, standard deviation, p-values, regression, and more.
Knowing these concepts can help you better understand research, catch misleading statistics, and make informed decisions about everything from health to personal finance. More than that, they let you go beyond headlines and third-party takes, allowing you to go to the source and make your own observations about the data.
To illustrate each important concept, we'll walk you through an analysis of data from one of the most famous disasters in human history - literally answering questions of life or death.
We've also built something to make data science more accessible to everyone. We're excited to introduce an ambitious new tool, which you can use for free, right in your browser: Hypothesize.io. It makes data science easy by helping you run many types of analyses on data and create helpful visualizations, regardless of your level of experience. It also enables you to share your analyses with others while keeping your data secure. Note that Hypothesize provides the smoothest experience on a laptop or desktop PC and is not optimized for mobile devices. To try it, visit Hypothesize.io.
To make this article interactive, we've set it up so you have the option to follow along in Hypothesize. Use this link to open a project with the dataset we'll be working on, and it will guide you step by step through the process. Of course, if you prefer, you can also try Hypothesize with any other dataset you already have.
Data of a Tragedy
The story behind the data is one that shocked the world. On the frigid night of April 14, 1912, lookouts in the crow's nest of RMS Titanic peered into the moonless darkness 37 meters above the North Atlantic, scanning for obstacles. It was the ship's fourth night at sea, and the "unsinkable" vessel was cutting through calm waters at nearly full speed. Given the limited visibility, the two lookouts on duty at the time had requested binoculars multiple times, but to no avail. At 11:40 PM, crew member Frederick Fleet spotted a massive shadow materializing directly ahead: a towering iceberg. "Iceberg, right ahead!" came his shout to the bridge, but the frantic attempt to reverse engines and turn hard to port came too late. The collision lasted less than a minute, but it breached the hull below the waterline, leaving a 300-foot-long gash. By 2:20 AM, the ship had vanished beneath the North Atlantic waves, taking 1,514 souls with it, while 710 survivors found refuge in lifeboats.
This tragedy created one of history's most complete records of a maritime disaster. The meticulous documentation of passengers and crew members offers a chance to understand the factors that influenced survival during those critical hours. By examining the data, we get a snapshot of that period of history that can help us understand it better. Who survived the sinking of the Titanic? Can we meaningfully predict the survival of individuals based on demographic factors alone?
Later in this post, we'll explore this question in greater depth, uncovering meaningful patterns that can help us understand the event and that moment in history. But the first step in data science is to understand the data you are working with.
This analysis will draw on the titanic3 dataset (courtesy of the Vanderbilt University Department of Biostatistics) that we've adjusted slightly for easier usage. The dataset contains detailed information on 1,309 passengers from the ship's ill-fated maiden voyage. For each of them (with one exception we'll get into), we have the following data:
Passenger Name: the passenger's name (which we'll ignore)
Biological Sex: the person's biological sex, recorded as either M or F
Age: the passenger's age (sometimes a whole number, sometimes with a fractional part)
Ticket price: the price the passenger paid for their ticket
Ticket class: this refers to the ship's three ticket classes, recorded as 1 (for First Class), 2 (for Second Class), or 3 (for Third Class). Comfort and cost descend from First to Third
Survived: recorded as 1 for those who survived and 0 for those who passed away
This is what the variable summary looks like in Hypothesize (which, again, you can try out yourself here):

To build a better sense of the dataset, it helps to begin with a simple overview. One natural question is: who was actually on the Titanic? Looking at age gives us a first glimpse into the people behind the numbers. Were the passengers mostly concentrated around one stage of life or spread broadly across generations?
Notably, the Age data in our dataset is not quite complete: as you can tell from the table above, the "Value count" of Age is only 1046, meaning that for the remaining 263 passengers in the dataset, the age was not known at the time this dataset was created. Nonetheless, it's a good starting point for an initial look into the data.
If you're following along in Hypothesize: to find out more about the ages of these 1046 passengers, click New Analysis and select "Understand a Variable". After progressing through the remaining sections, you'll end up on the Results page. The linked tour will guide you through all these steps.
The first thing we notice is the age distribution plot:

This plot from Hypothesize illustrates the age distribution among Titanic passengers. Each green dot on the x-axis represents an individual passenger, but with so many data points and overlapping green dots, the pattern is hard to interpret based on these alone. To help with this, a light blue distribution overlays the chart, showing how common each age is relative to others. The precise y-axis values aren't our focus - it's the overall shape that matters, revealing which age groups were more or less prevalent. The higher the blue curve for a given age, the more passengers of that age were present on the Titanic.
For example, most passengers seem to fall between 15 and 50 years old. There are a few older individuals, and a noticeable bump under age 15, likely reflecting families traveling with children.
While this distribution tells us a lot, it's too detailed for many practical uses or clear communication. That's why data scientists often summarize data like this with a single number - which leads us to our first key insight: every question you ask about data hides a conceptual question, namely what you're really asking and what you really want to know. When we ask "How old were people on the Titanic?", we might mean many different things. Hypothesize presents several possible answers in the Statistics table beside the plot:

Most notably, it reports:
Mean: 29.9
Median: 28
Mode: 24
Each of these numbers provides an answer to the question of which age is most representative of passengers on the Titanic, but they do so in different ways, and each has its own use cases. You've probably heard of mean, median, and mode before, but each contains nuances that are crucial to understand. Could you say, for instance, under which circumstances each of the three should be used? These are our first three critical data science concepts that are important for just about everyone to understand:
The mean
Sometimes called just the "average", this is the sum of all values divided by the number of values. This is by far the most commonly used of the three. One way to understand it is as a way of spreading the total amount of something evenly across a population.
One reason this is useful is that it can be used to make estimates for any size population. For instance, if you know the mean ticket price of 100 people who boarded the Titanic is $100, you can use that to estimate how much it would cost for 1000 people to buy tickets ($100 x 1000) or for 5 people ($100 x 5).
In our case, the mean age tells us how many years of age there are, per person (which is a bit of an odd concept when we're talking about age). In more technical terms, the mean is the value that minimizes the sum of squared differences from all points - it's the "closest fit."
But there's a drawback to using the mean: it's sensitive to outliers. While outliers are typically not a huge issue with metrics such as age, where the range is limited, suppose there was a technical error, listing a single 10-year-old passenger as 10,000 years old. In that case, the mean of the whole distribution would jump from 29.9 to 39.4 - all because of a mistake in a single data point!
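To make that jump concrete, here's a minimal sketch of the arithmetic in Python, using the dataset's reported summary figures (1046 known ages, mean 29.9); the exact result depends on the unrounded values in the real data:

```python
# Sketch of the outlier effect, using the reported summary figures
# (1046 known ages, mean 29.9); exact values depend on the real data.
n = 1046
mean_age = 29.9

# Replacing one 10-year-old with a 10,000-year-old adds (10000 - 10)
# years to the total, spread evenly across all n passengers.
shift = (10000 - 10) / n
print(mean_age + shift)  # ~39.45; with the unrounded dataset mean, the 39.4 above
```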
When to use it: when you want to scale a property up or down to a different population size (for example, estimating the Titanic's potential revenue for different numbers of ticket sales), or when you need an average that reflects the total amount (such as comparing per-person wealth across an entire country).
The median
This is the "middle" value that you would obtain by sorting all your data points and then choosing the central one. In other words, given that the median age is 28, that means that about half the people on the Titanic were younger than 28, and about half were older than 28. One could say that the median gives you a very "typical" value, in the sense that equally many values will be below it as are above it. Visually, you can think of the median as the point on a distribution plot where a vertical cut would divide the area (light blue in the plot above) in half, with 50% on each side.
Unlike the mean, the median is highly robust to outliers: if one passenger were accidentally listed as 10,000 years old instead of 10, the median would remain unchanged at 28 in our dataset.
Mathematically, while the mean is the special number that's closest to the center in terms of minimizing the sum of squared distances to all the values, the median is closest in terms of minimizing the sum of absolute distances to all the values. These are two distinct but very reasonable ways of looking at what number is "closest" to all the values.
When to use it: when you want a representative middle point that isn't distorted by extremes. A classic case is income: the median better reflects what most people earn than the mean, which can be pulled upward by a small number of very wealthy individuals.
The mode
This is the most common value. In this case, the mode being 24 years old means there were more 24-year-olds on the ship than any other age. On a plot like the one above, the mode is simply the highest point (where the blue reaches its maximum height).
Like the mean and median, the mode can also be thought of as the number that's closest to all the values, except this time, instead of using the squared distance or absolute value of the distance, the distance is based on exact matches only. In other words, the mode is closest to all the other values in the sense that it exactly matches as many of them as possible.
When to use it: when you care about the most common value. For example, when a shoe store decides how to stock their shoes (or what to advertise with) they may care much more about the mode than the mean or median.
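If you'd like to compute all three yourself, Python's standard library has them built in. A minimal sketch, using placeholder ages of our own rather than the real Age column:

```python
import statistics

# Placeholder data; in practice, load the Age column from the dataset.
ages = [2, 19, 24, 24, 24, 28, 31, 36, 45, 60]

print(statistics.mean(ages))    # sum of all values divided by the count
print(statistics.median(ages))  # the middle value after sorting
print(statistics.mode(ages))    # the most common value (24 in this sample)
```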
Each of these three measures - mean, median, and mode - tells us something about the "center" of the distribution. Which one you use can really matter. For instance, if we think about household incomes in the U.S., taking data from 2019, the mean is $98,088, the median is $68,703, and the mode is around $30,000. Here, the mode, median, and mean are noticeably far apart, unlike in our Titanic dataset, where they were much closer. As a rule of thumb, these three measures align more closely the more symmetric and unimodal (meaning it has only one peak, rather than several) a distribution is. Household income, however, is often highly skewed because there is no real upper limit - it can grow orders of magnitude beyond what is typical. This creates what is known as a "heavy-tailed" or "long-tailed" distribution, where a significant share of total income lies in the "tail", far away from the bulk of the distribution.
A surprising mathematical fact (for the mathematically inclined): we mentioned above that the mean and median minimize the sum of squared distances and absolute distances, respectively. But there's actually a mathematical unification of all three concepts, including the mode: as it turns out, they are the different values of $m$ that minimize $\sum_i |x_i - m|^p$ for different values of $p$: 2 for the mean, 1 for the median, 0 for the mode. Here, $x_i$ represents all the data points (such as the different age values). This also illustrates how there are actually far more - in fact, infinitely many - ways to summarize a distribution as a single number: you could use any other exponent besides 0, 1, or 2 to find the "closest fit" of a distribution. They might just end up being less intuitive and probably less useful for most practical use cases.
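For those who want to see this unification in action, here's a brute-force sketch on toy data of ours (with the convention that $0^0 = 0$, so exact matches cost nothing):

```python
# Brute-force check that p = 2, 1, 0 recover the mean, median, and mode.
xs = [2, 19, 24, 24, 24, 28, 31, 36, 45, 60]

def loss(m, p):
    # |x - m|^p summed over all points, treating 0**0 as 0.
    return sum(0 if x == m else abs(x - m) ** p for x in xs)

candidates = [c / 10 for c in range(0, 701)]  # 0.0 to 70.0 in steps of 0.1
for p in (2, 1, 0):
    best = min(candidates, key=lambda m: loss(m, p))
    print(p, best)
# p = 2 gives the mean (29.3); p = 0 gives the mode (24.0); for p = 1, any
# value between the two middle points (24 and 28) minimizes the loss, and
# the grid search returns the first such value.
```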
However, to fully understand a distribution, it's not enough to know where its center is - we also need to know how wide it is. The three most common ways to think about how wide a distribution is are:
The range
This is the difference between the highest and lowest values for the thing you're looking at in your dataset. In this case, that means it's the difference between the oldest and the youngest person's age. The minimal and maximal ages from the dataset are 0.17 and 80, respectively, so the range would be 79.83. This can be good to know, but by definition, it is strongly affected by outliers. Imagine our oldest person was 80, but the second-oldest was only 55. In that case, our range would be quite misleading!
The standard deviation
This is a much smoother measure of a distribution's width than the range, as it takes all data points into account. It can be interpreted as, roughly, the average distance of the data points from the mean. (Technically, it's the square root of the average of the squared distances to the mean.)
For our Age distribution, the standard deviation is 14.41, which gives us a rough idea that the majority of passengers are somewhere in the range of 29.9 (the mean) ± 14.41 years. We know that, mathematically, for a real normal distribution, 68% of data points would be within one standard deviation of the mean - in our case, this is not quite true, as the Age curve is not a normal distribution, but the standard deviation is still a useful summary: it captures overall variability in a single number, letting us compare the spread between different datasets or variables, even when they are not perfectly normal.
The interquartile range (IQR)
This is the spread of the "middle 50%" of values. If you sort all passengers by age and divide them into four equally-sized groups (quartiles), the IQR is the range between the lower and upper quartiles - essentially, the span of the two middle groups. In our dataset, the IQR is 18, meaning that half of the passengers fall within an 18-year age range, from 19 to 37, centered close to the median of 28. It's handy because it zooms in on the heart of the data, ignoring the extremes that can stretch or skew other measures of spread.
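All three width measures are just as easy to compute. Another sketch on the same placeholder data (again ours, not the real Age column):

```python
import statistics

ages = [2, 19, 24, 24, 24, 28, 31, 36, 45, 60]  # placeholder data

print(max(ages) - min(ages))   # range: oldest minus youngest
print(statistics.stdev(ages))  # sample standard deviation

# Interquartile range: the span of the middle 50% of the values.
q1, _, q3 = statistics.quantiles(ages, n=4)  # quartile cut points
print(q3 - q1)
```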
Exploring Hypotheses
Let's return to our original question: Who survived on the Titanic?
One hypothesis is that younger people were more likely to survive, perhaps because their safety was prioritized. But the opposite is also plausible: maybe young people were less able to fend for themselves and therefore less likely to survive? How would we know if either of these hypotheses were true?
We can look at the association between age and survival. By starting a new analysis in Hypothesize, selecting "Study the association between pairs of variables" and then selecting options accordingly, we obtain the following plot, which shows an approximation of age (on the x-axis) against survival rate (on the y-axis). The green data points show each individual person:

Based on this line, it appears that the survival rate of very young people, at least up to about age 10, was indeed above average in our dataset. The youngest group seems to have had about a 70% chance of survival, compared to closer to 40% for older individuals. But can we quantify the overall effect of age on survival across all age ranges?
One way to do so is to compute the correlation between age and survival.
Correlation
We've written about correlations in detail before, but in short, a correlation is a general measure of the association between two variables. It quantifies to what degree one moves with the other (so to speak). A correlation of 0 indicates that there is no (linear) relationship between the two variables, whereas a correlation of 1 (the maximum) means that increasing one variable always leads to a fixed increase in the other. A negative correlation means that while one variable increases, the other one decreases, on average. So, if the hypothesis that younger people had a higher survival rate was true, we would expect the correlation between age and survival to be negative (because the lower the age, the higher the survival rate).
Hypothesize reports a correlation of -0.056. Being quite close to 0, this tells us that the association is weak across the full age range, despite the spike in survival rates for the youngest cohorts.
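In code, computing such a correlation is a one-liner once the two columns are loaded. A sketch assuming NumPy arrays named `age` and `survived` (the names and placeholder values are ours; rows with missing ages would need to be dropped first so the arrays line up):

```python
import numpy as np

# Placeholder columns standing in for the real Age and Survived data.
age = np.array([22, 38, 26, 35, 4, 58, 2, 27, 14, 40], dtype=float)
survived = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)

# Pearson correlation between the two variables.
r = np.corrcoef(age, survived)[0, 1]
print(r)  # the full dataset yields about -0.056
```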
But is this small negative correlation just a fluke of this particular dataset? Or is it the result of a genuine effect?
The p-value
Even if there was no actual effect of age on survival, we might occasionally get correlations of this size (-0.056) simply due to chance alone. For that reason, it's important to estimate the likelihood of finding a result at least as extreme as this one, assuming the relationship between the two variables that we're testing for isn't real. That's what a p-value is. It tells us how likely we are to find a result at least as extreme as the one we've measured, simply due to chance, when there is no actual relationship.
P-values are probabilities, which means they range from 0 to 1. The closer they are to 0, the less likely an observed effect is to have occurred as a result of random chance alone.
It's very common to misunderstand what a p-value really tells you, even among scientists who can state the technical definition. It doesn't tell you how likely it is that the effect you've measured is real, or how likely it is that your hypothesis is true. So let's say it again: a p-value tells you how likely you'd be to get a result at least as extreme as yours, by random chance, assuming the effect you're testing for isn't real.
Of course, this also means that the lower the p-value, the rarer your data would be if there were no effect - so lower p-values are some evidence of a real effect. But we need to be careful not to conflate the true meaning of the p-value (the chance of observing an effect at least as extreme, given there is no real effect) with the question of how likely the observed effect is to be real. The latter, unfortunately, is usually the more interesting question, but it is not straightforward to calculate.
So, what is the p-value of our observed correlation of -0.056? This can be seen in the table:

The p-value of 0.073 is pretty close to 0! This means, assuming there was no effect, we'd find a correlation at least that strong in only about one out of 14 cases (1 / 0.073).
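One intuitive way to see where such a p-value comes from is a permutation test: shuffle the survival column many times, which destroys any real age-survival link, and count how often the shuffled data produces a correlation at least as extreme as the observed one. This is our illustration of the concept on placeholder data, not necessarily how Hypothesize computes it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder columns standing in for the real Age and Survived data.
age = rng.normal(30, 14, size=1046)
survived = (rng.random(1046) < 0.38).astype(float)

observed = np.corrcoef(age, survived)[0, 1]

n_shuffles = 10_000
extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(survived)  # break any real relationship
    r = np.corrcoef(age, shuffled)[0, 1]
    if abs(r) >= abs(observed):           # "at least as extreme", two-sided
        extreme += 1

print(extreme / n_shuffles)  # approximates the p-value of the observed correlation
```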
Statistical Significance
It is common practice in many areas of science to judge results by whether or not their p-values are below 0.05: only when they are do we call them "statistically significant". A result that isn't statistically significant is unlikely to be deemed noteworthy enough to be published. But 0.05 is not a magical threshold - it's just a widespread convention. In some areas of science, other numbers are used: for example, some medical trials use 0.01, and particle physics typically uses a value close to 0.0000003. And even if a result is statistically significant, it might not be worth caring about if the effect size is too small. So, statistical significance is not the same as practical significance.
Even though it does appear that children in particular had higher survival rates on the Titanic, the linear effect across the full range of ages is too small and does not reach statistical significance in our data (as the p-value of 0.073 is above 0.05).
We have now examined one specific association, but we can gain further insight into who survived by considering multiple associations at once. How can we do that? Well, that's where linear regression comes in.
Linear Regression
Imagine you want to predict people's income based on their age. When you have data on this, you can plot the points in a 2D scatter plot, draw a line that best fits them, and then read off an expected income for someone of a given age. If you add more variables - say years of education - the same idea still works: instead of a line through a 2D plot, the model finds a flat surface (a plane, or in higher dimensions a hyperplane) that best fits the data, typically by minimizing squared errors. This is linear regression. It estimates the association between each input and the outcome while accounting for the linear relationships with other measured variables. On the Titanic, for example, it lets us estimate how different personal traits relate to the survival rate while accounting for the other traits.
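For the mathematically inclined, fitting a linear regression boils down to a least-squares problem. A minimal sketch with NumPy, using the income example above (all numbers are made up for illustration):

```python
import numpy as np

# Toy inputs: each row is one person's [age, years of education].
X = np.array([[25, 12], [32, 16], [47, 10], [51, 18], [38, 14]], dtype=float)
y = np.array([30_000, 48_000, 52_000, 76_000, 58_000], dtype=float)  # incomes

# Prepend a column of ones so the model can learn an intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Coefficients that minimize the sum of squared prediction errors.
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # [intercept, age coefficient, education coefficient]

# Predicted income for a new person: age 40, 15 years of education.
print(np.array([1.0, 40.0, 15.0]) @ coef)
```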

Predicting Survival
Let's now create one unified model that predicts which passengers of the Titanic survived based on several factors. Looking at our dataset, the following seem like particularly promising candidates:
Biological sex. Social expectations at the time may have played a role in who survived; did these norms result in women being deprioritized during evacuation, or did they lead to preferential treatment under the principle of "women and children first"?
Ticket price. Maybe wealthier passengers were more likely to have access to lifeboats - but, on the other hand, that also may depend on whether lifeboats happened to be further or closer to the first class accommodations.
Age. As we saw before, age may have had some (weak) relation to survival, but it didn't reach the common <0.05 threshold for statistical significance. Still, we'll include it in the regression model, as it allows us to consider the role of those other factors after "adjusting for" age.
Based on these three variables, we can now create a simple model to predict whether any given person was likely to survive or not. What would you expect we'll find? Do you think that sex and ticket price will matter, and if so, will they be associated with higher or lower survival?
After setting up this predictive model in Hypothesize, we obtain the chart below, which shows how age, ticket price, and sex each relate to survival above and beyond the other two factors. The blue and gray vertical ranges are confidence intervals, indicating the level of uncertainty in each estimate. While the exact values on the y-axis require some statistical interpretation, a useful shortcut is this: the farther a factor's estimate lies from 0 (the dashed line), the more strongly it predicts survival.

And we obtain the following results:
| Predictor | Linear Regression Coefficient (Independent Variables Standardized) | p-Value |
| --- | --- | --- |
| (Intercept) | 0.73 | |
| Age | -0.03 | 0.048 |
| Ticket Price | 0.08 | <0.001 |
| Biological Sex = male | -0.51 | <0.001 |
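If you'd like to reproduce numbers like these outside of Hypothesize, one option is an ordinary least squares fit with the predictors standardized first. A sketch assuming a pandas DataFrame with columns named `age`, `fare`, `sex`, and `survived` (the file name and column labels are our assumptions; the real ones may differ):

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names; adjust to match the actual dataset.
df = pd.read_csv("titanic3.csv").dropna(subset=["age", "fare"])

X = pd.DataFrame({
    "age": df["age"],
    "fare": df["fare"],
    "male": (df["sex"] == "male").astype(float),
})
X = (X - X.mean()) / X.std()  # standardize each predictor
X = sm.add_constant(X)        # add the intercept term

model = sm.OLS(df["survived"], X).fit()
print(model.params)   # coefficients, comparable to the table above
print(model.pvalues)  # p-values for each predictor
```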
What these values tell us:
Sex has by far the strongest effect (since the coefficient is much further from 0 than the other two), and is clearly statistically significant with a p-value much lower than 0.05.
For age, as expected after our earlier isolated analysis, we get a weak negative coefficient, meaning younger people had a higher chance of surviving even after accounting for sex and ticket price. And this time, with its p-value of 0.048, it even passes as (barely!) statistically significant.
Ticket price has a slight positive coefficient, indicating that passengers who paid more for their tickets also had better chances of making it onto a lifeboat. It's highly statistically significant, so this, like sex, seems like a robust finding.
Of course, these numbers don't tell us why these factors might be linked to survival. They just tell us that, in this dataset, certain patterns hold. Women and people who paid more for their tickets were, on average, more likely to survive, while age was less (but still slightly) predictive. The model tells us that the best way to predict whether any given passenger survived (in this dataset, using these traits) is to give the most weight to sex, then to ticket price, and less to age.
How good are predictions made this way? Hypothesize reports an R score of 0.56, meaning the linear regression's forecasts correlate with the actual survival data at that level. Since the R value represents a correlation, it can range from -1 to 1, where a score near 0 would be equivalent to random guessing. By comparison, 0.56 reflects a fairly strong relationship between the model's predictions and real outcomes.
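Continuing the regression sketch from above, that R score is simply the correlation between the model's fitted values and the actual outcomes:

```python
import numpy as np

# model and df carry over from the regression sketch above.
predictions = model.fittedvalues
print(np.corrcoef(predictions, df["survived"])[0, 1])  # Hypothesize reports ~0.56
```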
Should we be content with these results, or might we be able to get this R score closer to 1, thereby making perfectly accurate predictions, by refining our model - perhaps by incorporating additional factors? In practice, perfection often eludes such predictive models for several reasons:
It could be that the evacuation attempts just involved some level of inherent randomness that we can't realistically expect to be able to model at all.
Some potentially predictive variables, like the passengers' health status or their precise location on the ship at the moment of collision, are not part of the dataset.
Another explanation can be found in the limitations of how such models operate: linear regression models assume linear relationships between variables - but this is, of course, not always true, or even a good approximation! For instance, looking at our Age chart from earlier, the survival curve does not drop linearly with age; rather, it looks like children in particular received special treatment. Had we added a separate "child" variable to clearly distinguish children from non-children, that would have helped the regression model (see the sketch below). But given that we used age directly, our model naturally struggled with this nonlinear effect. Such models can still be very powerful tools for predicting relatively complex patterns in many cases.
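If you wanted to test that idea, adding such a variable to the regression sketch above is a one-line change (the age-13 cutoff is our arbitrary choice for illustration):

```python
# A binary "child" indicator lets the model capture the special treatment
# of children instead of forcing a straight-line age effect.
# The age-13 cutoff is an assumption for illustration only.
df["child"] = (df["age"] < 13).astype(float)
# ...then include "child" as a fourth predictor and refit as before.
```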
Correlation and Causation
With statistical relationships scattered throughout this analysis, let's get one fundamental caveat out of the way: correlation does not imply causation. It's a point worth reinforcing, even if we have mentioned it many times before (e.g., in our article explaining correlations, or in Are you being misled by statistics?), especially when patterns can be so seductive.
Everything we've done so far has been purely observational: we applied some of the tools of data science to an existing dataset that describes the state of affairs on the Titanic. But to really know what causes what, one often needs to do more than observe: one needs to intervene in a system to measure what changes, such as in randomized controlled trials. So, even though we've seen statistically significant relationships, we must be careful with drawing conclusions about why or how these effects are present in the data, and what stories we tell about them.
For instance, we saw that higher ticket prices were slightly predictive of better survival rates. But there may be many possible explanations for such an observation, and the data we have makes it difficult to figure out how or why this relationship came to be. In particular, this means that we can't simply assume that, for any given person who died on the Titanic, they would have had a higher chance of survival had they bought a more expensive ticket. But it does hint at causality, and at the very least gives us a hypothesis to consider. As xkcd creator Randall Munroe once put it:
"Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'."

All that being said, it is, of course, inherently difficult to answer causal questions about what happened on the Titanic simply because it's a past event, so we have no way of testing causal claims in that same setting. Still, we have derived many interesting insights from the data, including:
Both women and children had higher survival rates than the other passengers, indicating that a "women and children first" policy may have been applied while filling the lifeboats.
Passengers who paid more for their ticket had higher chances of making it onto a lifeboat, whatever the reason for this may be.
These three factors (sex, age, and ticket price) alone enabled our predictive model to achieve an impressive correlation of 0.56 with passengers' actual survival outcomes.
Wrapping Up
We have covered a lot of ground. We've discussed distributions, three types of "central points" of distributions (mean, median, and mode), three ways to quantify the "width" of a distribution (range, standard deviation, and interquartile range), correlations, p-values, and statistical significance, and even a linear regression model. Do you remember what all these concepts mean? Take a moment and try to recall what we covered.
Here's a quick recap:
Distributions: We looked at the age distribution of passengers on the Titanic. A distribution plot allowed us to get a sense of how prevalent different age ranges were, relative to each other, among the passengers.
Three "central points" of a distribution:
Mean: The sum divided by the count gives us a "per-unit" view of a variable
Median: What you obtain after sorting all values and then choosing the one in the center; can be interpreted as a "typical" example from the distribution that is less skewed by outliers than the mean
Mode: The most common value, or where the distribution curve has its highest peak
Three "widths" of a distribution:
Range: The distance between the highest and the lowest value of a distribution
Standard deviation: A way to quantify the distance of the data points from the mean
Interquartile range: The range of the "middle 50%" of the distribution
Correlation: A metric for the degree of association between two variables, ranging from -1 to 1
p-Value: A number between 0 and 1, telling us how likely it is to observe an effect at least as extreme as a given one (such as a given correlation) if there is no actual association between the variables. It can, for example, tell us how likely a correlation at least that strong is to occur by chance alone.
Statistical significance: A convention in some fields of science, where we call p-values < 0.05 "statistically significant", as a standardized way to classify observations as "unlikely to be caused by chance alone"
Linear regression: A simple, linear model that can combine multiple factors to predict a variable (such as the survival rate for different types of people on the Titanic)
Understanding these core statistical concepts isn't just useful for exams or research papers - it's a way to make better sense of the world. Whether you're interpreting a news article, evaluating a medical claim, summarizing data to a colleague, or making a decision at work, these tools can help you reason more clearly and make sound judgments based on evidence. Once you grasp them, they become part of your lifelong toolkit for thinking critically and navigating uncertainty.
We also introduced our new sister project, Hypothesize, which you can use right in your browser, for free, to apply all these concepts (and more!) to real data.
We want to improve Hypothesize further over time, so your feedback is crucial! If you give it a try, we'd love to hear about your experiences. You can share them with us via the "Give Feedback" button at the top of the page.