Attributing the quip to former British prime minister Benjamin Disraeli, Mark Twain once wrote, “There are three kinds of lies: lies, damned lies, and statistics.” Given the foundational importance of statistics in modern science, this quote paints a bleak picture of the scientific endeavor. Thankfully, several generations of scientific progress have proved Twain’s sentiment to be an exaggeration. Still, we shouldn’t discard the wisdom in his words. While statistics is an essential tool for understanding the world, employing it responsibly and avoiding its pitfalls requires a delicate dance.
One maxim that should be etched into the walls of all scientific institutions is to visualize your data. Statistics specializes in applying objective quantitative measures to understand data, but there is no substitute for actually graphing it out and getting a look at its shape and structure with one’s own eyeballs. In 1973, statistician Francis Anscombe feared that others in his field were losing sight of the value of visualization, writing, “Few of us escape being indoctrinated” with the notion that “numerical calculations are exact, but graphs are rough.” To quash this myth, Anscombe devised an ingenious demonstration known as “Anscombe’s quartet.” Together with its wacky successor, the “datasaurus dozen,” nothing more dramatically communicates the primacy of visualization in data analysis.
To appreciate Anscombe’s quartet, let’s slip into the lab coat of a scientist. Suppose you’re interested in the relationship between how much people exercise and how much they sleep. You survey a random sample of the population about their habits, record their answers in a spreadsheet and run the results through your favorite statistics software. The resulting summary statistics look like this (this is just an example and is not based on real data):
Hours of exercise per week – Average: 7.5, Standard deviation: 2.03
Hours of sleep per day – Average: 9, Standard deviation: 3.32
Correlation between the two: .816
On average, the people in your sample exercise 7.5 hours per week and sleep 9 hours per day. Standard deviation measures how much variation there is in your sample. For both variables it’s moderately sized, indicating that most people you surveyed don’t veer too much from the averages. The two are highly correlated, which implies that people who exercise more are also likely to sleep more. The software also outputs a line of best fit, which describes the general trend of your data as the line below:
Given this summary, it might be tempting to suppose that the data look something like this:
Each dot represents one person in your survey and is positioned according to their personal sleep and exercise habits. The above depicts a strong upward linear trend, suggesting that as people exercise more, they also sleep more (perhaps because both are indicative of generally healthy lifestyles or because workouts are fatiguing) with little random variations characteristic of the messy real world. Anscombe showed that, amazingly, all four datasets below have the identical summary statistics:
(Anscombe’s data sets don’t actually correspond to any specific experiment, we’ve contrived one here for illustrative purposes). Dataset two, despite having the same statistical profile as dataset one, tells a completely different story when plotted. Here, the relationship is clearly not linear and for some reason exercise starts to taper off for people who sleep the most (perhaps because sleeping so much leaves little time for other activities). Plot three shows a perfect linear relationship, with one outlier who exercises an abnormal amount and skews the results. Plot 4 shows that almost everybody sleeps exactly 8 hours per day and it has no relationship to how much they exercise, while one person in the sample sleeps all day and presumably spends all of their waking time exercising. Notice how we actually draw very different conclusions from the same statistics once we visualize the data.
Despite its popularity, nobody knows how Anscombe concocted his famous quartet. Justin Matejka and George Fitzmaurice sought to rectify this and took the concept to its extreme, demonstrating a general purpose method for taking any data set and transforming it into any target shape of your choosing while preserving whichever summary statistics you want (up to two decimal places). The results are the “datasaurus dozen.”
All of the scatter plots above have the same summary statistics! Astute readers might notice that it’s actually a datasaurus baker’s dozen. The dinosaur dataset was actually the seed from which all of the others were generated (it’s an homage to data visualization expert Alberto Cairo’s tongue-in-cheek T. rex dataset.) This wonderful gif shows the plots transforming into one an another and tracking the changing stats on the side. Even the transition frames preserve the statistics. Clearly summary statistics alone tell an inadequate story.
Anscombe would probably be proud that his quartet lives on as a common pedagogical demonstration in modern statistics classes. As Yogi Berra said: “You can observe a lot by watching.”