As I write this, we’re all hunkered down “flattening the curve” of the COVID-19 global pandemic.
The good news: Lots of Very Smart People have created many predictive analytics models to help us manage the pandemic. We’re all being exposed to these predictive analytics far more often than most of us have been in our pre-COVID-19 lives.
The bad news: Many of these models use different inputs and different heuristics, and come to different conclusions (some slightly different, some significantly).
The differences in these models brought to mind the aphorism that inspired the title, generally attributed to the statistician George Box. There’s lots of wisdom there, so let’s unpack it.
What do we mean by “wrong”?
Analytics models are “wrong” in the same way that maps are “wrong”—that is to say, they’re necessarily simplified and idealized.
“I have a map of the United States... Actual size. It says, 'Scale: 1 mile = 1 mile.' I spent last summer folding it. I hardly ever unroll it. People ask me where I live, and I say, ‘E6’.”
― Steven Wright
Everyone understands that “the map is not the territory”. Similarly, the analytic is not the data—an analytic model requires the best-available data (just as maps do), but the path from data to analytic is a lossy process.
Models are also like maps in that there are many types. Different types of analytic models can be applied to the same source data, and each may offer different kinds of insight, helping to uncover a larger “truth” that can be used to help improve the chance of reaching desired outcomes.
Wrong can be better
It might be tempting to think of model simplification and idealization as a shortcut. However, sometimes it’s necessary for usability and decision-making.
Subway maps are a great example of this. They’re often radically different from the actual geometry of the system, as shown in this animated morph between (1) the map of the Paris Metro subway system and (2) the real-world geometry of the system.
Importantly, the approximation of the subway map adds clarity and reveals the internal logic of the system in a way that the geometric map does not.
All models are approximations
To get from data to an analytic, models must make assumptions. This means that all analytics, to a greater or lesser degree, are approximations.
Some assumptions are explicit, meaning that a human has made decisions in the process of creating an analytic.
As a simple example, consider the “views” count of a video advertisement—a seemingly simple metric used as an input for lots of analytic models. Do we count it as a view only when the entire ad plays? Or must it play for a minimum time, or for a minimum percentage of the ad duration? And during that time, must every pixel of the video advertisement be in view 100% of the time?
Interestingly, even in the case of this incredibly simple metric, there is no universal standard. YouTube counts an ad as “viewed” if 30 seconds of an un-skippable ad plays, while Facebook happily counts three seconds of playback as a view. LinkedIn lowers that to two seconds, and only requires 50% of the video to be in view.
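To make the effect of those explicit assumptions concrete, here is a minimal Python sketch that counts “views” for the same playback data under three different rules. The time thresholds are the ones quoted above; the rule names, the `Playback` fields, and the sample data are hypothetical, and the sketch assumes no in-view requirement where the text doesn't mention one.

```python
from dataclasses import dataclass


@dataclass
class Playback:
    seconds_played: float    # how long the ad played
    fraction_in_view: float  # share of the video visible on screen, 0.0 to 1.0


# (minimum seconds, minimum fraction in view) per rule.
# Seconds come from the text above; 0.0 means "no in-view requirement assumed".
RULES = {
    "youtube_style": (30.0, 0.0),
    "facebook_style": (3.0, 0.0),
    "linkedin_style": (2.0, 0.5),
}


def count_views(playbacks, min_seconds, min_fraction):
    """Count playbacks that satisfy both thresholds of one rule."""
    return sum(
        1
        for p in playbacks
        if p.seconds_played >= min_seconds and p.fraction_in_view >= min_fraction
    )


# Made-up playback records, purely for illustration.
playbacks = [
    Playback(45.0, 1.0),
    Playback(10.0, 1.0),
    Playback(3.0, 0.4),
    Playback(2.5, 0.8),
    Playback(2.0, 0.6),
    Playback(1.0, 1.0),
]

view_counts = {
    name: count_views(playbacks, seconds, fraction)
    for name, (seconds, fraction) in RULES.items()
}
print(view_counts)
```

With this sample data, the same six playbacks produce one, three, or four “views” depending on the rule, before any model has even touched the number.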
Other assumptions are implicit, meaning that they haven’t been expressed and may not even be known. They may be a side-effect of the algorithm used, or even of the data used by an AI algorithm.
One critical aspect of data science is attempting to understand the implicit assumptions that the data, the process, and the algorithms used may be making. This is often hard, since it may be difficult (and sometimes impossible) to determine how an AI algorithm arrived at a result.
For example, an AI algorithm may be trained on decades of data. If the data itself has bias represented in it, the AI algorithms using that data will also be biased. Famously, the face recognition system that Amazon sells to police departments falsely matched 28 members of Congress with mugshots, with a disproportionate share of those false matches involving Congresspeople of color.
The false authority of exact numbers
According to a 2015 study of mergers and acquisitions, investors who offer “precise” bids for company shares yield better market outcomes than those who provide round-numbered bids.
Models generally result in exact numbers instead of rounded ones. The problem with that is that our pattern-seeking brains interpret those exact numbers as “more authoritative” than rounded ones, even though both are estimates.
So, beware exact numbers. As a reader, remind yourself that any numbers representing things that haven’t happened yet are estimates. As a model creator, help people who will consume the output of your analytic model understand that although a model’s output may appear to be precise, it’s a guess that is sure to be wrong (but hopefully right enough to be useful).
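One way a model creator might do that is to round a reported number so it is no more precise than its uncertainty. The Python sketch below is only an illustration of that idea: the function name, the forecast value, and the uncertainty figure are all hypothetical, and it assumes the model can supply some plus-or-minus estimate alongside its output.

```python
import math


def round_to_uncertainty(estimate: float, uncertainty: float) -> float:
    """Round an estimate to the decimal place of the uncertainty's leading digit."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Position of the uncertainty's leading digit, e.g. 2500 -> 3, 0.02 -> -2.
    exponent = math.floor(math.log10(uncertainty))
    return round(estimate, -exponent)


# Hypothetical model output and its hypothetical plus-or-minus range.
raw_forecast = 18342.71
plus_or_minus = 2500.0

rounded = round_to_uncertainty(raw_forecast, plus_or_minus)
print(f"about {rounded:,.0f} (give or take {plus_or_minus:,.0f})")
```

Reporting “about 18,000” rather than “18,342.71” says the same thing the model knows, without implying precision it doesn't have.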
Assumptions and “black swan” events
A black swan event is a difficult-to-predict event after which “normal” is no longer normal. As I write this, a recent example is the ongoing 2019–20 coronavirus pandemic.
The societal changes we’ve made as a result of COVID-19 were unimaginable by most just a few months earlier. Although larger companies (Teradata included) had pandemic playbooks, the pandemic has been a litmus test for the explicit and implicit assumptions baked into analytic models.
Many models continued to “just work”, through foresight and (sometimes) luck. Many simply broke, resulting in some of the temporary operational chaos we’ve seen in early 2020.
Models are useful
Yes, they’re imperfect. Models are approximations and depend on assumptions, implicit and explicit. It’s important to never forget that all models are “wrong”, and that that’s not only okay, but desirable.
And yet, analytic models are incredibly useful and important, and a primary tool for getting from “data” to “insight”. Analytic models are how our customers extract value from a truly incredible amount of data, turning that data into actionable business practices to get the outcomes they want.