Algorithms Beat Intuition - the Evidence is Everywhere

Would you trust a computer to drive your car?

Google’s self-driving cars are now driving on city streets, and are rapidly becoming better drivers than humans in many respects. A recent Google blog post explains: http://googleblog.blogspot.co.uk/2014/04/the-latest-chapter-for-self-driving-car.html

Surprised? You shouldn’t be. Studies have shown that rules-based models can consistently outperform human judgment, so it seems reasonable to believe that software, given the right set of rules, could be better at detecting stop signs or pedestrians than a distracted human driver. But the passenger faces a real problem when the driver’s decision and the computer’s decision conflict. Which should you trust?

I was at a conference a few months ago where Wes gave his “Are You Trying Too Hard” talk, which explores two contrasting approaches to decision-making: 1) the “clinical” method, which relies on an expert’s experience and judgment, and 2) the “actuarial” method, which relies on statistical trends.

It’s a great lecture that explores several of the many studies conducted across academia, covering a remarkably wide range of domains: credit risk assessment, academic performance, medical diagnosis, business success/failure, game outcomes and point spreads, career satisfaction, fitness for military service, criminal recidivism, wine tasting, advertising sales, and the list goes on.

I recall that one particular red-faced skeptic became very agitated when presented with the evidence as applied to the stock market.
“You have cherry picked the data! These actuarial approaches may work in medicine, but it’s obvious they can’t work on Wall Street!”

He was practically yelling (and sweating profusely).

As it turns out, this is actually a pretty old debate – over half a century old, in fact.

In 1954, Paul Meehl published “Clinical versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence,” a seminal book that evaluated the research available at the time and concluded that:
There is no convincing reason to assume that explicitly formalized mathematical rules and the clinician's creativity are equally suited for any given kind of task, or that their comparative effectiveness is the same for different tasks. Current clinical practice should be much more critically examined with this in mind than it has been.

While this seems like a reasonable and modest statement that illuminates the contours of the models versus experts debate, it set off an intellectual firestorm that still reverberates in academia today. “The data is cherry picked! This won’t hold when you look at different domains!”

Yet subsequent research continued to build the case. It appeared that systematic decision-making beat the experts practically everywhere the academics looked.

In 2000, Grove, Zald, Lebow, Snitz and Nelson published “Clinical Versus Mechanical Prediction: A Meta-Analysis” (a copy of which can be found here: http://datacolada.org/wp-content/uploads/2014/01/Grove-et-al.-2000.pdf), which reviewed decades of such studies, covering every study that the authors were able to identify. The study concluded:
…We identified no systematic exceptions to the general superiority (or at least material equivalence) of mechanical prediction.

Many found, and continue to find, these ideas to be offensive. It would seem humans sometimes display overconfidence when it comes to questions about their judgment.

The red-faced skeptic at the conference is a case in point. He rejected the very idea that the distinction between clinical and actuarial methods could matter in his own domain, Wall Street, where he was convinced that actuarial methods could not possibly work. After all, he reasoned, Wall Street must be different, since the experts there can access such a wide variety of information and can apply diverse skills and professional judgment.

As Meehl recognized, these are two clearly distinguishable methods, which are easily tested, irrespective of domain. For some reason, people can’t seem to get their minds around this fact.

In 1986, 32 years after the publication of his book and with a great deal more research to draw on, Meehl followed up with a more explicit statement:
You have two quite different procedures for combining a finite set of information to arrive at a predictive decision…[These two procedures] disagree a sizeable fraction of the time…The plain fact is that [a decision maker] cannot act in accordance with both of [two] incompatible predictions. Nobody disputes that it is possible to improve clinicians’ practices by informing them of their track records actuarially. Nobody has ever disputed that the actuary would be well advised to listen to clinicians in setting up the set of variables.

Despite the clarity of this explanation, many people persist, perhaps via some form of cognitive dissonance, in failing to even consider the evidence.

So why might actuarial models succeed where experts fail? What makes it so difficult for experts to use their intuition as effectively as a model? As Grove et al. put it in “Clinical Versus Mechanical Prediction”:
Humans are susceptible to many errors in clinical judgment. These include ignoring base rates, assigning nonoptimal weights to cues, failure to take into account regression toward the mean, and failure to properly assess covariation.
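
To make the first of these errors concrete, here is a minimal sketch of what ignoring base rates looks like in numbers. It is only an illustration: the function name and all three input values are hypothetical and do not come from Grove et al. or any of the studies discussed here.

```python
def posterior(base_rate, sensitivity, false_positive_rate):
    """P(condition | positive signal), computed with Bayes' rule."""
    true_pos = base_rate * sensitivity
    false_pos = (1.0 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs, chosen only for illustration:
# the condition is rare (1%), the signal fires 90% of the time when the
# condition is present, and 10% of the time when it is absent.
p = posterior(base_rate=0.01, sensitivity=0.90, false_positive_rate=0.10)
print(f"Probability the condition is present given a positive signal: {p:.3f}")
```

Under these assumed numbers, a judge who focuses on the 90% hit rate and forgets how rare the condition is will badly overestimate the answer, which works out to roughly 8%. A formula applies the base rate every time.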

In short, intuition has limits, and is flawed.

We turn to the psychologist Herbert Simon for a useful definition of intuition:
The situation has provided a cue; this cue has given the expert access to information stored in memory, and the information provides the answer. Intuition is nothing more and nothing less than recognition.

The expert’s “recognition” is then, in a certain sense, just another model, and can sometimes offer an incomplete or biased picture of reality. Which brings us back to Google’s self-driving cars. Google recognizes that if it can chip away at the huge range of signals in the driving environment – buses, stop signs, cyclists, pedestrians – its self-driving cars will become better and better at driving. The more comprehensive the actuarial driving model, the better the self-driving cars will perform. As the blog post states:
A self-driving vehicle can pay attention to all of these things in a way that a human physically can’t – and it never gets tired or distracted.

And models can be extended, often far beyond the capabilities of experts. They might, for example, have the ability to recognize subtle signals.

Daniel Kahneman, in “Thinking, Fast and Slow,” describes how this can be an advantage:
Statistical algorithms greatly outdo humans in noisy environments for two reasons: they are more likely than human judges to detect weakly valid cues and much more likely to maintain a modest level of accuracy by using such cues consistently.
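
The consistency point can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a reproduction of any study cited here: the cue strength (0.2), the noise levels, the number of cases, and the way the "judge" is modeled (same cues as the model, but with weights that drift randomly from case to case) are all hypothetical choices made for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(n_cases=5000, n_cues=5, seed=0):
    """Compare a consistent equal-weight model with an inconsistent judge."""
    rng = random.Random(seed)
    outcomes, model_scores, judge_scores = [], [], []
    for _ in range(n_cases):
        cues = [rng.gauss(0, 1) for _ in range(n_cues)]
        # Noisy environment: each cue is only weakly related to the outcome.
        outcomes.append(0.2 * sum(cues) + rng.gauss(0, 1))
        # Model: the same simple rule (equal weights) on every case.
        model_scores.append(sum(cues))
        # Judge: sees the same cues, but the weights vary from case to case.
        judge_scores.append(sum(rng.gauss(1.0, 1.5) * c for c in cues))
    return pearson(model_scores, outcomes), pearson(judge_scores, outcomes)


model_r, judge_r = simulate()
print(f"consistent model r = {model_r:.2f}, inconsistent judge r = {judge_r:.2f}")
```

In this setup neither predictor is very accurate, because the environment is mostly noise. The simulated judge has the same information as the model and is right about the cues on average, yet the model's forecasts correlate more strongly with outcomes simply because it applies the same weights every time.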

So how about it? Do you side with our red-faced skeptic, who might say that self-driving cars will never be on the road? Or do you think that Google’s algorithms might at some point overtake the ability of human drivers? It’s an open question, at least for now.

But what about Wall Street? Perhaps you are a red-faced skeptic who believes that models don’t work at all (and never will)? Or perhaps you believe that some day in the future, the best, most comprehensive models will beat experts? Or perhaps you believe, as we do, that today, even simple models can already beat experts, just as Meehl began arguing in 1954?

If you liked this post, don't forget to subscribe to Alpha Architect