Why Data Mining Is a Double-Edged Sword for Social Services
Drawing the right conclusions from client data in social services can be challenging. How do we mine data without drawing the wrong conclusions?
A cowboy in Texas is shooting empty bottles in front of a barn. His bullets miss every single bottle and instead hit the side of the barn. He then walks up to the barn and paints a bull’s-eye around each new hole in the wall, and says, “See? I was aiming there all along!”
You’d be right to be skeptical. After all, this behavior is an example of post hoc reasoning: deciding that something counts as evidence for your desired conclusion only after you have seen the evidence. The story illustrates an informal logical fallacy known as the Texas Sharpshooter Fallacy.
In social analytics, we’re often tempted to pick out patterns from only a few factors, despite the many interacting factors that can affect an outcome. After all, we humans are pattern-seeking animals.
While there are plenty of great articles on how to effectively mine data, this article outlines why data mining can be a double-edged sword and the scientific principles that mitigate its risks.
Correlations, Coincidences, and Post Hoc Reasoning
Let’s say you go on a date and find an uncanny number of similarities between you and your potential partner. You share similar interests, have similar favorite TV shows from your childhood, drive the same model of car, and even have mothers who share a first name! Journalist David McRaney [1] explains:
This is meant to be, you think. You are made for each other. . . But, take a step back. . . How many people in the world own that model of car? You are both about the same age, so your mothers are too, and their names were probably common in their time. Since you and your date have similar backgrounds and grew up in the same decade, you probably share the same childhood TV shows. . . You are lulled by the signal. You forget about noise. With meaning, you overlook randomness, but meaning is a human construction.
Post hoc reasoning about coincidences is a huge temptation when analyzing data in social services. However, keep in mind a key principle of behavioral science: no social phenomenon can be explained by a single factor.
Let’s say you add a new question to your intake process, and client requests for services fall by a small percentage after that change. Did the new intake question cause the drop-off? It’s probably more complicated than that.
One scientific way to test that hypothesis would be to remove the question and see whether the drop-off persists over a comparable period of time. If requests stay down without the question, something else is likely responsible.
Lots of factors can affect services performed or requested, including weather, socioeconomic factors, language barriers, and many more.
The key is to form your hypotheses before mining data, so you’re less likely to draw spurious conclusions.
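The intake-question check described above can be sketched as a simple permutation test, with the hypothesis ("requests dropped after the change") fixed before looking at the numbers. This is only an illustration under stated assumptions: the weekly counts are invented, and the function name `permutation_p_value` is hypothetical rather than part of any particular tool.

```python
import random

def permutation_p_value(before, after, n_shuffles=10_000, seed=0):
    """One-sided permutation test.

    Estimates how often a drop at least as large as the observed one
    would appear if the before/after labels were assigned at random.
    """
    rng = random.Random(seed)
    observed = sum(before) / len(before) - sum(after) / len(after)
    pooled = list(before) + list(after)
    k = len(before)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly reassign weeks to the two periods
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Hypothetical weekly service-request counts, invented for illustration.
before_change = [52, 48, 55, 50, 47, 53, 49, 51]
after_change = [49, 46, 52, 50, 45, 51, 47, 48]

p = permutation_p_value(before_change, after_change)
print(f"p-value for 'requests dropped after the change': {p:.3f}")
```

A small p-value only says the drop is unlikely to be week-to-week noise. It does not rule out weather, seasonality, or the other factors listed above, which is why removing the question and comparing again is still worthwhile.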
Data Mining’s Limits
There’s nothing inherently wrong with mining data to look for correlations, says Steven Novella, prominent skeptic and Yale medical professor [2].
“[S]ince random data are clumpy, we should expect to see accidental correlations even when there is no real underlying phenomenon causing the correlation. One methodological pitfall of data mining is not determining ahead of time what potential correlations are being searched for—so any correlation counts as a hit.”
Statisticians and analysts will likely know some of these pitfalls of data mining already, but for the marketing generalist or leader with only a cursory knowledge of statistics, this quote might seem surprising. Numbers don’t lie, do they?
Turns out they do. Sometimes.
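Novella’s point about clumpy random data can be seen in a small simulation. The sketch below fills 20 made-up "client factors" with pure noise and then counts how many pairs look correlated anyway. The function names, the number of factors, and the 0.28 cutoff (roughly the p < 0.05 threshold for 50 observations) are all assumptions chosen for illustration.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def count_accidental_hits(n_factors=20, n_clients=50, threshold=0.28, seed=0):
    """Count pairs of independent noise columns that exceed the threshold."""
    rng = random.Random(seed)
    # Every column is independent random noise: no real relationships exist.
    columns = [[rng.gauss(0, 1) for _ in range(n_clients)]
               for _ in range(n_factors)]
    pairs = list(combinations(range(n_factors), 2))
    hits = [(i, j) for i, j in pairs
            if abs(pearson(columns[i], columns[j])) >= threshold]
    return len(hits), len(pairs)

hits, total = count_accidental_hits()
print(f"{hits} of {total} pairs look correlated, though every column is noise")
```

With 190 pairs to search, a handful of "hits" is expected by chance alone. Deciding in advance which one or two correlations you are looking for removes most of those accidental hits from consideration.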
Seeing illusory patterns seems to have a lot to do with our sense of control. When we feel out of control, we often try to regain that sense of control by finding patterns, according to one psychological experiment:
The researchers asked 41 undergraduates to recall a situation in which they’d lacked control (such as being a passenger in a car accident) and another group to recall a situation in which they’d had full control (such as going into an exam well-prepared). Then the subjects read passages describing an event preceded by an action that may or may not have influenced the event. One passage asked them to imagine that they were successful marketers whose ideas were rejected after they failed to perform their customary ritual of stomping the ground three times before the meeting. The subjects who previously recalled an in-control experience were more likely to write this off as mere coincidence than were those who’d recalled being out of control.
This effect has been replicated in similar studies: when people feel a need to control their environment or experience, they go on high alert for patterns, a state sometimes called hyperactive pattern recognition.
It’s one of the reasons professional baseball players are famously superstitious—so much of what happens in a baseball game has little to do with their personal skill, depending instead on forces of randomness.
Don’t give up on your client data just yet. There are plenty of reasons to believe that an intake process can affect whether or how clients seek out your organization’s services, but there’s a limit to what we can know from correlational data.
That’s why Eccovia offers data science as a service, as well as a brand-new data warehouse called ClientInsight, so you can be more confident in the conclusions you draw from your data.
1. David McRaney. You Are Not So Smart, pp. 38–39. Gotham Books: New York, 2011.
2. Steven Novella et al. The Skeptics’ Guide to the Universe: How to Know What’s Really Real in a World Increasingly Full of Fake, pp. 82–83. Grand Central Publishing: New York, 2018.