Here’s one of those articles that I carry around, bound to me by neural Velcro, stored in Instapaper, and gestating in background mode: When Correlation Is Not Causation, But Something Much More Screwy. It’s a 2012 guest piece in The Atlantic by UCLA sociology professor Gabriel Rossman, merging two 2010 posts from his blog, Code and Culture. Another blogger he follows wanted to know if a CNN reader poll that described Megan Fox as both the “worst” and “sexiest” actress was correlation or causation. Rossman says in the first of his two blog posts, “My answer is neither, it’s truncation,” which he then synonymized as “censored”:
What you have to understand is that the question is implicitly about famous actresses….
This means that when we ask about the association of acting talent and sexiness amongst the famous, we have censored data where people who are low on both dimensions are censored out. Within the truncated sample there may be a robust negative association, but the causal relationship is very indirect….
To illustrate this for a population of aspiring actors, he ran a simulation of the relationship between “body” and “mind,” both metrics normally-distributed and assumed orthogonal to each other. He further divided the data into “failed aspirants” and “working actors,” with the latter defined by imagining that “casting directors jointly maximize talent and looks so only the aspiring actors with the highest sum for these two traits actually get work in Hollywood.” (Casting directors who don’t know they are using a joint maximization function may be distressed to see their efforts reduced to a mere 2 lines of code in Rossman’s first blog post.)
Here is the graphical result, with the axes scaled in standard deviations:
The light open circles (“Unobserved”) and solid triangles (“Observed”) are the failed aspirants and working actors, respectively. Here is Rossman’s summary in The Atlantic:
Among those actors we can readily observe there then will be a negative correlation between looks and talent, even though there is no such correlation in the grand population. If we see only the working actors without understanding the censorship process we might think that there is some stupefaction of being ridiculously good-looking.
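For readers who want to see the truncation effect for themselves, here is a minimal Python sketch of the kind of simulation described above. It is not Rossman's code: the sample size, the 10% cut-off for "working actors," the random seed, and the function names `pearson` and `simulate` are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=10_000, top_fraction=0.10, seed=0):
    """Return (r_all, r_working) for independent 'body' and 'mind' scores.

    Assumption: 'working actors' are the top `top_fraction` of aspirants
    ranked by body + mind, standing in for the casting directors' rule.
    """
    rng = random.Random(seed)
    # Two independent standard-normal traits per aspirant: (body, mind).
    aspirants = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    # Keep only the aspirants with the highest sum of the two traits.
    ranked = sorted(aspirants, key=lambda p: p[0] + p[1], reverse=True)
    working = ranked[: int(n * top_fraction)]
    r_all = pearson(*zip(*aspirants))
    r_working = pearson(*zip(*working))
    return r_all, r_working


if __name__ == "__main__":
    r_all, r_working = simulate()
    print(f"all aspirants:  r = {r_all:+.2f}")
    print(f"working actors: r = {r_working:+.2f}")
```

Under these assumptions the correlation for the whole population should sit near zero, while the correlation among the selected "working actors" should come out clearly negative, even though nothing in the code links the two traits.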
Rossman subsequently learned that this “logical fallacy” is already well-known as “conditioning on a collider.” This less-intuitive name was coined by his UCLA colleague in computer science and statistics, Judea Pearl, who developed structural (graphical) models for causality. A collider can appear in such network-like models as a node that “blocks the association between the variables that influence it.” In Rossman’s simulation, the variables are body and mind, and the collider is the maximized sum for looks and talent.
To stay in the jargon, the body-mind graph for the working actors has been confounded (obscured or complicated) by this collider, leading to a non-causal association (false correlation) between the two independent variables. If you are big into statistical control, as expressed by phrases such as “we controlled for age, history of smoking, and diet in study participants,” then an unrecognized collider would be a scary thing indeed.
The short answer is “Probably not.” You should keep in mind, however, that I claim no expertise in this cacophony of causing and censoring, colliding and conditioning, confounding and controlling and correlating. (That’s a lot of commencing with “c.” Coincidence? Sí.) Nonetheless, I do get Rossman’s general point well enough to make two comments vis-à-vis SenseMaker analytics.
Firstly, there is a superficial similarity of the body-mind graph (above) to a stones canvas, with its centered “origin” and two orthogonal axes. Neither of those pseudo-dyadic axes, however, is likely to host normally-distributed data. In fact, it would not be unusual to see stones data clustered in, say, the upper-right corner of a canvas. In contrast, that could only happen in the body-mind graph if the aspiring population included, and the casting directors had managed to identify, a bunch of Megan Fox or George Clooney wannabes with +4-sigma IQs of 170 or more.
Secondly, there is the less-obvious similarity of the body-mind graph to a triad. To see this requires changing it from a right triangle in disguise to an equilateral triangle. Here’s how that can happen schematically, showing only a few hypothetical data points (red) for working actors:
Step (1)➔(2) isolates the first (upper-right) quadrant of Rossman’s graph (1). Surprisingly, the remaining data in (2) have “compositional” traits:
Step (2)➔(3) incorporates those four bullets, invoking the closure constraint by capping each axis at 100%; connecting the X and Y endpoints (dashed line); and recasting the origin as Z, the third member of each coordinate triple. In the context of Rossman’s graph, Z is the complement of the sum for looks and talent:
Z = 100 – (X+Y).
In other words, you can think of this third vertex as a “placeholder” for the collider.
If you still can’t see (3) as a disguised triad, here is an enlarged view showing some equi-percentage lines (yellow) for each component, with mutual intersections of 45° or 90°, as expected in a right triangle, as opposed to the universal 60° in a triad. The transition of Step (3)➔(4) is then just a matter of morphing the right triangle into an equilateral one.
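The closure step and the morphing into an equilateral triangle can be sketched in a few lines of Python. This is an illustration only: the function names `close_composition` and `to_equilateral` are hypothetical, and the placement of the vertices (Z at lower left, X at lower right, Y at the top) is an assumption, not necessarily how any particular triad is drawn.

```python
import math


def close_composition(x, y, total=100.0):
    """Apply the closure constraint: Z = total - (X + Y)."""
    if x < 0 or y < 0 or x + y > total:
        raise ValueError("x and y must be non-negative and sum to at most total")
    return x, y, total - (x + y)


def to_equilateral(x, y, z):
    """Map a composition (x, y, z) to 2-D coordinates in a unit equilateral triangle.

    Assumed vertex layout: Z at (0, 0), X at (1, 0), Y at (0.5, sqrt(3)/2).
    Each point is the weighted average of the three vertices.
    """
    total = x + y + z
    px = (x + 0.5 * y) / total
    py = (math.sqrt(3) / 2 * y) / total
    return px, py


if __name__ == "__main__":
    # A hypothetical respondent at X = 30%, Y = 50%, hence Z = 20%.
    x, y, z = close_composition(30, 50)
    print((x, y, z), "->", to_equilateral(x, y, z))
```

In the right triangle the same point would simply be plotted at (X, Y); the mapping above shears and rescales that picture so that all three components are treated symmetrically.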
The short answer is “Read on.” Also don’t worry. That lone equation above was it. I’m not going to tell you that there is a bunch of math to learn for your project and story data. The reason is simple: you are already working with the equilateral triangle in (4). Instead, the guidance that this post can offer is heuristic and not especially robust.
Imagine that you have a triad from a project, or more likely for a cohort within the project, where most or all of the data points hug one of the legs of the triangle, call it XY. Obviously those respondents did not resonate with the choice presented at the opposing vertex Z when they were signifying their stories, leaving you with a dyad-like line of data along XY. So, you might be asking yourself, did I miss some variable(s)? If I could go back for another round of story collection, what label or property or characteristic would I place at Z in the hope of creating additional discrimination and insight by moving points out into the triad?
Please note, however, that those XY data appear negatively correlated, exactly the characteristic of conditioning on a collider that Rossman illustrated through his body-mind simulation. This prompts a subtle, but potentially more important question: Does that negative correlation make sense in context? In the context of the project, the cohort (if any), and the total population of respondents? This question matters because, with Z at or near zero, the closure constraint reduces to Y = 100 – X, so only one of X and Y (but not both) is free to vary. The resulting negative correlation is an artifact of the closure constraint, though it should still make sense in context.
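To make the closure artifact concrete, here is a small Python sketch with made-up data. The sample size, the ceiling on Z, the seed, and the function names `pearson` and `leg_hugging_sample` are assumptions for illustration, not values from any real project.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def leg_hugging_sample(n=500, max_z=5.0, seed=1):
    """Generate hypothetical triad points (x, y, z) that hug the XY leg.

    Z is drawn between 0 and `max_z` percent, X is drawn freely from what
    remains, and Y is whatever closure leaves over: Y = 100 - Z - X.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        z = rng.uniform(0, max_z)
        x = rng.uniform(0, 100 - z)
        y = 100 - z - x
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    xs, ys, _ = zip(*leg_hugging_sample())
    print(f"r(X, Y) = {pearson(xs, ys):+.3f}")
```

Here X is chosen at random and Y has no say of its own, yet the two come out almost perfectly negatively correlated. That is the sense in which the correlation along the leg comes from the constraint rather than from anything the respondents were telling you about X and Y.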
Alternatively, does that negative correlation suggest something “screwy” (as in the title of Rossman’s article)? Something like the “stupefaction of being ridiculously good-looking”? If that is the case, then the lack of resonance among the respondents to Z could be telling you that there is a collider, unrecognized and hidden, whose absence concentrated the responses along the XY leg of the triad.
Again very schematically, if the missing property or characteristic had, in fact, been at Z, the responses might have been in a very different location in the triad, perhaps with little weight given to the XY leg and little reason to think about a negative correlation between X and Y. In its absence, however, it becomes a hidden collider. The resulting projection toward the XY leg might be not only artifactual but also nonsensical.