• Statistics in the Triad, Part III: Random Data

    Story data in a SenseMaker triad tend to cluster in one of seven locations — the three vertices, the midpoints of the three edges, and the center. It’s also common to find “stringers” between the center and one or more of the other six loci (i.e., along the altitudes of the triangle); there are generally fewer, scattered points elsewhere in a typical triad. Here’s an example:

    normative_triad

    Both the presence and density of story points depend, of course, on each respondent’s reaction to the lead-in (“The events in the story happened…”), and also, importantly, on the precision with which the person places a dot on paper or a fingertip on a touchscreen. Assuming that the prompting question and signifiers are well-designed, the cumulative pattern across all users should enable discernment of the influences and modulators of their experiences.

    What happens, however, if the data cluster elsewhere… or nowhere? In the extreme, what if the data are random? Despite decades of using ternary plots in my prior life as a lab scientist, I had never considered this question. (I’d like to think this says more about the inherent regularity of the chemistry of volcanic rocks than it does about my powers of imagination, but never mind.)

    That changed in late October, during a workshop, when Laurie presented initial results for a client’s project that had yielded more than 1400 stories from participants spread across seven principal cohorts. By far the smallest cohort, a group of “community leaders,” had told only 50 stories. Even in the data for this small subset, most of the triads looked typical, but there were a few where, if you didn’t already know the pattern, you might not have been quick to define a norm.

    workshop_sm

    One of the latter triads (see final image pair, below) prompted a rhetorical question from a member of the client team, asking about the scatter and seeming “lack of definition” (my words, not hers) in the distribution of points. During the next break, I asked her if she could imagine a scenario in which these community leaders might have placed their points in that particular triad with such a degree of generality or perhaps casualness that they could appear to be random. That led, in turn, to the question of what randomness in a triad would actually look like.

    The answer is not surprising — randomness looks, well, random (see immediately below). But the question has to be addressed, just to be sure, because the closure constraint on ternary data (summation to 100%) produces counter-intuitive effects. Chief among these are spurious negative correlations and the need to represent an “average” by the geometric mean, rather than the more familiar arithmetic mean. (These are discussed in Part I and Part II of this series.)

    Here are plots of 50 and 500 points generated in Excel with its standard RANDBETWEEN function:

    rand50-rand500_comp

    These data were calculated in a “cycle” of three steps. For the first point (first row in Excel):
    1. find the A-coordinate (lower-left vertex) with RANDBETWEEN(0,100), which gives an integer value between 0 and 100 inclusive;
    2. find the B-coordinate (top) with RANDBETWEEN(0,100-A); and
    3. find the C-coordinate (lower-right) by subtraction: C = 100-A-B.

    Then repeat for the second point (row), but now in order 1:B, 2:C, 3:A; repeat for the third point (row) in order 1:C, 2:A, 3:B; then back to the first cycle for the fourth row, etc. This “supercycle” minimizes the clustering of values near the vertices that seems to arise in some recalculations of the data when all rows initiate on the same vertex.
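
    For readers who would rather not work in Excel, the three-step cycle and the rotating “supercycle” can be sketched in Python. This is only an illustration under stated assumptions: the function name `random_triad_points` and the `seed` parameter are mine, and Python’s `random.randint` stands in for RANDBETWEEN (both return integers with inclusive bounds).

```python
import random


def random_triad_points(n, seed=None):
    """Return n (A, B, C) integer triples that each sum to 100.

    Hypothetical helper, not from the original spreadsheet: it mimics
    the RANDBETWEEN procedure and rotates the starting vertex by row.
    """
    rng = random.Random(seed)
    points = []
    for row in range(n):
        # Step 1: first coordinate, like RANDBETWEEN(0,100)
        first = rng.randint(0, 100)
        # Step 2: second coordinate, like RANDBETWEEN(0,100-first)
        second = rng.randint(0, 100 - first)
        # Step 3: third coordinate by subtraction
        third = 100 - first - second

        # Supercycle: rotate which vertex is drawn first on each row.
        shift = row % 3
        if shift == 0:        # order 1:A, 2:B, 3:C
            a, b, c = first, second, third
        elif shift == 1:      # order 1:B, 2:C, 3:A
            b, c, a = first, second, third
        else:                 # order 1:C, 2:A, 3:B
            c, a, b = first, second, third
        points.append((a, b, c))
    return points


if __name__ == "__main__":
    for point in random_triad_points(6, seed=1):
        print(point)
```

    One reason the rotation matters: whichever coordinate is drawn first is spread evenly over 0 to 100, so large values of it (points near its vertex) turn up more often than they would if points were spread evenly over the triangle's area. Rotating the starting vertex shares that bias among all three corners instead of concentrating it on one.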

    The shortcomings in Excel’s random number generator are widely acknowledged, but the end result is surely good enough for purposes of this post. And definitely superior to what the Trolls in Accounting provided to Dilbert some years ago.

    Now we can compare that most-scattered triad for the community leaders with a plot of 50 random numbers. The unlabelled plot of actual data from the workshop was prepared by Laurie in Tableau, and the concentric circles and half-altitudes are part of her standard template. Similarly, the random-number plot has vertical gridlines and x-y axes that are artifacts of transforming the 3-component data to plot in Cartesian coordinates, since Excel cannot create “native” ternary diagrams. If you can ignore all these technical add-ons, the two distributions of points are fairly similar. In fact, if you can picture the random-number triangle rotated 120° CCW, there are some surprisingly good correspondences.

    t4-random50_comp
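
    The transformation of 3-component data to Cartesian coordinates mentioned above is simple geometry. Here is a minimal sketch, assuming the vertex layout used in this post (A at lower-left, B at top, C at lower-right) and a triangle with sides of unit length; the function name `ternary_to_xy` is my own and is not taken from the Excel workbook.

```python
import math


def ternary_to_xy(a, b, c):
    """Map a ternary point (A, B, C) to x-y coordinates.

    Assumed layout: A at (0, 0), C at (1, 0), B at (0.5, sqrt(3)/2).
    The inputs may sum to 100, to 1, or to any other positive total.
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("components must sum to a positive number")
    # C pulls the point to the right; B pulls it halfway right and up.
    x = (c + 0.5 * b) / total
    y = (math.sqrt(3) / 2) * b / total
    return x, y


if __name__ == "__main__":
    print(ternary_to_xy(100, 0, 0))   # vertex A
    print(ternary_to_xy(0, 100, 0))   # vertex B
    print(ternary_to_xy(0, 0, 100))   # vertex C
```

    Plotting the resulting x-y pairs in an ordinary scatter chart gives the triangle; the gridlines and axes are then leftovers of the chart type rather than part of the triad.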

    Full disclosure: I chose this particular random pattern from the several tens of trials that I ran precisely because of the degree of visual similarity. The point is not to show, however, that the community leaders were thinking randomly. Actually, I don’t know what that would mean. There is nothing “random” about respondents’ stories and signifiers; they presumably knew exactly what they meant! Instead, I think of it as a minor cautionary tale: especially with small-sample projects or sub-cohorts, there can be a gradation from well-defined response patterns to ones that are visually random. This is just what showed up in the data at the workshop for the small cohort.

    The inevitable question, then, is how an analyst or a subject-matter expert should deal with the decreasing utility of data across the range from well-defined, highly aggregated patterns to random-looking, highly scattered ones. If there isn’t an app for that, is there at least a comparative metric? These are much more general questions that apply to a project of any size, and I will come back to them in the next post.
