  • Statistics in the Triad, Part VIIIb: Binning, or Where the Data Are Actually Concentrated

    The first section, ‘Data distribution in a triad’, of Part VIIIa listed some of the quantitative methods for comparison of story data in triads, including confidence regions, point-counting of clusters, and smooth contouring. The last of these uses kernel density estimation (KDE) to calculate a probability density function (PDF) for the data. This statistical alphabet soup is an elegant way of showing where the data are concentrated and is also visually appealing because of the continuous nature of the contours.

    Unfortunately, a storyteller in a SenseMaker project can intentionally place a story dot at the vertices or along the opposite legs of a triad (a relatively infrequent location for data in most ternary plots). An unexpected consequence of this freedom is that the subsequent mathematical steps in analyzing story data appear to introduce distortion in the KDE-PDF contours for such near-vertex, near-zero data. This problem is described in considerable detail in that prior post. Fortunately, for readers who don’t care about the details, the rest of this post can be read on its own.

    Turning a triad into a histogram

    The discrete, discontinuous alternative to drawing smooth, continuous KDE-PDF contours is simply to put the story dots in bins and count them. And the easiest way to do that is to superimpose a regular grid on the triad. In other words, we can turn a ternary plot into a two-dimensional histogram, and if we add color-coding then we have a triangular heat map.
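    As a rough illustration of the idea, here is a minimal Python sketch of triangular binning. It is not the ggtern implementation; the function names (`tribin`, `tribin_counts`), the bin labels, and the sample data are all assumptions made for this example. Each story dot is given by its three triad components, and a grid with n divisions per leg yields n² equal triangles, each identified by two grid indices plus an orientation (point-up or point-down).

```python
from collections import Counter
from math import floor

def tribin(a, b, c, n):
    """Return (i, j, orientation) for the triangular bin holding one story dot.

    a, b, c are the three triad components on any common scale (e.g. percent);
    n is the number of divisions along each leg. Hypothetical labeling scheme.
    """
    total = a + b + c
    u = n * a / total              # position along the first component, in grid units
    v = n * b / total              # position along the second component
    i = min(floor(u), n - 1)       # clamp so a dot exactly at a vertex stays in the grid
    j = min(floor(v), n - 1 - i)
    fa, fb = u - i, v - j          # offsets within the (i, j) grid cell
    # Each cell is split into a point-up and a point-down triangle;
    # cells on the far edge of the triad contain only the point-up one.
    inverted = (fa + fb > 1.0) and (i + j <= n - 2)
    return (i, j, "down" if inverted else "up")

def tribin_counts(points, n):
    """Count story dots per triangular bin (a two-dimensional histogram)."""
    return Counter(tribin(a, b, c, n) for a, b, c in points)

# Made-up story dots, expressed as percentages that sum to 100.
stories = [(80, 10, 10), (10, 80, 10), (10, 10, 80), (34, 33, 33), (100, 0, 0)]
counts = tribin_counts(stories, n=2)
for bin_id, count in sorted(counts.items()):
    print(bin_id, count)
```

    Mapping each count to a color then gives the triangular heat map. A dot that falls exactly on a grid line is assigned to one of the two neighboring bins by the floor operation, which is an arbitrary but consistent choice.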

    Here are examples from the website for ggtern, the R package written by Nicholas Hamilton for ternary plots and widely used in SenseMaker projects; the latest version (2.2.2, partly supported by QED Insight) now includes hexagonal and triangular bins:

    This implementation has several noteworthy features:

    • customizable color-coding for the binning scale;
    • specification of a bin width along each leg, e.g., n = 5 corresponds to a nominal 20% grid (but with different meaning for tribins vs. hexbins, see examples above and the discussion at the ggtern link); and
    • ability to display a calculated scalar value (e.g., mean age of respondents) for each bin (not shown above).
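
    The third feature, a scalar summary per bin, can be sketched in the same spirit. The Python below is only an illustration under stated assumptions, not ggtern's code: `tribin` is a hypothetical bin-labeling helper, `bin_means` is an invented name, and the records (three triad components plus an age) are made up.

```python
from collections import defaultdict
from math import floor

def tribin(a, b, c, n):
    # Hypothetical bin label: two grid indices plus "up" or "down" orientation.
    total = a + b + c
    u, v = n * a / total, n * b / total
    i = min(floor(u), n - 1)
    j = min(floor(v), n - 1 - i)
    inverted = (u - i) + (v - j) > 1.0 and i + j <= n - 2
    return (i, j, "down" if inverted else "up")

def bin_means(records, n):
    """records: iterable of (a, b, c, value). Returns {bin label: mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, c, value in records:
        key = tribin(a, b, c, n)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up records: triad components in percent, then respondent age.
records = [(70, 20, 10, 30), (65, 25, 10, 50), (10, 20, 70, 41)]
means = bin_means(records, n=5)    # n = 5 is the nominal 20% grid
print(means)
```

    Coloring each bin by its mean rather than its count answers a different question, so a bin holding only one or two stories deserves caution when read this way.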

    Nicholas eloquently captured the value of these additions for SenseMaker projects (and others):

    There are some subtle differences which give some added functionality, and together these will provide an additional level of richness to ternary diagrams produced with ggtern, when the data-set is perhaps significantly large and points themselves start to lose their meaning from visual clutter.

    How hot is your triad?

    Here are two examples from one of Laurie’s recent projects, comparing point-count clusters in the left-hand frame and triangular bins in the right-hand one:

    The number of story dots is small enough (a few hundred points in each) that they could be plotted on top of these tribins without undue overprinting. Hence, it is easy to verify by eye that the bins accurately show where the data are concentrated. In a data-rich project, on the other hand, the bin counts remain exact, but the perceived precision depends on the granularity of the bin width along the legs. (In the extreme case of a single bin, the triad itself, the accuracy is guaranteed to be 100%, but the precision is minimal if the data are not displayed or cannot be resolved.)

    The choice of tribins vs. hexbins is largely stylistic and aesthetic. Looking at the blue-scaled examples above, I find the hexagonal grid more visually appealing than the triangular one. But in practice I would favor tribins for three minor reasons:

    • A triangular mesh covers the entire figure with equal-size bins. Unless you have an unusual client or audience that is already familiar with the coordinate geometry of ternary plots, it would probably make life easier for them and you not to have to explain the presence in the hexagonal mesh of whole, one-half, and one-sixth fractional bins. Imagine a puzzled voice asking, “Which is more important in interpreting these responses, a light blue bin in the interior, or a little wedge of the same color in the corner?…” There is no good reason even to open the door to that kind of question.
    • In the event that you needed to consider equal-area-density of responses, you would have to make an adjustment for those fractional hexbins, but not for tribins. (As I will discuss in the next post in this series, this is not the remote possibility that it might seem on first reading.)
    • In a client workshop or other setting where you wanted to choose a precise subset of stories for analysis or theming, this would be easier with tribins: as little as drawing three bounding lines with a cursor. With hexbins it is a more tedious exercise.
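
    The "three bounding lines" selection in the last point amounts to three inequalities, one per triad component. The following Python sketch shows this under stated assumptions: the function name `in_triangle`, the thresholds, and the story tuples are all hypothetical, and the selected region is a point-up triangle bounded by the lines a = a_min, b = b_min, and c = c_min.

```python
def in_triangle(a, b, c, a_min, b_min, c_min):
    """True if a story dot lies inside the triangle bounded by three lines.

    a, b, c are the triad components on any common scale; the thresholds
    are fractions of the whole and should sum to less than 1.
    """
    total = a + b + c
    return (a / total >= a_min
            and b / total >= b_min
            and c / total >= c_min)

# Made-up stories as (a, b, c) percentages.
stories = [(70, 20, 10), (50, 30, 20), (65, 15, 20)]

# Select the region with at least 60% on the first component
# and at least 20% on the second.
selected = [s for s in stories
            if in_triangle(*s, a_min=0.6, b_min=0.2, c_min=0.0)]
print(selected)
```

    Because tribin edges follow lines of constant component value, any single point-up tribin, or any larger point-up triangle assembled from tribins, can be selected with thresholds chosen as multiples of 1/n.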
