The first section, ‘Data distribution in a triad’, of Part VIIIa listed some of the quantitative methods for comparison of story data in triads, including confidence regions, point-counting of clusters, and smooth contouring. The last of these uses kernel density estimation (KDE) to calculate a probability density function (PDF) for the data. This statistical alphabet soup is an elegant way of showing where the data are concentrated and is also visually appealing because of the continuous nature of the contours.
Unfortunately, a storyteller in a SenseMaker project can intentionally place a story dot at the vertices or along the opposite legs of a triad (a relatively infrequent location for data in most ternary plots). An unexpected consequence of this freedom is that the subsequent mathematical steps in analyzing story data appear to introduce distortion in the KDE-PDF contours for such near-vertex, near-zero data. This problem is described in considerable detail in that prior post. Fortunately, for readers who don’t care about the details, the rest of this post can be read on its own.
The discrete, discontinuous alternative to drawing smooth, continuous KDE-PDF contours is simply to put the story dots in bins and count them. And the easiest way to do that is to superimpose a regular grid on the triad. In other words, we can turn a ternary plot into a two-dimensional histogram, and if we add color-coding then we have a triangular heat map.
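The binning idea above can be sketched in a few lines of Python. This is only an illustration of counting story dots on a regular triangular grid, not the algorithm ggtern uses. The function names (`tribin_index`, `tribin_counts`), the `(i, j, orientation)` bin label, and the tie-breaking rule for dots that fall exactly on a grid line are all assumptions made for this example. Dividing each leg into `n` equal parts yields `n * n` small triangles, and dots placed exactly at a vertex or along a leg are clamped into the nearest edge bin rather than being dropped.

```python
from collections import Counter
from math import floor


def tribin_index(a, b, c, n):
    """Return a hypothetical (i, j, orientation) label for the triangular bin
    containing the ternary point (a, b, c), with n divisions per leg."""
    total = a + b + c
    if total <= 0 or min(a, b, c) < 0:
        raise ValueError("ternary components must be non-negative with a positive sum")
    # Close the composition and scale the first two components to grid units.
    x = n * a / total
    y = n * b / total
    # Clamp so that dots exactly on a vertex or leg stay inside the triad.
    i = min(floor(x), n - 1)
    j = min(floor(y), n - 1 - i)
    fa, fb = x - i, y - j
    # Each rhombus cell (i, j) holds an "up" triangle and, away from the
    # outer edge, a "down" triangle; the fractional parts decide which one.
    is_down = (fa + fb > 1.0) and (i + j < n - 1)
    return (i, j, "down" if is_down else "up")


def tribin_counts(points, n):
    """Count story dots per triangular bin: a two-dimensional histogram."""
    return Counter(tribin_index(a, b, c, n) for a, b, c in points)


# Illustrative data only: two dots on one vertex, one on each of the
# other vertices, one at the centroid, and one near the middle of a leg.
dots = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (45, 45, 10)]
counts = tribin_counts(dots, n=4)
```

Mapping each count to a color would then give the triangular heat map. Because the dots are only counted, a cluster sitting exactly on a vertex stays in its corner bin, which is the behavior that the smoothed contours struggle to reproduce.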
Here are examples from the website for ggtern, the R package written by Nicholas Hamilton for ternary plots and widely used in SenseMaker projects; the latest version (2.2.2, partly supported by QED Insight) now includes hexagonal and triangular bins:
This implementation has several noteworthy features:
Nicholas eloquently captured the value of these additions for SenseMaker projects (and others):
There are some subtle differences which give some added functionality, and together these will provide an additional level of richness to ternary diagrams produced with ggtern, when the data-set is perhaps significantly large and points themselves start to lose their meaning from visual clutter.
Here are two examples from one of Laurie’s recent projects, comparing point-count clusters in the left-hand frame and triangular bins in the right-hand one:
The number of story dots is small enough — a few hundred points in each — that they could be included in these tribins without an undue amount of overprinting. Hence, it is easy to see that the accuracy in illustrating where the data are concentrated is unimpeachable. On the other hand, in a data-rich project, the accuracy should still be exact, but the perceived precision will be affected by the granularity of the bin width along the legs. (In the extreme, for only a single bin — the triad itself! — the accuracy is guaranteed to be 100%, but the precision is minimal if the data are not displayed or cannot be resolved.)
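The trade-off between bin width and perceived precision can be put in rough numbers. The symbols here are assumptions introduced for illustration: $N$ is the number of story dots, $n$ is the number of equal divisions along each leg, and $\bar{m}$ is the mean number of dots per bin.

```latex
\[
  \text{number of tribins} = n^{2},
  \qquad
  \text{bin leg length} = \frac{1}{n} \text{ of a triad leg},
  \qquad
  \bar{m} = \frac{N}{n^{2}}
\]
% Hypothetical example with N = 300 story dots:
%   n = 5   gives  25 bins and a mean of 12 dots per bin
%   n = 10  gives 100 bins and a mean of  3 dots per bin
% The extreme case n = 1 is the single bin, the triad itself.
```

Doubling $n$ halves the bin width but leaves each bin with only a quarter as many dots on average, so the finest useful grid depends on how many stories were collected.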
The choice of tribins vs. hexbins is largely stylistic and aesthetic. Looking at the blue-scaled examples above, I find the hexagonal grid more visually appealing than the triangular one. But in practice I would favor tribins for three minor reasons:
A series on common statistics and uncommon ideas in ternary plots
The Essentials – useful now
VIIIa: Smoothing, or Where the Data Are not Concentrated
VIIIb: Binning, or Where the Data Are actually Concentrated
IX. How Much the Data Are Concentrated [coming soon]
The Accessories – useful someday… maybe
V. Closure and Causal Structure