Note: If you landed here by searching on “datasaurus” (± “dozen”) and have no idea what SenseMaker is, you can jump to the graphical results.
In an earlier post in this series, Part III: Random Data, I showed an example of a SenseMaker triad, with data clustered near vertices, along edges, and in the center; most participants used one of those seven locations to signify their stories, weighted toward one, two, or all three corners, respectively. I also showed a ternary with 500 random points. Here they are side-by-side:
My plan was to write another post about how an analyst or subject-matter expert might deal with such a “spectrum,” ranging between one end-member with well-defined, highly aggregated data and another with random-looking, highly scattered data. Surely those two poles would cover the triadic universe, right? That plan was sidetracked, however, when I recognized the possibility of aberrant cases, ones that are unlikely to arise with story data, but which might nonetheless provide some insight. So this post is about how to derive those cases and about the esoteric lessons therein.
The trigger for this change of plans was ‘The Datasaurus Dozen’ and its inspiration, ‘Datasaurus’, which I only learned of belatedly. The latter was introduced a year ago by Alberto Cairo (U. of Miami) in this tweet:
Don’t trust summary statistics. Always visualize your data first https://t.co/63RxirsTuY pic.twitter.com/5j94Dw9UAf
— Alberto Cairo (@albertocairo)
The eternal point is clear and succinct: Don’t just calculate, plot!
Clarity and brevity notwithstanding, Justin Matejka and George Fitzmaurice of Autodesk Research decided to reinforce the message. They created an additional twelve (x,y) datasets, The Datasaurus Dozen, with the same summary statistics – arithmetic means, standard deviations, and correlation coefficient, all identical to two decimal places – but visually distinct graphical patterns. There are horizontal, vertical, and diagonal parallel lines; fuzzier horizontal and vertical swaths; a grid and a blob of points; a big “X”; a five-pointed star; and single and double circles. There were no other life forms though, either extant or extinct.
If you’re more of a viewer than reader, you can watch this video and then scroll down to the next heading without missing any essentials. Or you can read on, ± watching.
This project is described in a research news article, Same Stats, Different Graphs. Not surprisingly for the Autodesk site, there are excellent graphics, including several animated gifs. Matejka and Fitzmaurice acknowledged and augmented the ancestor to all such constructs, Anscombe’s quartet, four datasets published in 1973 by Yale statistician F.J. Anscombe, who exhorted his contemporaries with the same message: Don’t just calculate, plot!
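As a small illustration of the "same stats" idea, the sketch below computes the summary statistics named above for the first two of Anscombe's four datasets. The data values are as commonly tabulated from Anscombe (1973); the helper name `summary` is introduced here for illustration only and is not from the article or the paper.

```python
import math

def summary(xs, ys):
    """Means, sample standard deviations, and Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": math.sqrt(sxx / (n - 1)),
        "sd_y": math.sqrt(syy / (n - 1)),
        "r": sxy / math.sqrt(sxx * syy),
    }

# Anscombe's datasets I and II share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

stats_1 = summary(x, y1)
stats_2 = summary(x, y2)

# Print the two sets of statistics side by side, to two decimal places.
for key in stats_1:
    print(f"{key}: {stats_1[key]:.2f} vs {stats_2[key]:.2f}")
```

The numbers agree to two decimal places, yet a scatter plot of dataset I looks roughly linear while dataset II traces a smooth curve, which is the whole point of plotting first.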
There are also links in the news article to the technical paper that Matejka and Fitzmaurice presented at an ACM conference, under the same title, in which they detailed the construction of The Datasaurus Dozen datasets. Here is their description of the novelty in their approach:
The key insight behind our approach is that while generating a dataset from scratch to have particular statistical properties is relatively difficult, it is relatively easy to take an existing dataset, modify it slightly, and maintain (nearly) the same statistical properties. With repetition, this process creates a dataset with a different visual appearance from the original, while maintaining the same statistical properties. Further, if the modifications to the dataset are biased to move the points towards a particular goal, the resulting graph can be directed towards a particular visual appearance.
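The idea in that passage can be sketched in a few lines. What follows is a deliberately simplified, greedy version and not the authors' code: their paper describes a more elaborate scheme, whereas this sketch only accepts a random nudge when it moves a point no farther from the target shape and leaves all five statistics unchanged to two decimal places. The names `summary`, `rounded`, `dist_to_circle`, and `perturb`, the circular target, and the random seed points are all assumptions made for illustration.

```python
import math
import random

def summary(points):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r) for a list of (x, y)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return (mx, my, math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1)),
            sxy / math.sqrt(sxx * syy))

def rounded(stats, places=2):
    """Round every statistic to the given number of decimal places."""
    return tuple(round(s, places) for s in stats)

def dist_to_circle(p, cx=50.0, cy=50.0, r=30.0):
    """Distance from a point to a hypothetical target circle."""
    return abs(math.hypot(p[0] - cx, p[1] - cy) - r)

def perturb(points, target_dist, steps=5000, scale=0.1, seed=0):
    """Nudge one point at a time toward the target, keeping rounded stats fixed."""
    rng = random.Random(seed)
    pts = list(points)
    goal = rounded(summary(pts))
    for _ in range(steps):
        i = rng.randrange(len(pts))
        old = pts[i]
        new = (old[0] + rng.gauss(0, scale), old[1] + rng.gauss(0, scale))
        if target_dist(new) > target_dist(old):
            continue  # the move is biased: never step away from the target
        pts[i] = new
        if rounded(summary(pts)) != goal:
            pts[i] = old  # undo any move that changes the rounded statistics
    return pts

# Hypothetical seed data: 60 uniformly scattered points.
rng = random.Random(1)
seed_points = [(rng.uniform(20, 80), rng.uniform(20, 80)) for _ in range(60)]
shaped = perturb(seed_points, dist_to_circle)
```

A purely greedy rule like this tends to stall once the statistics drift close to a rounding boundary, so it should be read as a toy version of the approach rather than a way to reproduce The Dozen.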
In the most prominent of their results (below), Datasaurus is the seed dataset from which The Dozen were created:
My immediate question when I saw these patterns was, what do they look like mapped into a triad? Here is the immediate answer for The Dozen (arrayed identically to the preceding figure; clickable for a much larger image):
And here is a side-by-side comparison of the rectilinear and ternary versions of Datasaurus:
The dotted green and yellow lines connect six equivalent points and clarify that the ternary Datasaurus is facing right; that is, the vertical join between the two plots is a mirror plane of sorts. Details of the reason for a “re-scaled” plot and the construction of the triad from it are given in the Appendix.
Mind you, I would never expect to see story dots forming even a star or a circle, let alone a dinosaur. When some other parameter, such as time or reward structure, is an independent (controlled) variable in a story-collection process, however, then unusual data structures may provide guidance in interpretation and, perhaps surprisingly, in prediction. But first a brief recap of the back-and-forth of the relevant transformations.
Part II: Log-Ratio Transformation in this series gave a foretaste of the relevance of looking at SenseMaker data in both ternary and rectilinear coordinates. As that post discussed, the fact that we can present summary statistics for a triad — to say nothing of more advanced metrics like kernel density estimates (see Part VIIIa) — is due to the methodology created by statistician John Aitchison (see References in Part II). In a nutshell, constant-sum ternary coordinates are transformed to open-ended (x,y) coordinates in a log-ratio space where standard statistical calculations can be performed reliably; and the results are then inverse-transformed back to the ternary. Part IV: Confidence Regions is an outcome of just such a procedure.
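The transform, calculate, back-transform cycle can be sketched as follows. This is a minimal example that assumes one common choice of basis for the isometric (ilr) transformation discussed in the next paragraph; the basis and sign conventions used elsewhere in this series may differ, and the three sample points are made up. All three parts must be strictly positive, since the logarithm is undefined at zero.

```python
import math

SQRT2, SQRT6 = math.sqrt(2.0), math.sqrt(6.0)

def ilr(p):
    """Map a ternary point (a, b, c), all parts > 0, to (x, y)."""
    a, b, c = p
    return (math.log(a / b) / SQRT2, math.log(a * b / (c * c)) / SQRT6)

def ilr_inv(z):
    """Map (x, y) back to a ternary point whose parts sum to 1."""
    x, y = z
    clr = (x / SQRT2 + y / SQRT6, -x / SQRT2 + y / SQRT6, -2.0 * y / SQRT6)
    weights = [math.exp(v) for v in clr]
    total = sum(weights)
    return tuple(w / total for w in weights)

def ternary_mean(comps):
    """Average in log-ratio space, then return to the triad."""
    coords = [ilr(p) for p in comps]
    mx = sum(c[0] for c in coords) / len(coords)
    my = sum(c[1] for c in coords) / len(coords)
    return ilr_inv((mx, my))

# Hypothetical signified stories as (top, left, right) proportions.
stories = [(0.70, 0.20, 0.10), (0.10, 0.80, 0.10), (0.30, 0.30, 0.40)]
center = ternary_mean(stories)
print(center)
```

The result is the closed geometric mean of the parts, which generally differs from the arithmetic average of the raw ternary coordinates.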
There are three log-ratio transformations in common usage, the additive (alr), centered (clr), and isometric (ilr). The first two were developed by Aitchison and the third by Vera Pawlowsky-Glahn and her collaborators (see References and Additional Readings in Part II). There is a clear and equation-free, though still highly mathematical, discussion of the pros and cons of each in the introduction of Egozcue et al. (2003), in which they first introduced the ilr transformation:
[It] is called isometric because it allows us to associate angles and distances in the [triad] to angles and distances in [the transformed rectilinear plot], where we feel more comfortable from an intuitive point of view. This is of particular interest with respect to concepts of orthogonality.
Here is an example from their paper of parallel (solid) and orthogonal (dashed) lines transformed between the two coordinate systems:
Note that The Datasaurus Dozen pair labelled “x_shape” (3rd row, 3rd column in each panel, above) shows exactly this behavior, but for two intersecting lines [1], rather than two parallel ones.
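The "isometric" property quoted above can be checked numerically. The sketch below compares the distance between two ternary points measured in the triad (the Aitchison distance, computed here from clr coordinates) with the ordinary straight-line distance between their ilr and alr images. The ilr basis is one common choice and is an assumption, as are the two sample points.

```python
import math

SQRT2, SQRT6 = math.sqrt(2.0), math.sqrt(6.0)

def clr(p):
    """Centered log-ratio: logs minus their mean."""
    logs = [math.log(v) for v in p]
    g = sum(logs) / len(logs)
    return [v - g for v in logs]

def ilr(p):
    """Isometric log-ratio for three parts (one common basis)."""
    a, b, c = p
    return (math.log(a / b) / SQRT2, math.log(a * b / (c * c)) / SQRT6)

def alr(p):
    """Additive log-ratio with the third part as the divisor."""
    a, b, c = p
    return (math.log(a / c), math.log(b / c))

def euclid(u, v):
    """Ordinary Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(u, v)))

p = (0.7, 0.2, 0.1)
q = (0.1, 0.8, 0.1)

d_simplex = euclid(clr(p), clr(q))  # Aitchison distance in the triad
d_ilr = euclid(ilr(p), ilr(q))      # distance in the ilr plot
d_alr = euclid(alr(p), alr(q))      # distance in the alr plot

print(d_simplex, d_ilr, d_alr)
```

The first two distances agree to floating-point precision, while the alr distance does not, which is why angles and orthogonality read off an ilr plot carry over to the triad.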
Egozcue et al. refer to the solid lines as “compositional processes” and cite bacterial growth, radioactive decay, and sedimentary deposition as natural examples of these patterns. In fact, as the names imply, the solid curves/lines are parametric in time, which varies independently along them, even though it does not appear as an explicit variable on any axis or vertex. In the research literature of the respective disciplines — biology, physics, geology — these processes are more likely to be shown in a log-ratio plot than a ternary.
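To see why such processes plot as straight lines in log-ratio space, consider a sketch in which each of three parts grows or decays exponentially at its own rate and only the proportions are recorded. The starting proportions and rates below are hypothetical, and the ilr basis is one common choice rather than necessarily the one used in the paper's figure.

```python
import math

SQRT2, SQRT6 = math.sqrt(2.0), math.sqrt(6.0)

def ilr(p):
    """Isometric log-ratio for three strictly positive parts."""
    a, b, c = p
    return (math.log(a / b) / SQRT2, math.log(a * b / (c * c)) / SQRT6)

def closed(parts):
    """Rescale the parts so that they sum to 1."""
    total = sum(parts)
    return tuple(v / total for v in parts)

def growth_path(start, rates, times):
    """Proportions over time when each part changes exponentially."""
    return [closed([s * math.exp(r * t) for s, r in zip(start, rates)])
            for t in times]

start = (0.2, 0.3, 0.5)    # hypothetical initial proportions
rates = (0.9, 0.1, -0.4)   # hypothetical growth (+) and decay (-) rates
times = [0.0, 0.5, 1.0, 1.5, 2.0]

path = growth_path(start, rates, times)
coords = [ilr(p) for p in path]

# Displacement between consecutive time steps in the (x, y) plot.
steps = [(x1 - x0, y1 - y0)
         for (x0, y0), (x1, y1) in zip(coords, coords[1:])]
print(steps)
```

Equal time intervals give equal displacements, so the points march along a straight line at constant speed in the rectilinear plot, whereas the same path in the triad is curved and time appears only as position along it.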
On the other hand, here are three examples of parameterization in which a ternary is the more common visualization choice because the emphasis is on multi-component compositional change (rather than increase/decrease of a single “species” as in Egozcue et al.’s examples):
There are at least two categories of SenseMaker projects that could show parameterized data. Firstly, punctuated or continuous capture of stories is inherently time-parametric. There would be no guarantee that signified results would actually change over time, but looking for such change is a primary motivation for the approach. It is also a means of testing whether adjustments of extrinsic factors, including safe-to-fail experiments, had detectable effects.
Secondly, the instrument itself might be parameterized as a means of testing some aspect of the methodology. Imagine a large, homogeneous population of respondents, subsets of whom were presented with different versions of a prompting question, yet attached to the same labelled triad. If those versions were designed to fall along some “spectrum,” then it would be interesting to see if there was a corresponding array of “compositions.”
As the first figure in this post reminds us, story data in a real SenseMaker triad are likely to be very blobby. Rarely will there be precise patterns that would warm the heart of a mathematician. The potential for parameterization — in time, reward structure, or some other study variable, or in the instrument itself — indicates, however, that the data could act as a directional pointer. Consideration of both ternary and rectilinear patterns, as in the pair of graphs immediately above, could suggest optimal pathways [2] along which respondents might move (or be moved) due to future interventions. Given the nature of complex problems and the inherent messiness of human nature, this would be, at best, a second-order effect. It would imply nonetheless a minor addendum to the continuing lesson of Datasaurus and The Datasaurus Dozen, offered here in the spirit of xkcd [3]:
Imagine that the data in a rectilinear plot, say, Datasaurus, are the result of a log-ratio transformation from some unseen ternary plot. The latter can be recovered by applying an inverse transformation to the (x,y) data, which I did, subject to the following qualifications and comments:
Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barcelo-Vidal, C. (2003) Isometric logratio transformation for compositional data analysis. Mathematical Geology, v. 35, p. 279-300.
Martín-Fernández, J.A., Olea-Meneses, R.A., and Pawlowsky-Glahn, V. (2001) Criteria to compare estimation methods of regionalized compositions. Mathematical Geology, v. 33, p. 889-909.
Pawlowsky, V. (1989) Cokriging of regionalized compositions. Mathematical Geology, v. 21, p. 513-521.