Statistics in the Triad, Part VII: Mapping The Datasaurus Dozen

Note: If you landed here by searching on “datasaurus” (± “dozen”) and have no idea what SenseMaker is, you can jump to the graphical results.

In an earlier post in this series, Part III: Random Data, I showed an example of a SenseMaker triad, with data clustered near vertices, along edges, and in the center; most participants used one of those seven locations to signify their stories, weighted toward one, two, or all three corners, respectively. I also showed a ternary with 500 random points. Here they are side-by-side:

Proj Rand combo

My plan was to write another post about how an analyst or subject-matter expert might deal with such a “spectrum,” ranging between one end-member with well-defined, highly-aggregated data and another with random-looking, highly-scattered data. Surely those two poles would cover the triadic universe, right? That plan was sidetracked, however, when I recognized the possibility of aberrant cases, ones that are unlikely to arise with story data, but which might nonetheless provide some insight. So this post is about how to derive those cases and about the esoteric lessons therein.

Datasaurus…

The trigger for this change of plans was ‘The Datasaurus Dozen’ and its inspiration, ‘Datasaurus’, which I learned of only belatedly. The latter was introduced by Alberto Cairo (Knight Chair in Visual Journalism, University of Miami) in a now-deleted tweet from 2016: “Don’t trust summary statistics. Always visualize your data first.”

The eternal point is clear and succinct: Don’t just calculate, plot!

…leads to The Datasaurus Dozen

Clarity and brevity notwithstanding, Justin Matejka and George Fitzmaurice of Autodesk Research decided to reinforce the message. They created an additional twelve (x,y) datasets, The Datasaurus Dozen, with the same summary statistics – arithmetic means, standard deviations, and correlation coefficient, all identical to two decimal places – but visually distinct graphical patterns. There are horizontal, vertical, and diagonal parallel lines; fuzzier horizontal and vertical swaths; a grid and a blob of points; a big “X”; a five-pointed star; and single and double circles. There were no other life forms though, either extant or extinct.

If you’re more of a viewer than reader, you can watch this video and then scroll down to the next heading without missing any essentials. Or you can read on, ± watching.

This project is described in a research news article, Same Stats, Different Graphs. Not surprisingly for the Autodesk site, there are excellent graphics, including several animated gifs. Matejka and Fitzmaurice acknowledged and augmented the ancestor to all such constructs, Anscombe’s quartet, four datasets published in 1973 by Yale statistician F.J. Anscombe, who exhorted his contemporaries with the same message: Don’t just calculate, plot!

There are also links in the news article to the technical paper that Matejka and Fitzmaurice presented at an ACM conference, under the same title, in which they detailed the construction of The Datasaurus Dozen datasets. Here is their description of the novelty in their approach:

The key insight behind our approach is that while generating a dataset from scratch to have particular statistical properties is relatively difficult, it is relatively easy to take an existing dataset, modify it slightly, and maintain (nearly) the same statistical properties. With repetition, this process creates a dataset with a different visual appearance from the original, while maintaining the same statistical properties. Further, if the modifications to the dataset are biased to move the points towards a particular goal, the resulting graph can be directed towards a particular visual appearance.
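The idea in that passage can be sketched in a few lines of Python. This is not Matejka and Fitzmaurice's code: it is a simplified, greedy illustration that assumes a made-up random seed cloud (not the Datasaurus data), a hypothetical circular target, and statistics compared after rounding to two decimal places. The function names `summary`, `distance_to_circle`, and `perturb` are mine.

```python
import numpy as np

def summary(points):
    """Means, standard deviations, and Pearson correlation, rounded to 2 places."""
    x, y = points[:, 0], points[:, 1]
    stats = [x.mean(), y.mean(), x.std(ddof=1), y.std(ddof=1),
             np.corrcoef(x, y)[0, 1]]
    return tuple(np.round(stats, 2))

def distance_to_circle(point, center=(50.0, 50.0), radius=30.0):
    """Distance from one point to a hypothetical target circle."""
    return abs(np.hypot(point[0] - center[0], point[1] - center[1]) - radius)

def perturb(points, steps=5000, scale=0.1, seed=0):
    """Nudge one point at a time; keep a nudge only if it moves that point
    toward the target AND leaves the rounded summary statistics unchanged."""
    rng = np.random.default_rng(seed)
    current = points.copy()
    target_stats = summary(points)
    for _ in range(steps):
        i = rng.integers(len(current))
        trial = current.copy()
        trial[i] += rng.normal(0.0, scale, size=2)
        closer = distance_to_circle(trial[i]) < distance_to_circle(current[i])
        if closer and summary(trial) == target_stats:
            current = trial
    return current

# Made-up seed cloud standing in for an existing dataset.
seed_points = np.random.default_rng(1).uniform(20.0, 80.0, size=(100, 2))
result = perturb(seed_points)
print(summary(seed_points) == summary(result))
```

A greedy rule like this one never accepts a move away from the target, so it is only a rough stand-in for the biased, repeated modification the authors describe.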

In the most prominent of their results (below), Datasaurus is the seed dataset from which The Dozen were created:

AllDinosGrey 1

The graphical results

My immediate question when I saw these patterns was, what do they look like mapped into a triad? Here is the immediate answer for The Dozen (arrayed identically to the preceding figure; clickable for a much larger image):

And here is a side-by-side comparison of the rectilinear and ternary versions of Datasaurus:

Dino grid+tern

The dotted green and yellow lines connect six equivalent points and clarify that the ternary Datasaurus is facing right; that is, the vertical join between the two plots is a mirror plane of sorts. Details of the reason for a “re-scaled” plot and the construction of the triad from it are given in the Appendix.

So what (is the implication for a SenseMaker project)?

Mind you, I would never expect to see story dots forming even a star or a circle, let alone a dinosaur. When some other parameter, such as time or reward structure, is an independent (controlled) variable in a story-collection process, however, then unusual data structures may provide guidance in interpretation and, perhaps surprisingly, in prediction. But first a brief recap of the back-and-forth of the relevant transformations.

Part II: Log-Ratio Transformation in this series gave a foretaste of the relevance of looking at SenseMaker data in both ternary and rectilinear coordinates. As that post discussed, the fact that we can present summary statistics for a triad — to say nothing of more advanced metrics like kernel density estimates (see Part VIIIa) — is due to the methodology created by statistician John Aitchison (see References in Part II). In a nutshell, constant-sum ternary coordinates are transformed to open-ended (x,y) coordinates in a log-ratio space where standard statistical calculations can be performed reliably; and the results are then inverse-transformed back to the ternary. Part IV: Confidence Regions is an outcome of just such a procedure.
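As a minimal sketch of that round trip, the Python below applies an alr transformation, averages in log-ratio space, and inverse-transforms the result. The choice of the third component as divisor, the function names, and the four sample "stories" are all assumptions made for illustration; they are not taken from Part II.

```python
import numpy as np

def closure(parts):
    """Rescale each row so its components sum to 1."""
    parts = np.asarray(parts, dtype=float)
    return parts / parts.sum(axis=-1, keepdims=True)

def alr(comp):
    """Additive log-ratio: log of the first two parts over the third (assumed divisor)."""
    comp = closure(comp)
    return np.log(comp[..., :2] / comp[..., 2:3])

def alr_inverse(xy):
    """Map open-ended (x, y) log-ratios back to a closed three-part composition."""
    xy = np.asarray(xy, dtype=float)
    ones = np.ones(xy.shape[:-1] + (1,))
    return closure(np.concatenate([np.exp(xy), ones], axis=-1))

# Hypothetical signifier positions, in percent of each vertex.
stories = np.array([[70, 20, 10], [10, 60, 30], [25, 25, 50], [40, 40, 20]])

# Arithmetic mean in log-ratio space, carried back to the triad...
center = alr_inverse(alr(stories).mean(axis=0))

# ...coincides with the closed geometric mean of the original compositions.
geometric_mean = closure(np.exp(np.log(closure(stories)).mean(axis=0)))
print(center, geometric_mean)
```

The two printed vectors should agree, which is the sense in which an arithmetic mean in the rectilinear plot corresponds to a geometric mean in the triad.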

There are three log-ratio transformations in common usage: the additive (alr), centered (clr), and isometric (ilr). The first two were developed by Aitchison and the third by Vera Pawlowsky-Glahn and her collaborators (see References and Additional Readings in Part II). There is a clear and equation-free, though still highly mathematical, discussion of the pros and cons of each in the introduction of Egozcue et al. (2003), in which they first introduced the ilr transformation:

[It] is called isometric because it allows us to associate angles and distances in the [triad] to angles and distances in [the transformed rectilinear plot], where we feel more comfortable from an intuitive point of view. This is of particular interest with respect to concepts of orthogonality.

Here is an example from their paper of parallel (solid) and orthogonal (dashed) lines transformed between the two coordinate systems:

Egozcue parallels

Note that The Datasaurus Dozen pair labelled “x_shape” (3rd row, 3rd column in each panel, above) shows exactly this behavior, but for two intersecting lines [1], rather than two parallel ones.

Egozcue et al. refer to the solid lines as “compositional processes” and cite bacterial growth, radioactive decay, and sedimentary deposition as natural examples of these patterns. In fact, as the names imply, the solid curves/lines are parametric in time, which varies independently along them, even though it does not appear as an explicit variable on any axis or vertex. In the research literature of the respective disciplines — biology, physics, geology — these processes are more likely to be shown in a log-ratio plot than a ternary.
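To make the "straight line in log-ratio space, curve in the triad" idea concrete, here is a small Python sketch of an ilr transformation and its inverse. The orthonormal basis below is one common choice and is an assumption on my part, as are the function names and the example line; Egozcue et al. allow any orthonormal basis.

```python
import numpy as np

# Assumed orthonormal basis for the plane of centered log-ratios (rows sum to 0).
BASIS = np.array([[1.0, -1.0, 0.0], [1.0, 1.0, -2.0]]) / np.sqrt([[2.0], [6.0]])

def closure(parts):
    """Rescale each row so its components sum to 1."""
    parts = np.asarray(parts, dtype=float)
    return parts / parts.sum(axis=-1, keepdims=True)

def clr(comp):
    """Centered log-ratio: logs minus their row mean."""
    logs = np.log(closure(comp))
    return logs - logs.mean(axis=-1, keepdims=True)

def ilr(comp):
    """Isometric log-ratio coordinates with respect to BASIS."""
    return clr(comp) @ BASIS.T

def ilr_inverse(z):
    """Back from (x, y) ilr coordinates to a three-part composition."""
    return closure(np.exp(np.asarray(z, dtype=float) @ BASIS))

def aitchison_distance(p, q):
    """Distance between two compositions, measured in clr space."""
    return np.linalg.norm(clr(p) - clr(q))

# A straight line in ilr space, parameterized by t (think "time").
t = np.linspace(-3.0, 3.0, 7)
line = np.array([0.5, -0.2]) + t[:, None] * np.array([1.0, 0.5])

# The same process seen in the triad: a curved path of compositions.
path = ilr_inverse(line)
print(np.round(path, 3))
```

Because the transformation is isometric, the Aitchison distance between the two ends of `path` equals the ordinary straight-line distance between the two ends of `line`.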

On the other hand, here are three examples of parameterization in which a ternary is the more common visualization choice because the emphasis is on multi-component compositional change (rather than increase/decrease of a single “species” as in Egozcue et al.’s examples):

parametric in time (t): archeology, e.g., changing historical composition of tools or pottery shards as the “technology” of an era evolves or source areas and trade routes come in and out of favor;
parametric in temperature (T): metallurgy, e.g., changing equilibrium composition of alloys or solid solutions as temperature decreases; and
parametric in both t & T: geology, e.g., changing lava/magma composition during flow differentiation or fractional crystallization (see the example from Aitchison in Part II of this series, although it used the alr transformation, not the ilr).

There are at least two categories of SenseMaker projects that could show parameterized data. Firstly, punctuated or continuous capture of stories is inherently time-parametric. There would be no guarantee that signified results would actually change over time, but looking for such change is a primary motivation for the approach. It is also a means of testing whether adjustments of extrinsic factors, including safe-to-fail experiments, had detectable effects.

Secondly, the instrument itself might be parameterized as a means of testing some aspect of the methodology. Imagine a large, homogeneous population of respondents, subsets of whom were presented with different versions of a prompting question, yet attached to the same labelled triad. If those versions were designed to fall along some “spectrum,” then it would be interesting to see if there was a corresponding array of “compositions.”

As the first figure in this post reminds us, story data in a real SenseMaker triad are likely to be very blobby. Rarely will there be precise patterns that would warm the heart of a mathematician. The potential for parameterization — in time, reward structure, or some other study variable, or in the instrument itself — indicates, however, that the data could act as a directional pointer. Consideration of both ternary and rectilinear patterns, as in the pair of graphs immediately above, could suggest optimal pathways [2] along which respondents might move (or be moved) due to future interventions. Given the nature of complex problems and the inherent messiness of human nature, this would be, at best, a second-order effect. It would imply nonetheless a minor addendum to the continuing lesson of Datasaurus and The Datasaurus Dozen, offered here in the spirit of xkcd [3]:

Don’t just calculate, plot… and then connect!

Xkcd rexthor

Appendix: Transforming Datasaurus and The Datasaurus Dozen

Imagine that the data in a rectilinear plot, say, Datasaurus, are the result of a log-ratio transformation from some unseen ternary plot. The latter can be recovered by applying an inverse transformation to the (x,y) data, which I did subject to the following qualifications and comments:

The simplest approach was to use the alr transformation because it directly yields an increase from D-1 to D components (2 to 3 in this case). Additionally, my motivation was only to recover the general pattern of points, not to preserve metrical distances.
The axes on the original plots were scaled from 0 to 100. That range is unrealistic for the log-ratio data that the Aitchison methodology anticipates. For alr, the data transform from a closed triad with non-zero, constant-sum (100%) coordinates to two independent, open-ended variables, which in practice both fall in the range -5 to +5. Consequently, to simulate the log-ratio ranges, I re-scaled the original data in Matejka and Fitzmaurice’s CSV file so that the new results satisfied (x’,y’) = (0.1*x - 5, 0.1*y - 5). Here is a side-by-side comparison of Datasaurus from their file and from my re-scaled data, showing that they are essentially identical; the orange data point is the arithmetic mean, which transforms to the geometric mean in a triad (see Part I: Geometric Mean):

As I mentioned in footnote [1], the original graphs had differing metrics on the two axes, resulting in rectangular rather than square gridlines. I matched the two metrics for the original “dino” image (above), which I replicated from the CSV file, so that there could be no doubt of fidelity between original and re-scaled images. I did the same for all of The Datasaurus Dozen as well, prior to the inverse transformation. The equal-metric originals from the CSV file and my re-scaled images are as identical as the Datasaurus pair (above). Here is the array for The Dozen, re-scaled (clickable for a much larger image):

Comparing this array (just above) to the original one (and making allowance for the “flattening” of the latter) shows that there are small but noticeable differences in the patterns. Note especially “slant_down” (Row1,Col2), “slant_up” (R1,C3), “wide_lines” (R2,C1), and “high_lines” (R2,C2). In the first pair, the number of well-defined lines differs between the two sets; and in the second pair, the dispersion of each swath of points differs substantially. These differences could not have resulted from the re-scaling (see previous bullet). My guess is that the data in Figure 2 of their research news article and in the downloadable CSV file are simply from different simulation runs with the same summary statistics. (In fact, Justin Matejka confirmed my guess in an email exchange on August 21.)
Finally, I converted the re-scaled (x’,y’) data to triad coordinates by the inverse alr transformation [4], using the form equivalent to eqs. (4)-(6) in Part II. The image array is shown above at Graphical Results.
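The steps in this Appendix can be sketched in Python as follows. The re-scaling matches the formula given above; the divisor (third component), the vertex layout used for plotting, the function names, and the three sample points are assumptions for illustration and may differ from eqs. (4)-(6) in Part II.

```python
import numpy as np

def rescale(xy):
    """Map the original 0..100 axes onto the -5..+5 log-ratio range."""
    return 0.1 * np.asarray(xy, dtype=float) - 5.0

def alr_inverse(xy):
    """Inverse alr with the third component as the assumed divisor."""
    xy = np.asarray(xy, dtype=float)
    ones = np.ones(xy.shape[:-1] + (1,))
    parts = np.concatenate([np.exp(xy), ones], axis=-1)
    return parts / parts.sum(axis=-1, keepdims=True)

def ternary_to_plot(comp):
    """Plot position in a unit-sided equilateral triangle.

    Assumed layout: first part at (0, 0), second at (1, 0),
    third at (0.5, sqrt(3)/2).
    """
    comp = np.asarray(comp, dtype=float)
    px = comp[..., 1] + 0.5 * comp[..., 2]
    py = (np.sqrt(3.0) / 2.0) * comp[..., 2]
    return np.stack([px, py], axis=-1)

# Three made-up points on the original 0..100 axes (not from the CSV file).
original = np.array([[0.0, 0.0], [50.0, 50.0], [100.0, 100.0]])

scaled = rescale(original)        # (x', y') in log-ratio units
triad = alr_inverse(scaled)       # three-part compositions, each summing to 1
plotted = ternary_to_plot(triad)  # where each point lands in the triangle
print(scaled, triad, plotted, sep="\n")
```

With this layout the middle of the original plot, (50, 50), re-scales to (0, 0) and lands at the center of the triangle, where all three parts are equal.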

References

Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barcelo-Vidal, C. (2003) Isometric logratio transformation for compositional data analysis. Mathematical Geology, v. 35, p. 279-300.

Martín-Fernández, J.A., Olea-Meneses, R.A., and Pawlowsky-Glahn, V. (2001) Criteria to compare estimation methods of regionalized compositions. Mathematical Geology, v. 33, p. 889-909.

Pawlowsky, V. (1989) Cokriging of regionalized compositions. Mathematical Geology, v. 21, p. 513-521.

Footnotes

[1] Datasaurus and The Dozen were presented in graphs with differing (x,y) metrics, resulting in rectangular rather than square gridlines. One result is that the circle and bullseye appear elliptical rather than, well, circular. With the benefit of hindsight from Egozcue et al.’s discussion, however, the ternary for “x_shape” is consistent with the two lines being non-perpendicular; this is also confirmed by the re-scaled plots discussed in the Appendix.
[2] “Optimal pathways” is my shorthand for Egozcue et al.’s discussion of Hilbert space and geodesics and Aitchison distances. Suffice it to say that the graphical patterns of data might point to the “shortest” way to get people to change their stories and move their signifiers.
[3] I am grateful to Randall Munroe for rendering the body and head of Rexthor and his dog as triads (albeit non-equilateral ones). Now if only the AKC would recognize the Directional Pointer.
[4] Applying the alr transformation to data in a triad requires that one of the three variables be chosen as the divisor of the other two in calculating the log ratios. Not surprisingly, this well-known asymmetry can lead to different results in the log-ratio graph for each divisor. If you’re more comfortable looking at your data in a rectilinear plot than in a ternary, this is probably pretty unsettling. But if you stick with it, something surprising emerges: any computational results that are inverse-transformed back to the originating triad are the same, regardless of the divisor.

In math-speak, alr inverse-transformed values are “invariant under permutation.” It took Pawlowsky (1989) two-plus pages of moderately dense linear algebra to prove this. Thankfully, she re-stated it in prose a dozen years later (Martín-Fernández et al., 2001): “…one important property of the alr transformation is the independence of results from the selection of the denominator after the [inverse] transformation has been applied.”

I bring this up in the interest of full disclosure. I have indeed inverse-transformed some data, but from a supposed log-ratio plot to a previously non-existent ternary. Said differently, I have done at best only the inverse half of what Pawlowsky discussed, and therefore I can’t say that my half by itself is provably permutation-invariant. What I can say, however, is that some simple algebra with the logarithmic and exponential terms of the two transformations suggests that changing the divisor is simply a matter of arbitrarily choosing how to map the two log-ratio coordinates onto the three triad coordinates. The pattern of data points should remain unchanged, but it will be rotated 120° inside the triangle. In the case of the motivating image for all this, it should still be the same Datasaurus, but now face-down in one corner or lying on its back in the other. Here is the experimental confirmation (original on left, face-down on right):
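The same check can be sketched numerically in Python. The assumptions here are mine: the divisor is moved by a cyclic shift of the three components (a non-cyclic swap would give a mirror image instead), the triangle has unit sides with vertices at (0, 0), (1, 0), and (0.5, √3/2), and the three (x, y) points are made up.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
CENTROID = np.array([0.5, SQRT3 / 6.0])

def alr_inverse(xy, divisor=2):
    """Inverse alr, placing the reference part at index `divisor` by a cyclic shift."""
    xy = np.asarray(xy, dtype=float)
    ones = np.ones(xy.shape[:-1] + (1,))
    parts = np.concatenate([np.exp(xy), ones], axis=-1)  # (e^x, e^y, 1)
    parts = np.roll(parts, divisor + 1, axis=-1)         # move the 1 to `divisor`
    return parts / parts.sum(axis=-1, keepdims=True)

def to_plot_xy(comp):
    """Plot position in the assumed unit-sided equilateral triangle."""
    comp = np.asarray(comp, dtype=float)
    return np.stack([comp[..., 1] + 0.5 * comp[..., 2],
                     (SQRT3 / 2.0) * comp[..., 2]], axis=-1)

def rotate_about_centroid(points, degrees):
    """Counterclockwise rotation about the center of the triangle."""
    theta = np.radians(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta), np.cos(theta)]])
    return (points - CENTROID) @ rotation.T + CENTROID

# Made-up log-ratio points standing in for the re-scaled data.
xy = np.array([[-1.0, 0.5], [0.3, 2.0], [1.5, -0.7]])

original = to_plot_xy(alr_inverse(xy, divisor=2))
shifted = to_plot_xy(alr_inverse(xy, divisor=0))

# Changing the divisor reproduces the original pattern, turned by 120 degrees.
print(np.allclose(shifted, rotate_about_centroid(original, 120)))
```

Choosing `divisor=1` in this sketch corresponds to a 240° turn, the third of the three possible orientations.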