• Statistics in the Triad, Part IV: Confidence Regions

    When we compare two (or more) groups of data on almost any kind of graphical presentation — histogram, box plot, x-y grid, time series, rose diagram, whatever — a near-universal question arises:  Are the groups significantly different?  The familiar answer is given by error bars or confidence intervals in x-y plots or bar charts, assuming an appropriate statistical model, for example a normal distribution, for the data set.

    The answer for groups of data in a ternary plot or triad, however, is not as straightforward.  The reason for this is rooted in the fly-in-the-ointment issue that I covered at the beginning of Part II of this series, the constant-sum or closure constraint. Whether the coordinate triplets (x, y, z) for each point are extracted from an overall composition and normalized to 100% before plotting, or whether (as in a SenseMaker project) they are constrained to 100% at the moment of data collection, for example, by a finger touching a tablet screen, two restrictions now apply.  Firstly, all the data are non-negative and also capped at 1.0 (100%), so any statistical method that assumes non-bounded data (plausibly -∞ to +∞) cannot be used. Secondly, only two of the three components are independent, which can lead to spurious negative correlations and other aberrations if the data are not handled carefully.
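
    The second restriction is easy to see numerically. The following is a minimal Python sketch, not drawn from any SenseMaker data set: it generates three independent positive quantities (hypothetical random values), closes each triplet so that it sums to 1, and compares the correlation between the first two parts before and after closure. The helper names `close` and `pearson` are illustrative only.

```python
import random
import statistics


def close(parts):
    """Rescale a list of positive parts so that they sum to 1 (closure)."""
    total = sum(parts)
    return [p / total for p in parts]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: three independent quantities per observation.
rng = random.Random(42)
raw = [[rng.uniform(1.0, 10.0) for _ in range(3)] for _ in range(2000)]

# The same observations after closure to a constant sum of 1.
closed = [close(row) for row in raw]

raw_r = pearson([r[0] for r in raw], [r[1] for r in raw])
closed_r = pearson([c[0] for c in closed], [c[1] for c in closed])

print(f"correlation of parts 1 and 2 before closure: {raw_r:+.3f}")
print(f"correlation of parts 1 and 2 after closure:  {closed_r:+.3f}")
```

    With three independent, identically distributed parts, closure alone pushes the correlation between any two closed parts toward -0.5, even though the raw quantities are uncorrelated. That negative correlation is a product of the constant sum, not of the data.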

    Don’t try this at home… or on your computer!

    As a geologist, I have mixed feelings about my former profession having pioneered the search for graphical portrayals of confidence limits in ternary plots. The only mitigating circumstance is that the method in most widespread use for several decades was introduced before Chayes (1960) issued the first modern warning about the limitations of working with compositional data.  The names of this method — most commonly “hexagonal field of variation” or “error polygon” — are accurate descriptions.

    Here is an extreme, but not-all-that-atypical, example from Weltje (2002, Fig. 16):

    [Figure: Weltje (2002, Fig. 16), panel A (hexagonal fields of variation) and panel B (log-ratio confidence regions)]

    The left-hand image (A) shows two sets of nested hexagons for 90%, 95%, and 99% confidence intervals, the inner set for the population mean (+) and the outer set for the overall population of samples of river sands.  (This kind of diagram was used across a wide range of sub-specialties in the earth sciences. Hydrologists and sedimentary petrologists, who might study such samples, are no more or less guilty than many others.)  The methodology is both simple and specious:

    • calculate the arithmetic mean of the data;
    • calculate the standard deviation (σ) for each of the three individual components (Qt = quartz, Rnc = Rock fragments/“rest” [my shorthand], Rc = Rock fragments/carbonate);
    • plot the pair of parallel lines defining the appropriate variance window (e.g., ±2σ ≈ 95%) for each component; and
    • truncate each line where it intersects those for the other two components at the same confidence level.
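
    The first three steps above can be sketched in a few lines of Python. This is an illustration of the flaw, not a reconstruction of Weltje's figure: the six compositions are hypothetical values invented for the example, and `hexagon_bounds` is an illustrative name. The sketch computes only the ±kσ limits for each component (the parallel lines), not the truncated hexagon itself.

```python
import statistics

# Hypothetical three-part compositions (each row sums to 1),
# clustered near one corner of the triad.
samples = [
    (0.97, 0.02, 0.01),
    (0.90, 0.07, 0.03),
    (0.99, 0.005, 0.005),
    (0.80, 0.15, 0.05),
    (0.95, 0.01, 0.04),
    (0.85, 0.05, 0.10),
]


def hexagon_bounds(rows, k=2.0):
    """Return (mean - k*sd, mean + k*sd) for each component.

    Each component is treated separately, using its arithmetic mean and
    sample standard deviation, exactly as the hexagonal-field recipe does.
    """
    bounds = []
    for component in zip(*rows):
        mean = statistics.fmean(component)
        sd = statistics.stdev(component)
        bounds.append((mean - k * sd, mean + k * sd))
    return bounds


bounds = hexagon_bounds(samples)

# A proportion must lie in [0, 1]; flag any window that does not.
out_of_range = [(lo < 0.0 or hi > 1.0) for lo, hi in bounds]

for (lo, hi), bad in zip(bounds, out_of_range):
    print(f"window: {lo:+.3f} to {hi:+.3f}  outside [0, 1]: {bad}")
```

    For data like these, every window reaches past 0 or 1, which is the numerical counterpart of a hexagon that spills outside the triangle.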

    If you have read the first two posts in this series, the wording of these bullets highlights the problem with this type of plot: it uses the arithmetic mean, rather than the geometric mean (see Part I), and it uses statistical measures that are inappropriate under the closure constraint (see Part II). If you prefer a less abstract, more visual definition of the problem, look at the hexagonal fields themselves.  It should not inspire a lot of confidence (pun intended) when the limits for 90-99% of the data extend beyond what is mathematically allowed (i.e., could fall outside the area of the closed triad)!  And yet this was not only a widely used method, but one whose shortcomings were openly acknowledged in the academic papers that presented the results, with phrases such as “the uncertainties are not statistically rigorous” (see Pawlowsky-Glahn and Barceló-Vidal, 1999).  Absent any other approach, the rule was clearly that desperate scientists call for desperate methods.  Nonetheless, as Weltje (2002) succinctly put it, “hexagonal fields of variation must be regarded as mere graphic constructs.”

    Contrast this with Weltje’s right-hand image (B), which shows the same data for river sands, with closed contours for the same three confidence levels.  Again the inner set is for the population mean (+), but notice that, compared to the mean in image A, it is now slightly shifted to the left of the two nearest data points.  In A, the + was the arithmetic mean; in B, it is the geometric mean.  The outer set of contours is again for the overall population of samples.
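
    The shift between the two means is easy to reproduce. Here is a minimal Python sketch, using hypothetical compositions invented for the example and illustrative function names, that computes the arithmetic mean of each component and the closed geometric mean (the geometric mean of each component, rescaled so the three parts sum to 1).

```python
import math

# Hypothetical three-part compositions (each row sums to 1).
samples = [
    (0.97, 0.02, 0.01),
    (0.90, 0.07, 0.03),
    (0.99, 0.005, 0.005),
    (0.80, 0.15, 0.05),
    (0.95, 0.01, 0.04),
    (0.85, 0.05, 0.10),
]


def close(parts):
    """Rescale positive parts so that they sum to 1 (closure)."""
    total = sum(parts)
    return [p / total for p in parts]


def arithmetic_mean(rows):
    """Component-wise arithmetic mean."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def geometric_center(rows):
    """Component-wise geometric mean, closed to sum to 1.

    Requires every part to be strictly positive, because of the logarithm.
    """
    n = len(rows)
    gm = [math.exp(sum(math.log(v) for v in col) / n) for col in zip(*rows)]
    return close(gm)


arith = arithmetic_mean(samples)
center = geometric_center(samples)

print("arithmetic mean: ", [round(v, 4) for v in arith])
print("geometric center:", [round(v, 4) for v in center])
```

    Both results sum to 1, but they are different points in the triad. Note that the geometric mean is undefined when any part is exactly zero, so zeros have to be dealt with before this step.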

    Weltje (2002) developed this methodology for confidence regions in the triad as a direct extension of Aitchison’s (1986) approach to compositional data.  His historical discussion and examples are entirely geological, but his attention to the rationale is lucid, and the math is relatively accessible (moderate facility with linear algebra required).  Sooner rather than later, I will add an appendix (or another post) to discuss Weltje’s “unit of observation” and how it relates conceptually to the story or narrative fragment, but for now let’s look at some results for SenseMaker projects.

    We did try this at home

    In the 15 years subsequent to Weltje’s paper, the math has been amplified and extended by Pawlowsky-Glahn and colleagues (see the Additional Readings at the end of Part II).  More importantly, within the open-source R community, there are now packages that implement various formulations of Aitchison’s log-ratio transformations for compositional data, including confidence regions and the plotting thereof.  So, having revived our command-line skills, and with considerable assistance from Ashton Drew, we jumped into R.
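
    To give a feel for what such a calculation involves, here is a Python sketch of one textbook construction: transform the data with the additive log-ratio (alr), build a Hotelling T² confidence ellipse for the mean in the two-dimensional log-ratio space, and map the ellipse back into the triad. This is a stand-in for illustration, not the R code behind the plots in this post, and it is not claimed to match the exact procedure of Weltje (2002) or of any particular R package. The data are hypothetical, all function names are illustrative, and the closed-form F quantile used here is valid only because the numerator has 2 degrees of freedom.

```python
import math

# Hypothetical three-part compositions (each row sums to 1).
samples = [
    (0.97, 0.02, 0.01),
    (0.90, 0.07, 0.03),
    (0.99, 0.005, 0.005),
    (0.80, 0.15, 0.05),
    (0.95, 0.01, 0.04),
    (0.85, 0.05, 0.10),
]


def close(parts):
    """Rescale positive parts so that they sum to 1 (closure)."""
    total = sum(parts)
    return [p / total for p in parts]


def alr(comp):
    """Additive log-ratio transform, using the third part as the divisor."""
    x, y, z = comp
    return (math.log(x / z), math.log(y / z))


def alr_inv(u, v):
    """Map a point in alr space back to a closed three-part composition."""
    return close([math.exp(u), math.exp(v), 1.0])


def f_quantile_2(prob, d2):
    """Quantile of the F distribution with (2, d2) degrees of freedom."""
    return (d2 / 2.0) * ((1.0 - prob) ** (-2.0 / d2) - 1.0)


def mean_confidence_region(rows, level=0.95, n_points=72):
    """Return the center and boundary of a T^2 region for the mean."""
    n = len(rows)
    pts = [alr(r) for r in rows]
    mu = sum(p[0] for p in pts) / n
    mv = sum(p[1] for p in pts) / n
    # Sample covariance matrix of the alr coordinates.
    suu = sum((p[0] - mu) ** 2 for p in pts) / (n - 1)
    svv = sum((p[1] - mv) ** 2 for p in pts) / (n - 1)
    suv = sum((p[0] - mu) * (p[1] - mv) for p in pts) / (n - 1)
    # Critical Hotelling T^2 value for 2 dimensions, scaled for the mean.
    t2 = 2.0 * (n - 1) / (n - 2) * f_quantile_2(level, n - 2)
    radius = math.sqrt(t2 / n)
    # Cholesky factor of the covariance matrix: [[a, 0], [b, d]].
    a = math.sqrt(suu)
    b = suv / a
    d = math.sqrt(svv - b * b)
    boundary = []
    for k in range(n_points):
        t = 2.0 * math.pi * k / n_points
        cu, cv = math.cos(t), math.sin(t)
        u = mu + radius * a * cu
        v = mv + radius * (b * cu + d * cv)
        boundary.append(alr_inv(u, v))
    return alr_inv(mu, mv), boundary


center, boundary = mean_confidence_region(samples)
print("center:", [round(v, 4) for v in center])
print("first boundary point:", [round(v, 4) for v in boundary[0]])
```

    Because the ellipse is drawn in log-ratio space and then transformed back, every boundary point is a valid composition: all parts are positive and sum to 1, so the region cannot leave the triad. The center of the region is the closed geometric mean of the data.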

    Essentially every recent project on which Laurie has worked has included at least one or two triads in which subsets of data — cohorts within the population of respondents — could prompt the opening question:  Are the groups significantly different?  As a proof-of-concept trial, Laurie reviewed the triads from a study of employees in a bilingual, multi-national corporation and picked two by eye that she thought might show a difference for cohorts based on native language.  Here are the raw data for the first one:

    The lead-in was “What would have made the situation even better?” See the labels for the corners below.

    And here is the plot for the 95% confidence regions on the (geometric) means, which are clearly significantly different:

    Here are the raw data for the second triad:

    The lead-in was “The overall theme in this story revolves around…” See the labels for the corners below.

    Again, here is the plot for the 95% confidence regions on the (geometric) means, which are also significantly different:

    Arguably the most interesting results are likely to arise when comparisons involve multiple cohorts, offering the possibility of simultaneously identifying similar (overlapping) and different (non-overlapping) responses to the same prompts and lead-ins.  Here are two examples from a project by a not-for-profit organization concerning a refugee population that included (among others) unmarried and married girls (ages 13-24), mothers and fathers of the girls, husbands of the girls, and unmarried men (otherwise similar to the husbands).

    Since the point here is solely to document the ability of this technique to make distinctions (or not), I have omitted some labelling details, including the lead-in, as a privacy/security consideration.  Even with that limitation, it is clear that the perspectives of the husbands and their unmarried counterparts differ significantly from those of the core family members.  The distinctions are even more evident in the second triad for this study:

    You don’t have to know anything about the society or culture involved to recognize the disparity in value placed on education by the various groups, or to appreciate the guidance that such clear results might provide for the client and supporting agencies, to say nothing of the benefit that might accrue to the unmarried girls in the long run.

     

    References

    Aitchison, J. (1986, reprinted 2003) The Statistical Analysis of Compositional Data. The Blackburn Press, Caldwell NJ. 416 pp. plus additional material.

    Chayes, F. (1960) On correlation between variables of constant sum. Journal of Geophysical Research, v. 65, p. 4185-4193.

    Pawlowsky-Glahn, V., and Barceló-Vidal, C. (1999) Confidence regions in ternary diagrams. In “Old Crust – New Problems,” Freiberg ’99.  Geologische Vereinigung, v. 89, p. 37-47.

    Weltje, G.J. (2002) Quantitative analysis of detrital modes:  statistically rigorous confidence regions in ternary diagrams and their use in sedimentary petrology.  Earth-Science Reviews, v. 57, p. 211-253.
