• Statistics in the Triad, Part I: Geometric Mean

    The ternary plot, better known in the Cogniverse as a triad, is familiar to users of SenseMaker as a tool for both data collection and data display. Its three vertices usually denote potential attributes, among which respondents can choose in any proportions, to amplify or augment their responses or reactions to some prompt about which they have told stories.

    Collector (L) and results (R) for a triad from Health Workforce 2015 - Oral Health, Health Workforce Australia (click image for QED and HWA links)

    One question that invariably arises in both analysis and interpretation is “What is the average response?”, whether of the entire sample population or some particular demographic cohort(s). Unless you have a Ph.D. in statistics, or a tame statistician assisting you, the answer to that question and how to calculate “the right average” for data in a triad may not be obvious.

    Familiar territory: the arithmetic mean

    For most of us, the immediate answer for any collection of numbers is the arithmetic mean: add up all the individual values and divide by the number of them. If you have responses on some 1-10 scale that total 6110 from 837 people, then the average (the arithmetic mean) is 6110 ÷ 837 ≈ 7.3. But what does that really mean? One answer would be, well, we could replace all the individual responses with 7.3, and the net result would be the same. That’s a numerically correct answer, but for CE practitioners a socially oblivious one. It’s understanding the diversity of people and their responses that is at the heart of any SenseMaker project!

    Instead, let’s ask a somewhat deeper question: What is it that this average, or any average, is trying to find? The statistician’s answer would be that it’s a measure of central tendency. It’s a search for one number that is minimally-distant from all those respondents’ values, so we’re looking for a single number that is as close as possible to all those responses (even if that number itself, say 7.3, doesn’t exactly match any of the respondents’ values, which might be only the surrounding integers …5, 6, 7, 8…).

    In the two previous paragraphs, an unstated assumption crept in: we can work with the responses and the result we’re looking for on the number line, the visual representation of all real numbers on a straight line, extending in principle from -∞ through 0 to +∞. This creates an obvious meaning for distance, as in “minimally-distant,” so finding the arithmetic mean is then just a matter of finding the one point on the number line whose cumulative distance from all the responses to its left and right is smallest, once we settle on how to total up those distances.

    One possibility is brute force — try lots of points on the line and see which gives the smallest result. Another is to open iTunes and see if there’s an app for that. Or we can be a little more thoughtful and do what a mathematician would do: derive something. Those of you who have forgotten everything from first-year calculus, and those who never took it, can relax. We’re just going to state a few expressions/equations and the end result, with some sentences to connect them. (Those who actually remember the details can picture what the omitted steps would be anyway.) Despite that disclaimer, to appreciate the argument fully does require some modest comfort and facility with the algebraic manipulations. It is a mathematical topic, after all.

    We might begin by saying let’s just add up — indicated by the summation sign { \Sigma } — all those differences between the {n} individual points, {x}_{i}, on the number line, and the value we want to know, \overset { \_ }{ x }, the arithmetic mean. That would look like this:

    (1)   \begin{equation*} \overset { n }{ \underset { i=1 }{ \Sigma } } \left( { x }_{ i }-\overset { \_ }{ x } \right) \end{equation*}

    This form would be inappropriate, however, because we don’t care about the sign (+/-) of each individual difference. But it’s fixable by using the absolute value of each (in effect, treating every difference as positive):

    (2)   \begin{equation*} \overset { n }{ \underset { i=1 }{ \Sigma } } \left( |{ x }_{ i }-\overset { \_ }{ x } | \right) \end{equation*}

    Unfortunately, if we don’t already know the value of \overset { \_ }{ x } (which of course we don’t, since it’s the very thing we’re trying to determine!), we can’t do the next step in the usual derivation. (For the knowledgeable reader, the expression is continuous, but its derivative is undefined wherever \overset { \_ }{ x } coincides with one of the {x}_{i}, so we can’t simply look for a zero derivative; its minimum in fact falls at the median rather than the mean.) There’s another standard workaround, though, which is to square each distance:

    (3)   \begin{equation*} \overset { n }{ \underset { i=1 }{ \Sigma } } {\left( { x }_{ i }-\overset { \_ }{ x } \right)}^{ 2 } \end{equation*}

    This expression keeps the proper relative position of each point compared to the mean; generates only non-negative values; and solves the problem we just encountered (because it is differentiable everywhere, with a continuous first derivative). This also illustrates the kind of expression that arises in generating a “least-squares line” or “least-squares fit,” one of the simplest and most common tools in data analysis.
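
    The brute-force option mentioned earlier is easy to sketch, and it shows why the choice between expressions (2) and (3) matters. The Python below is a minimal illustration only, not part of any SenseMaker tooling: the response values and the function names are invented for the example. It scans candidate points on the number line and reports which one minimizes each kind of cumulative distance.

```python
import statistics

def sum_absolute_distances(values, candidate):
    """Expression (2): total absolute distance from candidate to every value."""
    return sum(abs(v - candidate) for v in values)

def sum_squared_distances(values, candidate):
    """Expression (3): total squared distance from candidate to every value."""
    return sum((v - candidate) ** 2 for v in values)

# Invented responses on a 1-10 scale (illustrative data, not from any survey).
responses = [5, 6, 7, 7, 8, 8, 8, 9, 10]

# Brute force: try every point from 1.00 to 10.00 in steps of 0.01.
candidates = [k / 100 for k in range(100, 1001)]

best_squared = min(candidates, key=lambda c: sum_squared_distances(responses, c))
best_absolute = min(candidates, key=lambda c: sum_absolute_distances(responses, c))

print("minimizer of squared distances: ", best_squared)
print("arithmetic mean:                ", statistics.mean(responses))
print("minimizer of absolute distances:", best_absolute)
print("median:                         ", statistics.median(responses))
```

    On this made-up sample the squared-distance search lands on the grid point nearest the arithmetic mean, while the absolute-distance search lands on the median. That is one more reason the squared form (3) is the one that leads to the mean.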

    These three expressions are the crucial part of what we need as a comparative basis for discussing the geometric mean. So we can skip the remaining steps in the derivation — expanding (3), differentiating, setting the result equal to zero, and then some algebra — and go to the final result:

    (4)   \begin{equation*} \overset { \_ }{ x } =\frac { 1 }{ n } \overset { n }{ \underset { i=1 }{ \Sigma } }{ x }_{ i } \end{equation*}
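
    For readers who would like to see them, the steps skipped on the way to (4) can be sketched as follows. This is the standard calculus argument, supplied here for illustration: treat \overset { \_ }{ x } as the unknown, differentiate (3) with respect to it, and set the result to zero.

```latex
\begin{align*}
\frac{d}{d\overset{\_}{x}} \sum_{i=1}^{n} \left( x_i - \overset{\_}{x} \right)^{2}
  &= -2 \sum_{i=1}^{n} \left( x_i - \overset{\_}{x} \right) = 0 \\
\sum_{i=1}^{n} x_i - n\,\overset{\_}{x} &= 0 \\
\overset{\_}{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i
\end{align*}
```

    The second derivative is 2n, which is positive, so this stationary point is a minimum.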

    In words, the arithmetic mean is the sum of all the individual values divided by the number of values. Just what we already knew, but arrived at in a way — looking for the minimally-distant point on the number line — that most of us never consider. We now turn to applying that same approach to the geometric mean.

    Not-so-familiar territory: the geometric mean

    Here is the simplest, if not necessarily most useful, way to distinguish between the two means:
    • If we’re adding a group of numbers, the arithmetic mean is the single number that could replace all the individual terms and the sum would be unchanged; whereas,
    • If we’re multiplying a group of numbers, the geometric mean is the single number that could replace all the individual terms and the product would be unchanged.
    Keep in mind that subtraction is just addition of a negative number; division is just multiplication with the reciprocal of a number; and a fractional power is just a root.

    Here are four-term examples of each mean (labelled by the subscript and dropping the overset bar for simplicity):

        \[ { x }_{ AM }={ \left( \frac { { x }_{ 1 }+{ x }_{ 2 }+{ x }_{ 3 }+{ x }_{ 4 } }{ 4 } \right) }\quad and\quad { x }_{ GM }={ \sqrt [ 4 ]{ { x }_{ 1 }{ \ast x }_{ 2 }{ \ast x }_{ 3 }{ \ast x }_{ 4 } } } \]

    Or in slightly more compact form:

        \[ { x }_{ AM }=\frac { 1 }{ 4 } { \left( { x }_{ 1 }+{ x }_{ 2 }+{ x }_{ 3 }+{ x }_{ 4 } \right) }\quad and\quad { x }_{ GM }={ \left( { x }_{ 1 }{ \ast x }_{ 2 }{ \ast x }_{ 3 }{ \ast x }_{ 4 } \right) }^{ \frac { 1 }{ 4 } } \]

    The fully-generalized forms for {n} terms look like this:

    (5)   \begin{equation*} { x }_{ AM }=\frac { 1 }{ n } \overset { n }{ \underset { i=1 }{ \Sigma } } { x }_{ i }\quad and\quad { x }_{ GM }={ \left( \prod _{ i=1 }^{ n }{ { x }_{ i } } \right) }^{ \frac { 1 }{ n } } \end{equation*}

    where the giant {\Pi} indicates pi-for-product, analogous to the { \Sigma } as sigma-for-sum.
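
    A quick numerical check of the “replace every term” description may help. The sketch below uses four arbitrary values chosen only for illustration; the variable names are not from any library.

```python
import math

# Four arbitrary positive values, chosen only for illustration.
values = [2.0, 4.0, 8.0, 16.0]
n = len(values)

# Left-hand equation in (5): sum divided by n.
arithmetic_mean = sum(values) / n

# Right-hand equation in (5): nth root of the product.
geometric_mean = math.prod(values) ** (1.0 / n)

# Replace every term by the mean and recompute the sum or the product.
sum_after_replacement = arithmetic_mean * n
product_after_replacement = geometric_mean ** n

print("arithmetic mean:", arithmetic_mean)
print("geometric mean: ", geometric_mean)
print("sum unchanged:    ", math.isclose(sum_after_replacement, sum(values)))
print("product unchanged:", math.isclose(product_after_replacement, math.prod(values)))
```

    For these values the geometric mean is smaller than the arithmetic mean, which is the general pattern whenever the values are positive and not all equal.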

    Presumably everyone reading this post has experience with, and therefore a good intuitive sense of, the arithmetic mean. We don’t have to think about it, we just use it. By contrast, applications of the geometric mean arise in domains of specialized knowledge: compound growth rates, including investing; changes in social statistics; metrics for aspect ratios in print and visual media; signal processing; and others.
     
    What most of these applications have in common is “normalization” of the data, for example, dividing by some reference value. This is exactly the situation in a triad: the coordinates for each data point in the equilateral triangle are calculated as ratios to the sum of the three input values — that’s the normalization step — and then expressed as percents. So, no question about it, the geometric mean is the way to go (see Endnote).
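
    The normalization step described above can be sketched in a few lines. This is an assumption-laden illustration of “ratio to the sum, expressed as percents,” not SenseMaker’s actual code; the function name normalize_triad and the input values are hypothetical.

```python
def normalize_triad(a, b, c):
    """Return the three inputs as percentages of their sum.

    Hypothetical helper: it illustrates the normalization described in the
    text and is not taken from any SenseMaker software.
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("the three inputs must sum to a positive number")
    return tuple(100.0 * value / total for value in (a, b, c))

# Two invented respondents whose raw inputs differ only by a common factor.
point_one = normalize_triad(2.0, 3.0, 5.0)
point_two = normalize_triad(4.0, 6.0, 10.0)

print(point_one)  # (20.0, 30.0, 50.0)
print(point_two)  # (20.0, 30.0, 50.0)
```

    Both respondents land on the same point in the triangle, because only the ratios among the three inputs survive normalization.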

    We’ll explore how this happens in practice — how do we actually calculate the geometric mean for a population or cohort in a triad? — in Part II of this post. But for now we have one small loose end to resolve. Notice that the left-hand equation in (5) is the same as (4). In turn, we got to (4) by talking our way down through expressions (1)-(3). What is the equivalent path for the right-hand equation? Asked differently, for the geometric mean, what is the equivalent to the standard number line, the frame of reference along which the cumulative, minimally-distant differences will be measured?

    To answer this, we can use an identity about the logarithm (log) of numbers — the log of a ratio of two numbers is equal to the difference of their individual logs:

    (6)   \begin{equation*} \log { \left( \frac { a }{ b } \right) } =\log { a-\log { b } } \end{equation*}

    This is true for a logarithm to any base (2, {e}, and 10 being the most common). We’re going to use natural logarithms (base {e}, written {ln}), which makes (6) look like this:

    (7)   \begin{equation*} \ln { \left( \frac { a }{ b }  \right)  } =\ln { a } -\ln { b }  \end{equation*}

    This identity gives us a way to deal with ratios of values, including normalized coordinates in a triad. So we can go back to expression (3) and use logs of normalized values, including the mean:

    (8)   \begin{equation*} \sum _{ i=1 }^{ n }{ \left( \ln { \left( \frac { { x }_{ i } }{ { x }_{ 0 } }  \right)  } -\ln { \left( \frac { { \overset { \_  }{ x }  } }{ { x }_{ 0 } }  \right)  }  \right) ^{ 2 } }  \end{equation*}

    where {{ x }_{ 0 }} is the normalizing factor.

    Now we can expand expression (8) using equation (7) to replace each of the two terms:

    (9)   \begin{equation*} \sum _{ i=1 }^{ n }{ \left( \ln { { x }_{ i } } -\ln { { x }_{ 0 } } -\left( \ln { { \overset { \_  }{ x }  }-\ln { { x }_{ 0 } }  }  \right)  \right) ^{ 2 } }  \end{equation*}

    The two terms with the normalizing factor, {{ x }_{ 0 }}, cancel each other, which leaves an expression like (3), but now a least-squares fit based on natural logarithms:

    (10)   \begin{equation*} \sum _{ i=1 }^{ n }{ \left( \ln { { x }_{ i } } -\ln { { \overset { \_  }{ x }  } }  \right) ^{ 2 } }  \end{equation*}
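
    The cancellation of the normalizing factor is easy to confirm numerically. The sketch below evaluates expression (8) for several choices of {{ x }_{ 0 }} and compares each with expression (10); the data, the trial value for the mean, and the function names are all invented for the illustration.

```python
import math

def sum_squares_normalized(values, center, x0):
    """Expression (8): squared differences of logs of values normalized by x0."""
    return sum((math.log(v / x0) - math.log(center / x0)) ** 2 for v in values)

def sum_squares_plain(values, center):
    """Expression (10): the same sum with the normalizing factor cancelled."""
    return sum((math.log(v) - math.log(center)) ** 2 for v in values)

# Invented positive values and an arbitrary trial value for the mean.
values = [20.0, 30.0, 50.0]
trial_center = 31.0

plain = sum_squares_plain(values, trial_center)
normalized = [sum_squares_normalized(values, trial_center, x0)
              for x0 in (1.0, 7.5, 100.0)]

print("expression (10):", plain)
print("expression (8) for three normalizing factors:", normalized)
```

    Up to floating-point rounding, every entry matches expression (10), so the choice of normalizing factor has no effect on where the minimum falls.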

    As above for the arithmetic mean, we’ll skip the intervening steps in the derivation — expanding (10), differentiating, setting the result equal to zero (to find the minimum), and then more algebra — to reach the result analogous to (4):

    (11)   \begin{equation*} \ln { \overset { \_  }{ x }  } =\frac { 1 }{ n } \sum _{ i=1 }^{ n }{ \ln { { x }_{ i } }  }  \end{equation*}
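
    As with the arithmetic mean, the skipped steps between (10) and (11) can be sketched briefly. This is the same standard argument, supplied for illustration; it assumes every {x}_{i} and \overset { \_ }{ x } are positive so that the logarithms exist.

```latex
\begin{align*}
\frac{d}{d\overset{\_}{x}} \sum_{i=1}^{n} \left( \ln x_i - \ln \overset{\_}{x} \right)^{2}
  &= -\frac{2}{\overset{\_}{x}} \sum_{i=1}^{n} \left( \ln x_i - \ln \overset{\_}{x} \right) = 0 \\
\sum_{i=1}^{n} \ln x_i - n \ln \overset{\_}{x} &= 0 \\
\ln \overset{\_}{x} &= \frac{1}{n} \sum_{i=1}^{n} \ln x_i
\end{align*}
```

    Because \overset { \_ }{ x } is positive, the factor in front of the sum cannot be zero, so the sum itself must vanish.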

    Finally, recalling that \exp\left( \ln { a }  \right) =a and using the identity { a }^{ b }=\exp\left( b\ln { a }  \right), we get the product form for the geometric mean (same as the right-hand side of (5)):

    (12)   \begin{equation*} { x }_{ GM }=\overset { \_  }{ x } ={ \left( \prod _{ i=1 }^{ n }{ { x }_{ i } } \right) }^{ \frac { 1 }{ n } } \end{equation*}
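
    To see that (11) and (12) describe the same number, and that this number really is the minimally-distant point on the logarithmic number line, here is a small sketch. The values and function names are invented for the illustration, and the brute-force grid is just one simple way to search.

```python
import math

def geometric_mean_product(values):
    """Equation (12): nth root of the product."""
    return math.prod(values) ** (1.0 / len(values))

def geometric_mean_logs(values):
    """Equation (11): exponentiate the arithmetic mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def sum_squared_log_distances(values, candidate):
    """Expression (10): squared distances measured on the log number line."""
    return sum((math.log(v) - math.log(candidate)) ** 2 for v in values)

# Arbitrary positive values, chosen only for illustration.
values = [2.0, 4.0, 8.0, 16.0]

gm_product = geometric_mean_product(values)
gm_logs = geometric_mean_logs(values)

# Brute force: try every point from 1.000 to 21.000 in steps of 0.001.
candidates = [1.0 + 0.001 * k for k in range(20001)]
best = min(candidates, key=lambda c: sum_squared_log_distances(values, c))

print("product form (12):    ", gm_product)
print("log form (11):        ", gm_logs)
print("brute-force minimizer:", best)
```

    The two formulas agree to within rounding, and the brute-force search lands on the grid point next to them.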

    So as Emperor Joseph II said in Amadeus, “Well, there it is.” We have derived in (12) the standard form of the geometric mean as the {n}th root of the product of {n} numbers. In the prior steps (10) and (11), we saw a way to view it as the point that minimizes a least-squares sum of distances on the logarithmic number line. Part II will discuss how this abstraction translates into practice.

     


     

    Endnote:

    If you want to pursue this further, there is a classic computer science paper that offers a good worked example: Fleming, P.J., and Wallace, J.J., 1986, How not to lie with statistics: the correct way to summarize benchmark results, Communications of the ACM, v. 29, no. 3, p. 218-221. (This link is firewalled, but a Google search on the title of the article will turn up copies of the PDF as part of CS course syllabi at several universities around the world.) The paper provides an empirical and theoretical explanation of the necessity of using the geometric mean (rather than the arithmetic mean) in processor benchmarks. The authors compare three processors for several performance criteria, and, though they didn’t use the display, their data are ready-made for ternary plots. As a result, it is relatively easy to follow their discussion and “think triad,” with straightforward application of their results to another perspective on what is presented here.
