The two previous posts in this series looked at where sensemaking story data are concentrated in a ternary. Part VIIIa explored the use of a non-parametric method for calculating smooth (continuous) contour lines, essentially a data-driven “guess” at the density of story points. Part VIIIb gave examples of a simpler, albeit less elegant, alternative — overlay a grid of bins on the triad, count the number of story dots in each bin, and then color-code the bins. The result is a two-dimensional histogram or heat map, a discrete view of the data density distribution.
Imagine now a scenario in which the question is not “where are the data concentrated within a particular triad?” but rather “by how much?” And further, what is a quantitative measure of overall “how-muchness” that could be used to compare one triad with another? Let’s look at a couple of examples to clarify the difference between the two questions and to motivate why we might want to answer them for a real-world sensemaking project.

This figure compares a story-data triad (left) with a ternary plot containing 500 random points (see Part III of this series). No smoothing or binning is necessary to confirm the typical clustering of story dots at the vertices, center, and midpoints of the legs. By contrast, there is no “where” there in the random plot, for the simple reason that it was constructed to reduce the likelihood of any concentration of points. Hence, not only minimal “where,” but also minimal “how-muchness.”

This second figure compares two triads from the same sensemaking project. The left-hand one is the same as in the first figure. The difference in the data distribution — the “where” — for the right-hand one is evident on visual inspection. What is not so immediately obvious is which of the two has a higher overall data concentration. It would be possible to get some sense of the answer from contours (but see the caution in Part VIIIa) or bin counts or a “lasso” tool.
But even with those aids, the degree of clustering and overprinting leaves some uncertainty. Such ambiguity may be especially significant for a client – perhaps one who has already done multiple story-collection phases and safe-to-fail experiments – faced with a what-do-we-do-next decision that might be encapsulated in two side-by-side triads. The rationale for identifying a single-value “how-muchness” metric for story data parallels this generalization by Parrott (2010) concerning an analogous measure for an ecosystem:
- The search for an appropriate measure of ecological complexity is similar to the search for any holistic indicator of ecosystem state: managers and policy makers need a limited number of indicators upon which they can base environmental assessments and policy statements. Appropriate ecological indicators may provide leverage for conservation, preservation and also rehabilitation projects.
Any extra insight — however subtle the “holistic indicator” — might be useful to those “managers and policy makers” who are often the clients of a sensemaking project. The balance of this post describes such a metric that is widely used in other fields and outlines how it is calculated. After we see an example with actual story data, we will find that there are two quite different ways in which the result could be used by a client to determine next steps. [1]
So what (do we want to measure)?
Many disciplines ask a question equivalent to “How much are the data concentrated overall?” The richest source I have found is the bio-diversity literature, ironically because there has been so much misunderstanding and confusion in language and usage, ultimately leading to some very lucid primary articles and reviews. There are also useful sources in landscape ecology (see Additional Readings, below), physical and social geography, business, economics, statistics, information theory, computer science, chemistry, and physics. Page (2011, see especially the Prelude and Chaps. 1-2) discusses examples from several of these fields, emphasizing the interplay of diversity and complexity.
To clarify the concept of “data concentration,” here are some pairings of terms that you can find in the literature of those fields, nominally corresponding to a spectrum from higher to lower density (left to right):
- concentration — dispersion
- dominance — diversity
- certainty — uncertainty
- coherence — fragmentation
- heterogeneity — homogeneity
- connection — scatter
- utility — dissipation
- peakedness — flatness
- order — disorder
In some fields, one half of a pair may be used preferentially; in the bio-diversity literature, for example, the contrast more often seems to be expressed as lower vs. higher “diversity.” Additionally, there is a fair amount of mixing and matching, such as “concentration — scatter.” For our purposes, think of these pairings as a thesaurus-like menu to guide understanding, not rigid labels to be adhered to.
A property vs. an index of that property
Amazingly, these disciplines share a common metric or index for quantifying the property that is described by these verbal pairings. To introduce that metric, and to emphasize the property-vs.-index distinction, here is an excerpt from Jost (2006) writing about “diversity” in an ecology journal:
- Diversity… has been confounded with the indices used to measure it; a diversity index is not necessarily itself a “diversity”. The radius of a sphere is an index of its volume but is not itself the volume, and using the radius in place of the volume in engineering equations will give dangerously misleading results. This is what biologists have done with diversity indices. The most common diversity measure, the Shannon-Wiener index, is an entropy, giving the uncertainty in the outcome of a sampling process…. Entropies are reasonable indices of diversity, but this is no reason to claim that entropy is diversity.
Look again at the right-hand ternary in the first figure, the one with the 500 random data points. Relative to the story-data triad, those 500 points are dispersed, diverse, fragmented, scattered, or whichever term you find most meaningful going down the right-hand side of those pairings. There is, however, a universal way to quantify that verbal property: calculating the value of the entropy (see below). The random pattern above will have a high value of that index (entropy), whichever of those right-hand-side descriptors of the property you prefer. Or, operationalizing the quote from Jost, if you were blindfolded and reached into the random triad to pick a point, there would be a high uncertainty — indexed by a high entropy value — as to which one you would withdraw. “Surprise!”

The same logic applies in sampling stories from a triad (the left-hand side of the first figure). Imagine that you have a triangular green bottle, filled with narrative fragments that are concentrated in those clusters at the vertices, center, and midpoints of the legs, and that each fragment is encoded with its location in the triad. If you shake a story out of that bottle, it is very likely that it will be from one of those seven typical locations. No surprise there. That high certainty (low uncertainty) would be indexed by a low entropy value relative to the random plot. As a result, you gain little new information about the overall pattern by extracting that fragment. Information – that’s the cue (and queue) for Claude Shannon.
May I have the number please?
The term “entropy” was introduced into the jargon of mid- to late-19th century physics as part of the mathematical development of classical thermodynamics (by Clausius) and then statistical mechanics (by Boltzmann and Gibbs). This was a steampunk-inspiring world preoccupied literally and figuratively with conversion of energy between heat and mechanical work, with the efficiency of steam engines, and with gas molecules zipping around in their containers. There entropy could have stayed, contributing to the befuddlement of generations of students, for whom it is a consummate abstraction, lacking the palpable manifestation of other thermodynamic properties such as temperature and volume.[2]
Instead, it appeared again independently in the mid-20th century when Claude Shannon at Bell Labs was working out the theoretical principles of what he called “communication theory,” now better known as information theory. He was thinking about transmitting strings of characters down a communication channel — which in 1948 meant a phone line to most people — and about the ability of “information,” very narrowly and technically defined, to reduce ambiguity for the recipient of a message. It is a measure of Shannon’s genius that the generality of his results governs our wired and wireless technologies today.
- The core idea of information theory is that the “informational value” of a communicated message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative.
We previously encountered this element of surprise in Part VI, The Story as Unit of Observation, when this sentence appeared: “It was a dark and stormy avocado.” With implied apologies to Snoopy and Charles Schulz, I pointed out that we can understand a lot in a sensemaking project by reading individual stories, without the need to deal (consciously) with each individual word. Shannon, on the other hand, was necessarily dealing with individual letters (and other symbols).
Here is a better metaphor that could be a literal example if you only sample the channel annually: Getting a dozen cards on your birthday from friends and family is nice, but probably not surprising because it happens every year. Hence, it doesn’t tell you much that you didn’t already know, regardless of confirmatory value. But ten or a dozen people jumping out from behind the furniture when you get home and shouting “Surprise!” — with banners and presents and a cake — is probably full of unexpected information, to say nothing of affection and a jolt of adrenalin.
This should sound familiar because it is precisely the same argument that we applied above to the element of surprise that we might (or might not) encounter in shaking a narrative fragment out of the triangular green bottle. In fact, serially performing that act, one story fragment at a time, is logically identical to the one-character-at-a-time transmission problem Shannon was addressing. Which means we can adopt his mathematical formalism, or at least the simplest version of it, to evaluate the entropy of a story-data ternary[3].
Derivations? Where we’re going we don’t need derivations.
With apologies to Dr. Emmett Brown[4]….

Shannon evidently considered using the terms “uncertainty” and “surprise” for what is now called “information entropy” or “Shannon entropy” or most often simply “entropy” [5]. His derivation of an equation for entropy, commonly denoted by “S,” was formal and axiomatic (with pre-defined starting rules). As an example, one axiom was that the measure or index must be positive. A second was that each event — a character in the queue, a story dot in a ternary — was independent of all others. There are many replications and explanations of Shannon’s original derivation, to say nothing of specialized or generalized variants of the concept itself. One of the most useful and accessible discussions is Schneider (2014/2018), which requires only minimal mathematical comfort (algebra, analytic geometry); Stone (2015) is excellent but a significant step up in mathematical expectation (multivariate calculus).[6]
Rather than a formal derivation per se, we’ll look at Shannon’s approach and then rationalize how it can be applied to a sensemaking triad, specifically a tribin heat map. Shannon had two things to work with in addition to his axioms:
- an “alphabet” of M symbols — letters, numbers, etc.; and
- a “message” of N characters, made up of a mix of those symbols, that emerged sequentially from the channel.
Initially, Shannon considered a measure of information content for M equally-likely symbols in his “alphabet,” so the first character in the sequence would have a probability of simply P = 1/M. Then, when the next character appeared, he needed a rule for incrementing the amount of information. The familiar ways of combining probabilities are multiplicative — a fair coin with P = 1/2 for both heads and tails will have P = 1/2 x 1/2 = 1/4 for any combination of two successive flips. But Shannon wanted his index to be additive, so additivity became a third axiom. If you read one book, you might gain 2 units of information; if you read a second that is much more informative, you might gain 12 more units, for an additive total of 14. It usually wouldn’t occur to us to say it was 2 x 12 = 24 units.
Shannon then proved that the only way to satisfy all of his axioms was to use the logarithm of the probability, log(P) = log(1/M), instead of the “bare” probability, as the core measure of information. Now we can imagine a message of N characters arriving one symbol at a time in the channel, and we can drop each of the characters carrying log(1/M) worth of information into one of M buckets where they accumulate. There is only one flaw in the plan: for any alphabet, including Shannon’s and ours, there will not be a single, universal probability that applies to all symbols in the channel. In the Latin alphabet, the frequency of occurrence of “A” and “Z” in English words, for example, differs substantially — “pizzazz” is a very rare and surprising combo — with the result that the “A” bucket will generally be much fuller than the “Z” bucket. So we need to account for the frequency of occurrence of M different symbols as the N characters in the string arrive, in addition to their information value.
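As a quick numerical check on that additivity, here is a minimal sketch of my own (not from Shannon’s paper), written in Python with base-2 logarithms so that the units are the familiar bits:

```python
from math import log2

# One fair coin flip: M = 2 equally likely symbols, so P = 1/2.
p_one = 1 / 2
info_one = log2(1 / p_one)      # log2(2) = 1.0 bit

# Two independent flips: the probabilities multiply (1/2 x 1/2 = 1/4) ...
p_two = p_one * p_one
# ... but the logarithm turns that product into a sum, so information adds.
info_two = log2(1 / p_two)      # log2(4) = 2.0 bits = 1.0 + 1.0

print(info_one, info_two)       # 1.0 2.0
```

The multiplicative probabilities (1/2, then 1/4) become additive information (1 bit, then 2 bits). What remains is to total up such contributions while accounting for how often each symbol actually occurs, which is the frequency-weighted summation described next.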
This kind of summation is identical to both the logic and the arithmetic that we use when we open a piggybank; count the number (frequency) of each kind of coin (symbol); multiply the count times the face value (information) for each; and add those product terms together:
Total ($,€,¥…) = (# coin 1)x(value coin 1) + … + (# coin M)x(value coin M)
The equivalent for information is Shannon’s defining equation for entropy, S, where p_i is the probability of the i-th symbol appearing in the sequence of N characters, and the summation is for all i = 1 to M possible symbols:
S = – ∑ p_i log(p_i)
In words, this says that the entropy is the sum of the logs of the individual probabilities (“values”), weighted by their respective probabilities (“frequencies”). There is an implicit trade-off in this equation that is arguably the most important mathematical aspect of this post to grasp: as p_i decreases, log(p_i) increases (but in the negative direction).

As this graph of the log function shows, for any of the common logarithmic bases (b = 2, e, or 10), when x = 1, log(x) = 0; and as x decreases toward 0, log(x) becomes increasingly negative. This has two implications for application of Shannon’s formula, whether to a character string or story data:
- As the likelihood of a symbol appearing in the sequence of characters decreases, its probability p_i goes from 1 toward 0 on the x-axis, while log(p_i) increases in a negative direction on the y-axis. That is the frequency-information tradeoff.
- Since a probability is by definition between 0 and 1, the logarithm of a probability will always be negative (or zero, for a probability of exactly 1). Therefore all the weighted terms and their sum will be negative, in seeming violation of Shannon’s positivity axiom. But no problem: Shannon fixed it by simply putting a minus sign in front of the equation. If you invent it, you get to make up the rules.
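To make that trade-off concrete, here is a small sketch (the probabilities are invented purely for illustration) showing the size of the individual weighted terms, and how they sum to a lower S for a concentrated toy distribution than for a perfectly even one:

```python
from math import log

def shannon_entropy(probs):
    """S = -sum(p * log(p)), skipping zero-probability terms."""
    return -sum(p * log(p) for p in probs if p > 0)

# The frequency-information trade-off for individual terms:
for p in (0.5, 0.1, 0.01):
    print(f"p = {p:<5}  log(p) = {log(p):8.4f}  -p*log(p) = {-p * log(p):.4f}")

# Concentrated vs. dispersed toy distributions over four "bins":
concentrated = [0.85, 0.05, 0.05, 0.05]   # most of the data in one bin
dispersed    = [0.25, 0.25, 0.25, 0.25]   # spread evenly across all bins

print(shannon_entropy(concentrated))      # ~0.59  (lower S: high certainty)
print(shannon_entropy(dispersed))         # ~1.39  (higher S: the maximum for 4 bins, ln 4)
```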
In order to see how this will work with story data in a ternary/tribin (represented by the green triangular bottle), we need to recognize these parallels, starting a bit farther “upstream” than we might normally do. In the comparison of these five components, we are looking at the right-hand side of an “Information-is-to-Sensemaking-as-…” construct, that is, information : sensemaking = …
- for the author … = caller/sender : storyteller
- for the source … = communication channel : triangular bottle
- for delivery … = electronic signals : story fragments
- for the alphabet … = symbols : tribins
- for the message … = characters : signifier coordinates (a,b,c)
When I introduced the green bottle metaphor above, I added that each fragment is encoded with its location in the triad — those are the signifier coordinates. Hence, as a story fragment emerges from the bottle, we can place it in the corresponding bin, as in the figure below. This is exactly the same as placing an “A” or “Z” in the appropriate bucket as the characters emerge from the communication channel, which allows determining the frequency of each symbol simply by counting the number of that character in its bucket after the message has arrived.

The two plots above are for the same signifier, with N = 353 responses (stories). For a triad (left) divided into M = 100 tribins (right), as in this example with 10% grid size, each bin contains some number of story points, n_i, ranging from 0 to N (the latter only in the unlikely event that all N stories were in a single bin). For the i-th bin, the probability of finding a story dot is simply p_i = n_i/N. The entropy is then the sum of those individual logarithmic terms, each “weighted” by its respective probability (see the defining equation for S above). We’ll look at a numerical comparison of the result for two ternaries in the following section.
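Before moving to that comparison, here is a sketch of the bucket-filling step in code. It is one possible scheme of my own devising (not necessarily the binning that any particular sensemaking tool actually uses) for assigning a story’s coordinates (a, b, c), with a + b + c = 1, to one of the M = 100 tribins of a 10% grid:

```python
from collections import Counter
from math import floor

def tribin(a, b, c, k=10):
    """Assign barycentric coordinates (a + b + c == 1) to one of the k*k small
    triangles of a ternary subdivided at 1/k spacing (k = 10 gives the 10% grid
    with 100 tribins).  Returns a hashable (i, j, orientation) key."""
    i, j, m = floor(k * a), floor(k * b), floor(k * c)
    # Interior points give i + j + m == k - 1 (an upright cell) or k - 2 (an
    # inverted cell).  A point sitting exactly on a grid vertex gives k; push
    # it into an adjacent upright cell by convention.  (Points exactly on a
    # grid line simply follow the floor convention; production code might want
    # an explicit tolerance for floating-point noise.)
    if i + j + m == k:
        if i > 0:
            i -= 1
        elif j > 0:
            j -= 1
        else:
            m -= 1
    return (i, j, "up" if i + j + m == k - 1 else "down")

# Hypothetical signifier coordinates; a real run would read them from the data file.
points = [(0.62, 0.21, 0.17), (0.34, 0.33, 0.33), (1.00, 0.00, 0.00)]
counts = Counter(tribin(a, b, c) for a, b, c in points)        # n_i per occupied bin
probs = {key: n / len(points) for key, n in counts.items()}    # p_i = n_i / N
```

The (i, j, orientation) tuple is just a convenient label for a cell; the entropy calculation below needs only the resulting counts, not the labels.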
An example
The potentially unfamiliar math notwithstanding, the calculation of S can be done in a spreadsheet:
- The number of story dots in each bin, n_i, is extracted from the data file;
- the probability of finding a dot in that bin is calculated from p_i = n_i/N;
- the log of the probability is calculated and multiplied (“weighted”) by the probability, which gives the contribution to the entropy for that bin;
- the bin-entropies are then summed; and
- the sum is “negated,” which makes it positive as Shannon specified.
That sum is the overall entropy S for the triad, a single number that is an index (entropy) of a property (the overall concentration/dispersion of data) — the “how-muchness” metric — that we set out to find. Q.E.D. [7]
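For anyone who prefers code to a spreadsheet, here is a minimal Python sketch of those same five steps; the bin counts below are hypothetical placeholders standing in for the numbers extracted from a real data file:

```python
from math import log

def triad_entropy(bin_counts):
    """Shannon entropy S of a triad from its tribin counts (natural log).
    Empty bins are skipped, which is equivalent to assigning them zero."""
    N = sum(bin_counts)              # total number of story dots in the triad
    S = 0.0
    for n in bin_counts:
        if n > 0:                    # only occupied bins contribute
            p = n / N                # step 2: p_i = n_i / N
            S += p * log(p)          # step 3: weight the log by the probability
    return -S                        # steps 4 and 5: sum, then negate

# Hypothetical counts for a 100-bin heat map (a few clusters, many empty bins):
counts = [120, 45, 40, 30, 25, 20, 15] + [2] * 20 + [0] * 73
print(round(triad_entropy(counts), 4))
```

Natural logarithms are used to match the values reported below; any other base simply rescales the result by a constant factor.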
As an illustration, here are ternary data for two signifiers, T1 and T5, from a study that Laurie conducted for a healthcare client with a total of 738 respondents:

At first glance, it is easy to notice the data points scattered around in the “middle” of T1 (away from the vertices, center, and midpoints of the legs)… so it just looks more dispersed or fragmented than T5, like the random plot at the beginning of the post… so surely it’s the one with higher uncertainty (entropy), the one that the client might focus on, right? Maybe, maybe not. What if I now told you that, out of the 738 study respondents, 675 (92%) of them signified T1, but only 446 (60%) signified T5? Maybe if those other 200+ people had found T5 relevant to their stories, there would have been more data points scattered out in the middle, because it’s not like there aren’t some there already! Conversely, perhaps a disproportionate fraction of the additional 200+ points in T1 are concentrated at the centroid and the top and bottom-right vertices, where their presence is obscured by overprinting… in which case T1 could be more, well, concentrated. The only “obvious” point is that you can’t necessarily decide by visual inspection.
Here are the results for the entropy calculation, using natural logarithms (but the ranking would be the same for any other base, such as 10):
- S(T1) = 3.5042 — higher uncertainty, scatter, etc.
- S(T5) = 3.4666 — higher certainty, concentration, etc.
If you chose T1, stop congratulating yourself, for precisely the reason explained in the previous paragraph — visual inspection can easily be misleading. On the other hand, if you chose T5 and are now complaining because the two numbers are “almost the same,” here is something else to consider.
The entropy equation is a sum of the individual bin-entropies, but at the “root” level in the algebra it is a weighted sum of logarithms. So comparing two values calculated in this way is best understood by recognizing the “scaling” effect of any quantity expressed by logarithms. Here is a very different type of comparison — the number of subscribers to Amazon Prime, A, during two early years of its growth, expressed as base-10 logarithms:
- For 2011: log10(A) = 6.6021
- For 2012: log10(A) = 6.8451
Notice that you could have the same reaction as with the entropy comparison — the two numbers are “almost the same.” Except that they’re not, any more than the two entropy values are the same. In the Amazon Prime case, the comparison of 2011 vs. 2012 is 4,000,000 vs. 7,000,000 subscribers — an enormous jump of 3,000,000 households in Amazon-speak! (The latest publicly-available number for 2020 is log10(A) = 8.3010, which is left as a homework exercise for the reader to convert.)
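If you want to check that conversion yourself, undoing the logarithm takes one line per year (the log values are the ones quoted above):

```python
# Recover the subscriber counts from the base-10 logarithms quoted above.
for year, log10_subs in [(2011, 6.6021), (2012, 6.8451)]:
    print(year, round(10 ** log10_subs))   # roughly 4.0 million vs. 7.0 million
```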
Although the Amazon Prime example is for a vastly larger population (10^6 to 10^8 subscribers) than a typical sensemaking study (10^2 to 10^4 respondents), the principle is the same: small differences in log-derived quantities do not necessarily imply small degrees of impact. Which raises the question, how much impact can the entropy index have in a sensemaking project? The most important answer is: Don’t use the entropy index, S, in isolation. It should be treated as simply one more bit (especially if you use base 2) of insight that can complement other signifiers, statistical distinctions among cohorts and demographics, story theming, and the client’s ground-truth knowledge of the target population.
Beyond that caution, there are two plausible uses for this “holistic indicator,” in Parrott’s wonderful phrase. Firstly, a client might want more information before acting: “By identifying areas of high entropy, where uncertainty is the greatest, we can… make the most efficient use of limited resources by focusing on acquiring data that will provide the greatest reduction in uncertainty” (Vallarino, 2023). In other words, use a high-entropy index (again, along with other data) to identify areas of highest uncertainty for follow-up with the clientele prior to taking action. In Laurie’s experience with more than 100 projects, this is an unlikely scenario — most clients want to spend the bulk of their time and funds on programs, not data.
The second option is the more likely one — use a low-entropy index to help identify the people, places, and activities that have the highest certainty of immediate benefits and impact on the affected populations from whatever next steps are identified by the client. These will be the actions that make the “good” words in that pairings list come to life — certainty, coherence, connection, utility.
References
Andrews, F.C. (1980) Clarification and obfuscation in approaching the laws of thermodynamics, pp. 245-256 in Gaggioli, R.A., ed., Thermodynamics: Second Law Analysis, ACS Symposium Series, v. 122. American Chemical Society, Washington, DC.
Jost, L. (2006) Entropy and diversity. Oikos, v. 113, p. 363-375.
Page, S.E. (2011) Diversity and Complexity. Princeton University Press, Princeton, New Jersey. 291 pp. https://www.amazon.com/Diversity-Complexity-Primers-Complex-Systems/dp/0691137676/
Parrott, L. (2010) Measuring ecological complexity. Ecological Indicators, v. 10, p. 1069-1076.
Schneider, T.D. (2014/2018) Information Theory Primer, vers. 2.71, http://users.fred.net/tds/lab/paper/primer/html/index.html, retrieved 28 August 2017; also see, vers. 2.72, http://users.fred.net/tds/lab/papers/primer/primer.pdf, retrieved 18 November 2024.
Stone, J.V. (2015) Information Theory: A Tutorial Introduction. Sebtel Press, Middletown, Delaware. 243 pp. https://www.amazon.com/Information-Theory-Introduction-James-Stone/dp/0956372856/
Tribus, M. (1983) Thirty years of information theory, pp. 475-513 in Machlup, F., and Mansfield, U., eds., The Study of Information: Interdisciplinary Messages. John Wiley and Sons, New York.
Vallarino, D. (2023) Understanding Entropy: Unveiling the Power of Information in Data Acquisition and Predictive Modeling. https://medium.com/latinxinai/understanding-entropy-unveiling-the-power-of-information-in-data-acquisition-and-predictive-4b41820c5639
Additional readings: Landscape ecology
Gao, P., Zhang, H., and Li, Z. (2017) A hierarchy-based solution to calculate the configurational entropy of landscape gradients. Landscape Ecology, v. 32, p. 1133-1146.
• The “content” of the bins in this study is digital elevation, rather than a bio-diversity measure such as number of species living on the landscape. It is valuable because of the detailed look at the inter-relationship among resolution (bin size), counting methods, and Shannon entropy. (The parts about configurational entropy are less relevant here, unless you’re interested in statistical mechanics. But if you are, then you’re probably not reading this far down in the post anyway.)
Vranken, I., Baudry, J., Aubinet, M., Visser, M., and Bogaert, J. (2014) A review on the use of entropy in landscape ecology: heterogeneity, unpredictability, scale dependence and their links with thermodynamics. Landscape Ecology, v. 30, p. 51-65.
• An excellent review and extensive source for further reading, written with attention to disciplinary substance, ambiguous terminology, meaning, and bibliographic methodology.
Zurlini, G., Petrosillo, I., Jones, K.B., and Zaccarelli, N. (2013) Highlighting order and disorder in social–ecological landscapes to foster adaptive capacity and sustainability. Landscape Ecology, v. 28, p. 1161-1173.
• This paper looks at how landscapes change over time, as expressed by “spectral entropy,” which is a somewhat different topic. I include it nonetheless for two reasons. Firstly, it has an excellent mini-review (pp. 1165-1166) of the topic and how it relates to Shannon entropy, emphasizing the connection to complex adaptive systems and complexity measures (see Parrott, 2010, in the References). Secondly, it includes this very long sentence (p. 1163):
- Landscape sustainability problems can be addressed in terms of order and disorder: where order implies causality, well-defined boundaries and predictable outcomes to ensure continuous provision of functions and ecosystem services for human use, while disorder designates the circumstance of not knowing which of the four conditions (simple, complicated, complex, chaotic) is dominant at a given moment, implying hazy causality, shifting boundaries and often-unpredictable outcomes due to complex process interactions and higher uncertainty (Snowden and Boone 2007).
That reference is to the HBR paper on the Cynefin framework by Dave Snowden and Mary Boone. So if you were thinking, well, “This entropy stuff surely doesn’t relate to sensemaking, let alone Cynefin, does it?…” The answer is “Actually, yes, it does.”
Footnotes
- Parrott (2010) reviews examples of complexity measures, mostly brought over to ecology from information theory (surprise!) and physics, that can be categorized as temporal, spatial, spatiotemporal, or structural. The metric that I use in this post for story-data ternaries appears as a part or component of some of the temporal and spatial measures that she describes. This suggests that some extension of the “how-muchness” approach for sensemaking might be able to give a single-value measure of complexity… but of what? All triads in a study? All signifiers? The entire “project ecosystem”? I know, it seems far-fetched, or at least very fuzzy. It is, after all, just fantasy in a footnote. Still, I can’t help but wonder. ^
- For many students, their first exposure to thermodynamics in general and entropy in particular leads to confusion and even dread. Illustrative examples with coin tosses and card draws, cyclical engines and efficiency, or micro- and macrostates don’t do much to wipe away the obscurity. I speak from authority on both sides of the pedagogical aisle – as a student in two undergraduate and four graduate courses, and as a teacher for eight years in a senior/beginning graduate course. The thing I finally found that was most helpful was not an explanation of thermodynamics, but a philosophy of how to approach it, captured perfectly in an introductory paragraph to a paper by the late Frank C. Andrews (1980), long-time Professor of Chemistry at UC Santa Cruz. I put it here because I could not find it on the web in its entirety, but its brilliance deserves to be preserved:
- The laws of thermodynamics have attracted more attention to their formulation than many other scientific laws, perhaps because of the self-contained nature of thermodynamic phenomena and the “beauty” of deriving so incredibly many (valid) results from such seemingly paltry starting material. Also, those who study thermodynamics commonly experience it as one of the hardest subjects they take; it is abstract, its problems challenge their algebraic thinking to the fullest, problems rarely repeat types that were seen before, but they are ever-new and unexpected, thermodynamics is constantly being applied to new phenomena and in different ways. Thus, students experience delayed understanding of thermodynamics, often not mastering it until they have themselves taught or applied it for many years. There is then a temptation to seize on the way it is finally understood as the “right” way, the “best” way, the way that “if only I had heard that years ago, I would have been spared all this puzzled misunderstanding.” We neglect the importance of all those years of work in our own understanding. So we all write our books and offer our formulations as if the way our approaches differed from each other were really very important. And still our students have trouble understanding us whatever book we use, however we state the thermodynamic laws. So we go on hunting for the “right” statements, the ones that will make thermodynamics transparent quickly to everyone who wants to learn. It will be a long hunt.
There is a remarkable contemporary illustration of Andrews’ point that “we all write our books… as if” we have each discovered “the ‘right’ way, the ‘best’ way.” As of publication of this post, the longform repository Medium has about 400 stories that mention entropy, some attempting detailed explanations or tutorials. The TDS/towards data science “subsidiary” of Medium has about 300 additional stories that are typically much more technically oriented. Surely one of those will spare someone, somewhere all that puzzled misunderstanding, right? If not, then perhaps my feeble attempt herein will be successful. More likely, the long hunt will just grow longer. ^
- We must, however, make two assumptions. Firstly, the probabilities that we discuss in the balance of this post are based on the counts of a fixed number of data points in any given ternary, rather than Shannon’s limiting case of an infinitely long character sequence. Secondly, in principle, the underlying triangular reference frame for sensemaking data is a continuous surface, whether for data collection (typically a tablet) or display (a ternary plot), up to some resolution limit of the respective device or medium. In practice, however, as we saw in the opening paragraph of this post — and as detailed in the prior Part VIIIb — we are necessarily focused here on binned data in a heat map. Hence, we can use Shannon’s formulation for discrete variables (compare Chaps. 2 and 5 in Stone, 2015). All of which surely satisfies Chip Morningstar’s Goodenoughness metric at a very high level, say, 96%. ^
- The storied Dr. Emmett Brown will be reported to say, in response to Marty McFly’s concern that the DMC DeLorean might not be equipped for the roads of the future, “Roads? Where we’re going we don’t need roads.” In other news from the future, the flux capacitor remains elusive. But a scaled-up version of the Mr. Fusion Home Energy Reactor will be the surprise invention of this decade, powering the conversion of every remaining, unoccupied square meter of Arizona and Virginia into AI server farms. (I am indebted to my son for correctly predicting this fate for VA back in 2024.) ^
- There is a famous story of how Shannon finally settled on “entropy” as the name for his measure of ambiguity or diversity or surprise (Tribus, 1983). The pivotal role falls to John von Neumann, the Princeton mathematician and physicist who said, when Claude Shannon asked him for advice, “You should call it ‘entropy’ and for two reasons: First, the function is already in use in thermodynamics under that name; second, and more importantly, most people don’t know what entropy really is, and if you use the word ‘entropy’ in an argument you will win every time!” This was surely rooted in the sense of humor for which he was well known. ^
- If you’re already familiar with Shannon’s equation, or its equivalent from statistical thermodynamics, you probably don’t need to see a formal derivation. Nonetheless, here are some Wikipedia entries that go beyond the references above by Schneider (2014/2018) and Stone (2015) and that show the degree to which entropy, and Shannon’s work more generally, have penetrated modern science, engineering, and mathematics:
- Entropy (information theory)
- Information content
- Information theory
- Entropy in thermodynamics and information theory
Wikipedia is an absolute rabbit hole on the whole concept of entropy, so enter cautiously. ^
- In calculating the summation, any empty bins (gray in the heat map above) are assigned a value of 0, based on repeated application of L’Hôpital’s rule to the product expression (see Schneider, 2014/2018, p. 5). As long as the binning grid is the same size (say, 10% spacing) for all triads, then the respective values can be compared, regardless of differences in response rate. In the (unlikely?) event of having heat maps with differing grids, “normalized entropy” can be used, where the value of S for each triad is divided by the maximum possible value, Smax = log(M), for its respective bin count (see Schneider, 2014/2018, p. 5). ^
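As a short sketch of that normalized-entropy option (my own illustration; the bin counts are hypothetical), the division by log(M) puts grids of different sizes on the same 0-to-1 scale:

```python
from math import log

def normalized_entropy(bin_counts):
    """Shannon entropy divided by its maximum possible value, S_max = log(M),
    where M counts every bin in the grid, occupied or not.  The result lies in
    [0, 1], so heat maps built on different grid sizes can be compared."""
    M = len(bin_counts)
    N = sum(bin_counts)
    S = -sum((n / N) * log(n / N) for n in bin_counts if n > 0)   # empty bins add 0
    return S / log(M)

# Hypothetical counts on a 100-bin (10%) grid and a 25-bin (20%) grid:
print(normalized_entropy([50, 30, 20] + [0] * 97))
print(normalized_entropy([50, 30, 20] + [0] * 22))
```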
