Everybody get random / All gyal dem, all man dem – Lady Sovereign
In Search for Significance, I wrote about using the Chi-Square test to determine if a series of events was repeatable enough to be non-random. It’s a standard way to determine whether one group of samples differs enough from another group of samples. My standard “control group” with the Chi-Square is usually a fifty-fifty split, because a coin flip is as good a model of randomness as I would ever need for that test.
And for the most part, that works pretty well. Except of course when one is dealing with the idea of Entropy in information systems.
Put simply, in the forties Claude Shannon came up with a means to calculate what level of information was needed to be able to transmit data over noisy analog lines – either those used by computers or those used by communication devices. He wrote a paper called “A Mathematical Theory of Communication” in which he discussed the need for understanding a few things:
- How much information is needed to discern information from randomness
- How randomness could be mathematically modeled
- The “bit” as a unit of information (Shannon credited the word itself to John W. Tukey)
How much information?
Well, back in the day they were searching for the optimal way to transmit, store, and calculate information. Per the Information Entropy article at Wikipedia: “The entropy of English text is between 1.0 and 1.5 bits per letter, or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on human experiments.”
How can randomness be modeled?
According to his formula (included at the end of the article for the math phobics), Shannon determined that a fair coin has an entropy of one bit. However, if the coin is not fair, then the uncertainty is lower (if asked to bet on the next outcome, we would bet preferentially on the most frequent result), and thus the Shannon entropy is lower. A long string of repeating characters has an entropy of 0, since every character is predictable.
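Those three cases can be checked with a few lines of Python. This is a minimal sketch, not anything from Shannon's paper: the function name `shannon_entropy` and the 90/10 split for the unfair coin are my own assumptions for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a distribution given as a list of probabilities."""
    # Outcomes with probability 0 are skipped: they contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])     # two equally likely outcomes
loaded_coin = shannon_entropy([0.9, 0.1])   # assumed 90/10 bias, for illustration
repeated_char = shannon_entropy([1.0])      # one outcome, always predictable

print(fair_coin, loaded_coin, repeated_char)
```

The fair coin comes out at exactly 1 bit, the loaded coin at a bit under 0.47, and the repeated character at 0.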
What is a bit?
When using this formula
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
the result is measured in “bits” when the log portion means log base 2 (b = 2).
Let’s get back to Random
Easily done. What this theory states is that, for an either/or outcome like a coin flip, if the result is “for real” random (a true fifty-fifty) then the result of the above formula is 1, and if it’s “for real” not random (the same answer every time) then the output is 0.
The side effect of this is that unless the data is seriously skewed (much the same way a run of 10 guesses would have to be nearly all correct) the result stays close to 1, which is for all intents and purposes “random”.
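To see how skewed the data has to get before the number moves much, here is a rough Python sketch that sweeps a two-outcome source from 5 hits in 10 up to 10 hits in 10. The helper name `binary_entropy` is an assumption of mine, not something from the theory itself.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome source where one outcome has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Sweep from an even split to a perfect streak.
for hits in range(5, 11):
    p = hits / 10
    print(f"{hits} of 10: entropy = {binary_entropy(p):.3f} bits")
```

Even 7 of 10 still scores about 0.88, and it takes 9 of 10 to drop below half a bit.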
What about the whole “English language has an entropy between .6 and 1.3 bits” part?
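It means each letter of English text carries, on average, only .6 to 1.3 bits of new information. If all 26 letters were equally likely, each one would carry log2(26), or about 4.7 bits, so English is highly predictable, and that predictability is what lets a receiver determine whether it is looking at signal or noise.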
I don’t have a pocket protector – speak plainly you
There is a simple way to figure out whether or not the information being given is random or rational (non random). This allows one to use the formula to separate information from chaos, to recognize patterns and turn data into information.
That’s one ugly mother of a formula. You got to be kidding me.
Nope. And here’s another way of looking at the same formula:
Entropy(S) = Σ -p(i) log2 p(i)
Ugh. I hate you. That makes less sense than the other monster.
Let me explain.
Entropy is another way of saying “the randomness in the data”
The (S) means the sample set.
The funny looking E means “sum up the whole lot of it”.
p(i) means the proportion of the sample set that falls in class i. (Think 4 shots out of 100.)
log2 means log base 2, which is a programmer’s way of saying “the answer is in Shannon bits”. Excel would say LOG(number,2) and OpenOffice would say LOG(number;2).
I’m going to quote the tutorial from University of Florida that I’ve been working up to because they sum up this example quite nicely:
If S is a collection of 14 examples with 9 YES and 5 NO examples then
Entropy(S) = – (9/14) * Log2 (9/14) + – (5/14) * Log2 (5/14) = 0.940
Notice entropy is 0 if all members of S belong to the same class (the data is perfectly classified). The range of entropy is 0 (“perfectly classified”) to 1 (“totally random”).
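The 9 YES / 5 NO example can be reproduced straight from the counts. A minimal Python sketch, assuming a helper I've called `entropy_from_counts` (the name is mine, not the tutorial's):

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a sample set, given how many examples fall in each class."""
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count == 0:
            continue  # an empty class contributes nothing
        p = count / total              # p(i): the share of the set in class i
        entropy -= p * math.log2(p)    # add -p(i) * log2 p(i)
    return entropy

print(round(entropy_from_counts([9, 5]), 3))   # 0.94
print(entropy_from_counts([14, 0]))            # 0.0
print(entropy_from_counts([7, 7]))             # 1.0
```

The second and third calls are the two ends of the range for a two-class set: all one class gives 0, an even split gives 1.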
To check out the Chi-Square of 9 Yes and 5 No against a random sample of 7 Yes and 7 No, we receive:
Chi squared equals 1.143 with 1 degrees of freedom.
The two-tailed P value equals 0.2850
By conventional criteria, this difference is considered to be not statistically significant.
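For anyone who wants to verify those numbers without a stats package, here is a sketch in Python. It leans on one assumption worth flagging: with exactly 1 degree of freedom the chi-square upper-tail probability has the closed form erfc(sqrt(x / 2)), so this shortcut does not generalize to more categories. The function name `chi_square_1df` is mine.

```python
import math

def chi_square_1df(observed, expected):
    """Chi-square statistic and P value for a two-category comparison (1 degree of freedom)."""
    # Sum of (observed - expected)^2 / expected over the categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Closed form that holds only for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_square_1df([9, 5], [7, 7])
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

Both tests tell the same story for this sample: the entropy of 0.940 is close to 1, and the P value is nowhere near the usual 0.05 cutoff.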
Where we’ve been today
- We’ve determined a general outline for Information Theory
- We’ve reviewed how the ugly formula can be broken down
- We know that entropy is measured in “bits” when the log in the formula is base 2
- We’ve reviewed how to calculate it.
Next Up: Information Gain (where Ross Quinlan improved Shannon’s work)
Extra Credit: In the comments section, provide the Excel formula for the above example




I tried to leave a comment a little while ago but it seems to have disappeared. The story of my life, gah!
Good work Cuervos, I need to read it over again.
The summation of the (((observed frequency – expected frequency)^2) divided by the expected frequency)…or I guess in your example it would be [((9-7)^2)/7] “plus” [((5-7)^2)/7] = 1.142857143
Oh yeah, the probability is actually .28505
If you want to calculate in Excel, simply use the Chi-test function [=CHITEST] to calculate probability, followed by the Chi-inverse function [=CHIINV].
What’s up with the new/old format Fly?
You know The Fly is all superstitious and shit. He probably thinks the conversion to the new format marked an egregious end to his big dicked gains this year and is trying to reverse his fortune.
I’m fine with that. I’m losing badly this month, even when I appear to be winning.