Wednesday, August 15, 2012

Deviant ways to compute the deviation

I once found a bug in Excel. This was a long time ago, and they have since fixed it. I also found that same bug in every calculator I could find. I found this ubiquitous bug because I was looking for it. You see, I had found that bug in my own software. Somehow, finding that bug elsewhere made me feel better.
I was computing the standard deviation of all the pixels in an image, so there were umpty-leven thousand data points in the computation1. I used the recommended method, which seemed like a good thing. Boy! Was I ever wrong! All of a sudden, I found I was asking the computer to find the square root of a negative number. And it wasn’t liking me.

No imagination allowed
I found the cause of the error, and realized it was bum advice from my college statistics book. Dumb old college, anyway!
Computation of standard deviation
There are two popular formulae for the computation of standard deviation. The first formula is the definition, given below. This formula is moderately understandable and is invariably the first formula given.
Another formula often follows immediately on the footsteps of the first:
Why this formula should be a measure of deviation from the mean is not immediately obvious, but the leading statistics texts generally devote some ink to showing that the two formulae are equivalent2. They will then point out that the second formula is more convenient to use, and requires less calculations. This rephrasing of the original equation seems to be the formula of choice.
I took a quick poll of the introductory statistics books at hand. The following eight books were sampled. (If a page number is given in parentheses, this is the page where the book introduces the second form of the standard deviation formula.)
    Bevington (p. 14)
    Freedman, Pisani and Purves (p. 65)
    Langley (p. 58)
    Mandel (not given)
    Mendenhall (p. 39)
    Meyer (not given)
    Miller and Freund (p. 156-7)
    Snedecor and Cochran (p. 32-3)
Two of the texts are clearly in love with this second formula.
[The second formula] “has the dual advantage of requiring less labor, and giving better accuracy.”
Miller, et al.
[The second formula] “tends to give better computational accuracy than the method utilizing the deviations.”
There seems to be a strong consensus here. This would seem to be fact then, right?
Difficulty with standard formula
The second formula is often chosen because it requires only a single pass through the data, without having to remember the whole array. As you go through the data points, you sum the values (in order to compute the average), and you sum the squares of the values, and you count the number of data points. These three numbers plug into the second formula to give the standard deviation.
On a calculator, this is a very practical concern, especially back in my day when we used to have to make our own calculators with paper clicks, Elmer’s Glue, and used Popsicle sticks. It is (or was) considered a luxury to be able to store more than a few numbers in the memory. The second formulas managed to get by with storing only three numbers.
So, this would seem to be a recipe for paradise. Six out of eight stats textbooks recommend brushing with the second formula. On top of that, the second formula seems tailor made for a calculator with limited memory.
But, alas, there is trouble in paradise.
I reproduce here a quote from a book which I have found to be very practical-minded, and which has agreed with me every time that I have felt qualified to have an opinion.
[The second formula] “is generally unjustifiable in terms of computing speed and / or roundoff error.”
Press, et al., p. 458
Even more to the point is the following quote
“Novice programmers who calculate the standard deviation of some observations by using the [second] formula...often find themselves taking the square root of a negative number! A much better way to calculate means and standard deviations is to use the recurrence formulas...”
Knuth3, p 216
I believe we have some disagreement here. These two texts are clearly in the minority. But since neither one of these books is actually a statistics books, maybe we should just ignore them? Except for the fact that the minority is right. I became a believer because I was once a novice programmer who found himself taking the square root of a negative number4.
The little difficulty of the second formula comes from a recurring issue that causes numerical analysts to wake up in the middle of the night in a cold sweat: the loss of precision caused by subtracting two numbers that are close in size. If the average of the data is quite large compared to the standard deviation, or if n is quite large, then the second formula will require the difference to be computed between two large and nearly equal numbers. You need a whole lot of precision to pull this off.
To illustrate, consider the following data set of three points: (1,000,001, 1,000,002, and 1,000,003). The average of the data is obviously 1,000,002. We can easily compute the correct standard deviation from the first equation:
errata - the third term should be (1000003 - 1000002)^2
Using the second equation for computing the data set, we first compute the sum of the squares of the data: 3,000,012,000,014. (Gosh, that’s a big number!) Then we compute n times (the average of the data squared): 3,000,012,000,012.
We next take the difference between these two numbers. Provided we have at least 13 digits of precision, we get the correct difference of 2, and subsequently the correct standard deviation of 1. But with less than 13 digits of precision, the difference could be negative or positive, but probably isn’t 2. This is what Press et al., and Knuth were warning of.
Two pass algorithm
The most obvious way to avoid this bug is to go back to the original formula. This algorithm requires two passes through the data: the first pass to compute the mean, and the second pass to compute the variance.
This method has the disadvantage of requiring the data to be retained for a second pass. You just can’t do that in a calculator with limited memory. Knuth (p. 216) suggested a different option.
The recursive algorithm
Now we are ready for the fun part. Let’s say that we have the average of 171 numbers, and we want to average one more number in. Do we have to start over at the beginning and add up all 172 numbers? Or is there some shortcut? Is there ever an end to rhetorical questions?
The answers are no, yes, and “why do you ask so many questions?”
The really fun part is that the sum of the first 171 numbers is just 171 times the average. To get the average of the 172 numbers, you multiply the average of the first 171 numbers by 171. Then you add the 172nd number, and divide by 172. Diabolically clever, eh?
Put in algebraic form, if the average of the first j numbers is mj,
Then the average of the first j+1 numbers is given by this formula
with m1 = x1.
This is known as a recursive formula, one that builds on previous calculations rather than one that starts from scratch. The recursive formula for the average is pretty simple to understand intuitively. There is a bit more algebra in between, but it is possible to calculate the standard deviation in a recursive manner as well. If you know the standard deviation and mean of the first 171 data points, you can calculate the standard deviation of the set of 172 data points from this.
Here is the recursive formula for the variance, which is the square of the standard deviation:
This is the formula for the variance. The standard deviation is the square root of this.
This has been a little lesson in numerical analysis called “don’t subtract one big number from another”. And a little lesson in “don’t believe everything you read”. And lastly, a lesson in “recursion can be your friend”.
Bevington, Phillip, Data Reduction and Error Analysis for the Physical Sciences, McGraw-Hill, 1969
Freedman, Pisani and Purves, Statistics, 1978, W.W. Norton & Co.
Hamming, Richard, Numerical Methods for Scientists and Engineers, 1962, McGraw-Hill
Knuth, Donald E., Art of Computer Programming, Vol. 2, SemiNumerical Algorithms, 2nd edition, 1981, Addison Wesley
Langley, Russel, Practical Statistics Simply Explained, 1971, Dover
Mandel, John, The Statistical Analysis of Experimental Data, 1964, Dover
Mendenhall, William, Introduction to Probability and Statistics, 4th ed. 1975, Duxbury Press
Meyer, Stuart, Data Analysis for Scientists and Engineers, 1975, Wiley and Sons
Miller, Irwin, Freund, John, Probability and Statistics for Engineers, 3rd ed. 1985, Prentice Hall
Press, William, Flannery, Brian, Teukolsky, Saul, Vetterling, William, Numerical Recipes, the Art of Scientific Computing, Cambridge University Press, 1986
Snedecor, George, Cochran, William, Statistical Methods, 7th ed. 1980, Iowa State University Press
1) To be honest, these were the days when a big image was 640 by 480, so there weren’t quite that many data points.
2) Since the derivation is so widely available, it will not be reproduced here. Suffice it to say that the squared quantity in the first formula is expanded, and the summation is broken into three summations, some of which can be simplified by replacement with the definition of the average.
3) Donald Knuth, was born, by the way, in Milwaukee, WI. I currently live in Milwaukee. His father ran a printing business in Milwaukee. I work for a little mom and pop printing company in Milwaukee. Need I go on?
4) I was once a novice programmer, but I don't spend much time programming anymore.  I remain, however, a novice!


  1. There's a typo below "To illustrate, consider the following data set of three points: (1,000,001, 1,000,002, and 1,000,003). The average of the data is obviously 1,000,002. We can easily compute the correct standard deviation from the first equation:"

    Look at the third term in the numerator under the square root.

  2. The square root of -1 can't be detected by calculator as it is i which is called as iota.It equals to the square root of -1.And its square is 1.
    Fluid Dynamics

  3. Greetings and Salutations John!

    Very much enjoyed your article "Deviant ways to compute the deviation".

    Would it be possible to have a .pdf copy of the article?

    Carlos-Luis Santana.