Sample Video Frame

Created by Zed A. Shaw Updated 2024-02-17 04:54:36

Exercise 43: A Simple Statistics Engine

This is a simple algorithm that I use for collecting summary statistics online, or without storing all of the samples. I use this in any software that needs to keep some statistics, such as mean, standard deviation, and sum, but can't store all the samples needed. Instead, I can just store the rolling results of the calculations, which is only five numbers.

Rolling Standard Deviation and Mean

The first thing you need is a sequence of samples. This can be anything from the time it takes to complete a task, to the numbers of times someone accesses something, to star ratings on a Web site. It doesn't really matter what it is, just so long as you have a stream of numbers, and you want to know the following summary statistics about them:

  • sum: This is the total of all the numbers added together.
  • sum squared (sumsq): This is the sum of the square of each number.
  • count (n): This is the number samples that you've taken.
  • min: This is the smallest sample you've seen.
  • max: This is the largest sample you've seen.
  • mean: This is the most likely middle number. It's not actually the middle, since that's the median, but it's an accepted approximation for it.
  • stddev: This is calculated using $sqrt(sumsq - (sum * mean)) / (n - 1) ))$ where sqrt is the square root function in the math.h header.

I will confirm this calculation works using R, since I know R gets these right:

> s <- runif(n=10, max=10)
> s
 [1] 6.1061334 9.6783204 1.2747090 8.2395131 0.3333483 6.9755066 1.0626275
 [8] 7.6587523 4.9382973 9.5788115
> summary(s)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3333  2.1910  6.5410  5.5850  8.0940  9.6780 
> sd(s)
[1] 3.547868
> sum(s)
[1] 55.84602
> sum(s * s)
[1] 425.1641
> sum(s) * mean(s)
[1] 311.8778
> sum(s * s) - sum(s) * mean(s)
[1] 113.2863
> (sum(s * s) - sum(s) * mean(s)) / (length(s) - 1)
[1] 12.58737
> sqrt((sum(s * s) - sum(s) * mean(s)) / (length(s) - 1))
[1] 3.547868

You don't need to know R. Just follow along while I explain how I'm breaking this down to check my math:

  • Lines 1-4: I use the function runif to get a random uniform distribution of numbers, then print them out. I'll use these in the unit test later.
  • Lines 5-7: Here's the summary, so you can see the values that R calculates for these.
  • Lines 8-9: This is the stddev using the sd function.
  • Lines 10-11: Now I begin to build this calculation manually, first by getting the sum.
  • Lines 12-13: The next piece of the stdev formula is the sumsq, which I can get with sum(s * s) that tells R to multiply the whole s list by itself, and then sum those. The power of R is being able to do math on entire data structures like this.
  • Lines 14-15: Looking at the formula, I then need the sum multiplied by mean, so I do sum(s) * mean(s).
  • Lines 16-17: I then combine the sumsq with this to get sum(s * s) - sum(s) * mean(s).
  • Lines 18-19: That needs to be divided by $n-1$, so I do (sum(s * s) - sum(s) * mean(s)) / (length(s) - 1).
  • Lines 20-21: Finally, I sqrt that and I get 3.547868, which matches the number R gave me for sd above.
Previous Lesson Next Lesson

Register for Learn C the Hard Way

Register today for the course and get the all currently available videos and lessons, plus all future modules for no extra charge.