Sample Video Frame
Exercise 43: A Simple Statistics Engine
This is a simple algorithm that I use for collecting summary statistics online, or without storing all of the samples. I use this in any software that needs to keep some statistics, such as mean, standard deviation, and sum, but can't store all the samples needed. Instead, I can just store the rolling results of the calculations, which is only five numbers.
Rolling Standard Deviation and Mean
The first thing you need is a sequence of samples. This can be anything from the time it takes to complete a task, to the numbers of times someone accesses something, to star ratings on a Web site. It doesn't really matter what it is, just so long as you have a stream of numbers, and you want to know the following summary statistics about them:
- sum: This is the total of all the numbers added together.
- sum squared (sumsq): This is the sum of the square of each number.
- count (n): This is the number samples that you've taken.
- min: This is the smallest sample you've seen.
- max: This is the largest sample you've seen.
- mean: This is the most likely middle number. It's not actually the middle, since that's the median, but it's an accepted approximation for it.
- stddev: This is calculated using $sqrt(sumsq - (sum * mean)) / (n - 1) ))$ where
sqrt
is the square root function in themath.h
header.
I will confirm this calculation works using R, since I know R gets these right:
View Source file ex43.1.sh-session Only
> s <- runif(n=10, max=10)
> s
[1] 6.1061334 9.6783204 1.2747090 8.2395131 0.3333483 6.9755066 1.0626275
[8] 7.6587523 4.9382973 9.5788115
> summary(s)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3333 2.1910 6.5410 5.5850 8.0940 9.6780
> sd(s)
[1] 3.547868
> sum(s)
[1] 55.84602
> sum(s * s)
[1] 425.1641
> sum(s) * mean(s)
[1] 311.8778
> sum(s * s) - sum(s) * mean(s)
[1] 113.2863
> (sum(s * s) - sum(s) * mean(s)) / (length(s) - 1)
[1] 12.58737
> sqrt((sum(s * s) - sum(s) * mean(s)) / (length(s) - 1))
[1] 3.547868
>
You don't need to know R. Just follow along while I explain how I'm breaking this down to check my math:
- Lines 1-4: I use the function
runif
to get a random uniform distribution of numbers, then print them out. I'll use these in the unit test later. - Lines 5-7: Here's the summary, so you can see the values that R calculates for these.
- Lines 8-9: This is the
stddev
using thesd
function. - Lines 10-11: Now I begin to build this calculation manually, first by getting the
sum
. - Lines 12-13: The next piece of the
stdev
formula is thesumsq
, which I can get withsum(s * s)
that tells R to multiply the wholes
list by itself, and thensum
those. The power of R is being able to do math on entire data structures like this. - Lines 14-15: Looking at the formula, I then need the
sum
multiplied bymean
, so I dosum(s) * mean(s)
. - Lines 16-17: I then combine the
sumsq
with this to getsum(s * s) - sum(s) * mean(s)
. - Lines 18-19: That needs to be divided by $n-1$, so I do
(sum(s * s) - sum(s) * mean(s)) / (length(s) - 1)
. - Lines 20-21: Finally, I
sqrt
that and I get 3.547868, which matches the number R gave me forsd
above.
Register for Learn C the Hard Way
Register today for the course and get the all currently available videos and lessons, plus all future modules for no extra charge.