# Sample Video Frame

# Exercise 43: A Simple Statistics Engine

This is a simple algorithm that I use for collecting summary statistics online, or without storing all of the samples. I use this in any software that needs to keep some statistics, such as mean, standard deviation, and sum, but can't store all the samples needed. Instead, I can just store the rolling results of the calculations, which is only five numbers.

## Rolling Standard Deviation and Mean

The first thing you need is a sequence of samples. This can be anything from the time it takes to complete a task, to the numbers of times someone accesses something, to star ratings on a Web site. It doesn't really matter what it is, just so long as you have a stream of numbers, and you want to know the following summary statistics about them:

- sum: This is the total of all the numbers added together.
- sum squared (sumsq): This is the sum of the square of each number.
- count (n): This is the number samples that you've taken.
- min: This is the smallest sample you've seen.
- max: This is the largest sample you've seen.
- mean: This is the most likely middle number. It's not actually the middle, since that's the median, but it's an accepted approximation for it.
- stddev: This is calculated using $sqrt(sumsq - (sum * mean)) / (n - 1) ))$ where
`sqrt`

is the square root function in the`math.h`

header.

I will confirm this calculation works using R, since I know R gets these right:

```
> s <- runif(n=10, max=10)
> s
[1] 6.1061334 9.6783204 1.2747090 8.2395131 0.3333483 6.9755066 1.0626275
[8] 7.6587523 4.9382973 9.5788115
> summary(s)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3333 2.1910 6.5410 5.5850 8.0940 9.6780
> sd(s)
[1] 3.547868
> sum(s)
[1] 55.84602
> sum(s * s)
[1] 425.1641
> sum(s) * mean(s)
[1] 311.8778
> sum(s * s) - sum(s) * mean(s)
[1] 113.2863
> (sum(s * s) - sum(s) * mean(s)) / (length(s) - 1)
[1] 12.58737
> sqrt((sum(s * s) - sum(s) * mean(s)) / (length(s) - 1))
[1] 3.547868
>
```

You don't need to know R. Just follow along while I explain how I'm breaking this down to check my math:

- Lines 1-4: I use the function
`runif`

to get a random uniform distribution of numbers, then print them out. I'll use these in the unit test later. - Lines 5-7: Here's the summary, so you can see the values that R calculates for these.
- Lines 8-9: This is the
`stddev`

using the`sd`

function. - Lines 10-11: Now I begin to build this calculation manually, first by getting the
`sum`

. - Lines 12-13: The next piece of the
`stdev`

formula is the`sumsq`

, which I can get with`sum(s * s)`

that tells R to multiply the whole`s`

list by itself, and then`sum`

those. The power of R is being able to do math on entire data structures like this. - Lines 14-15: Looking at the formula, I then need the
`sum`

multiplied by`mean`

, so I do`sum(s) * mean(s)`

. - Lines 16-17: I then combine the
`sumsq`

with this to get`sum(s * s) - sum(s) * mean(s)`

. - Lines 18-19: That needs to be divided by $n-1$, so I do
`(sum(s * s) - sum(s) * mean(s)) / (length(s) - 1)`

. - Lines 20-21: Finally, I
`sqrt`

that and I get 3.547868, which matches the number R gave me for`sd`

above.

## Register for Learn C the Hard Way

Register today for the course and get the all currently available videos and lessons, plus all future modules for no extra charge.