LabVIEW


mean vs average


Hi,

The "mean" tool in help is described as:

"mean is the mean, or average, of the values in the input sequence X."

 

However, when feeding a very large array of single precision numbers into this function, I get a different result than when I take the same array, total up the values in a For Loop, and divide by the number of samples.

 

Opening up the "mean" tool VI, it is a DLL call, so I can't tell what is being done under the hood.

 

The array has 3,554,361 elements, the source of which is a TDMS file where we've gathered current data over a period of roughly 49 hours.  There was enough of an error that a noticeable difference in our mAh calculation was apparent (the value obtained during the test run, which is a straight average, vs. the value calculated upon reloading the data using the "mean" function), thus prompting my investigation into the matter.

 

So it appears "mean" and "average" are not really the same as the help file implies.  Is this intentional or a bug?

 

Thanks

David Jenkinson

 

Message 1 of 15

@david_jenkinson wrote:
I get a different result than when I take the same array, total up the values in the for loop, and divide by number of samples. 

 



How different are the results?  Can you provide a VI with some data saved as default?

Message 2 of 15

@david_jenkinson wrote:

So it appears "mean" and "average" are not really the same as the help file implies.  Is this intentional or a bug?


Single precision has a relatively small number of bits for the mantissa, so the order of mathematical operations will have a large effect on the outcome, especially if the values cover a large dynamic range.

 

After processing 3,554,361 values, the accumulated error will be significant.

 

Are the results more similar if you convert the array to DBL first? (Just for testing.) I would trust the "mean" function, because it does the actual computation in DBL (note the coercion dot).
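To see why an SGL accumulator and a DBL accumulator can disagree, here is a minimal Python sketch (LabVIEW diagrams can't be shown as text; the `f32` helper, which emulates SGL rounding via `struct`, is my construction, not anything from this thread). With one large value followed by many small ones, the SGL running sum discards every small contribution, while the DBL sum keeps them:

```python
import struct

def f32(x):
    # Round a double to the nearest IEEE-754 single-precision value (SGL)
    return struct.unpack("f", struct.pack("f", x))[0]

# One large value followed by 100 small ones (16777216 = 2^24)
data = [16777216.0] + [1.0] * 100

# SGL-style accumulation: round after every addition
sgl_sum = 0.0
for x in data:
    sgl_sum = f32(sgl_sum + f32(x))

# DBL-style accumulation (what Mean effectively does, per this thread)
dbl_sum = sum(data)

print(sgl_sum)  # 16777216.0 -- the hundred 1.0s vanished
print(dbl_sum)  # 16777316.0 -- correct
```

At 2^24 the SGL grid spacing is 2, so each `+ 1.0` rounds right back to the previous total.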

 

Message 3 of 15

SGL only has a precision of about 6 decimal digits, and you are adding a six-digit count of such values. Obviously a lot of bits will be discarded in that process.

Message 4 of 15

Yes, values would help.

 

The difference in measurement is:

 

Averaging using total/#samples = 79.88

Using the Mean VI = 81.72

 

So in running an experiment, casting the values to doubles before either scenario, I get exactly 81.72 for both.

 

So is there an internal "cast to double" in the Mean VI function?

 

Also, if I'm understanding correctly, isn't a conversion from single to double just adding zeros?  As in:

 

51.327356

would then be

51.32735600000 or such?

 

Is this correct?  Then I would think I'd get the same result no matter what.  I'd only expect to get a difference if the number were originally double precision, and I then cast it to single precision, eliminating some accuracy.  No?

 

 

Message 5 of 15

@david_jenkinson wrote:

So is there an internal "cast to double" in the Mean VI function?


Yes, there is a red coercion dot! (Watch the terminology: it is a conversion, nothing to do with casting!)


@david_jenkinson wrote:

Also, if I'm understanding correctly, isn't a conversion from single to double just adding zeros?


No, it is adding a significant number of mantissa bits to the representation, thus allowing more accurate computation of the additions and the division. If you only have 6 significant decimal digits, every addition will be rounded to the nearest representable value. If you do that a few million times, the error will be large.

 

Look at the following example. Just reversing the array causes a significantly different "average", so which one is right? 😮

 

 

(If you did the same in DBL, the difference between the two results would be around 10E-9, i.e. insignificant.)
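Altenbach's VI snippet isn't reproduced here, but the effect is easy to mimic in a hedged Python sketch (the `struct`-based `f32` helper emulating SGL rounding is my construction, not from the thread): summing the same array forward and reversed with an SGL accumulator gives two different results.

```python
import struct

def f32(x):
    # Round a double to the nearest IEEE-754 single-precision value (SGL)
    return struct.unpack("f", struct.pack("f", x))[0]

def sgl_sum(seq):
    # Emulate a For Loop with an SGL shift-register accumulator
    total = 0.0
    for x in seq:
        total = f32(total + f32(x))
    return total

data = [1.0] * 1000 + [1e8]   # a thousand small values, then one big value

forward = sgl_sum(data)                   # small values accumulate first
backward = sgl_sum(list(reversed(data)))  # big value first

print(forward)   # 100001000.0 -- correct
print(backward)  # 100000000.0 -- every +1.0 was rounded away
```

Near 1e8 the SGL grid spacing is 8, so once the big value is in the accumulator, each `+ 1.0` is smaller than half a grid step and rounds to nothing.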

 

 

Message 6 of 15

@david_jenkinson wrote:

Also, if I'm understanding correctly, isn't a conversion from single to double just adding zeros?  As in:


There is another factor here, which Altenbach has only alluded to, and which could affect this significantly: the range. The way floating-point numbers work, their resolution differs at different points along the number line. In the case of SGL, the resolution drops to 1 (or 2) at 2^24, which means you can't represent numbers any more finely than that once you cross it. Since your total is ~80*3.5M, that comes out to 280M, and at that point the resolution is much lower still (try it: enter a number in that range into an SGL control and you will see it coerce).

 

DBL has the exact same issue, but the numbers are considerably larger.


___________________
Try to take over the world!
Message 7 of 15

At some point you reach a situation where X+1=X; see also the detailed discussions about machine epsilon, which is related.
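The X+1=X situation can be checked directly. In this Python sketch (the `struct`-based `f32` helper standing in for SGL is my assumption, not thread code), 2^24 is exactly the point where adding 1 to an SGL value changes nothing:

```python
import struct

def f32(x):
    # Round a double to the nearest IEEE-754 single-precision value (SGL)
    return struct.unpack("f", struct.pack("f", x))[0]

x = f32(2.0 ** 24)           # 16777216: SGL grid spacing here is 2
print(f32(x + 1.0) == x)     # True -- X + 1 = X

# Machine epsilon for SGL is 2**-23 (~1.2e-7); adding anything much
# smaller than half of that to 1.0 also changes nothing
print(f32(1.0 + 1e-8) == 1.0)  # True
```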

 

As you can see in my example above, you'll get a significantly better mean estimate if the data is sorted, because the values for each addition operation are better matched in magnitude.

 

As an example, let's say we only have a resolution of two significant decimal digits and you have the following numbers in an array.

 

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100

 

If you add the ones first, the sum is 110 and the mean will be 110/11 = 10, which is the correct result.

 

If you reverse the array and start with the 100, the result remains at 100 even after all the additions of the ones (100+1=100 when rounded to two significant digits, per the assumptions above), and the mean will come out as 100/11 = 9.1 (again rounded to 2 significant digits), or off by ~10%. If you did the same problem with a much larger number of ones, the error would be much higher.
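The two-digit thought experiment above can be played out mechanically. In this Python sketch, `r2` rounds every intermediate result to two significant decimal digits, the rounding rule assumed in the post:

```python
def r2(x):
    # Round to two significant decimal digits, as assumed in the example
    return float(f"{x:.1e}")

data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

def mean_2digits(seq):
    total = 0.0
    for x in seq:
        total = r2(total + x)    # every addition is rounded
    return r2(total / len(seq))  # so is the division

print(mean_2digits(data))                  # 10.0 -- ones first: correct
print(mean_2digits(list(reversed(data))))  # 9.1  -- 100 first: 100+1 rounds back to 100
```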

 

Whenever the two inputs of an addition have a significant mismatch in magnitude compared to the size of the mantissa, the precision suffers. For single precision (SGL) computations, this effect occurs much earlier and is thus much more severe than with DBL.

 

Message 8 of 15

@tst wrote:
Since your total is ~80*3.5M that comes out to 280M, and at that point, the resolution is much lower (try it. Enter a number in that range into a SGL control and you will see it coerce).

Now that I can test, I see that the resolution at 280M is 32. That means you can't have values there with any finer resolution than that.

 

Look at your total value when you're using SGL and you will see just how far off it is. Of course, as Altenbach says, if the value you're adding is less than half of the resolution at the point your total has reached, it won't actually be added to the total, because it will be rounded back down, so that's also a factor.
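The resolution figure above can be confirmed with the same kind of `struct`-based sketch (the `f32` helper emulating SGL is my construction): near 280M the SGL grid spacing is 32, so additions smaller than half a step are rounded away entirely.

```python
import struct

def f32(x):
    # Round a double to the nearest IEEE-754 single-precision value (SGL)
    return struct.unpack("f", struct.pack("f", x))[0]

total = f32(280_000_000.0)   # in [2^28, 2^29), where SGL spacing is 32

print(f32(total + 15.0) == total)  # True -- less than half a step: lost
print(f32(total + 32.0) - total)   # 32.0 -- a full step survives exactly
```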


___________________
Try to take over the world!
Message 9 of 15

Well, I was going to suggest sorting the array, forming the sum, then dividing (which should mitigate the more serious rounding issues), but then I saw how many points were being averaged.  What is the range of the data (i.e., what are the maximum and minimum)?  This problem has more "legs" than I expected ...

 

BS

Message 10 of 15