LabVIEW Idea Exchange

X.

Standard Deviation and Variance VI "Weighting" input Default should be "Population", not "Sample"

The "Std Deviation and Variance.vi" of the "Probability & Statistics" palette:

 

[Screenshot: Screen Shot 2014-12-30 at 12.06.14.png]

 

does exactly what the description says.

Wait!

What is that "Weighting (Sample)" input THAT IS NOT A REQUIRED INPUT? And more importantly, what is its Default Value "Sample" doing?

Let's take a look at the doc:

 

It computes variance = (1/W) * sum((X_i - mean)^2), where the X_i are the array elements (N in total), mean is their average, and:

 

W = N when weighting = Population

W = N - 1 when weighting = Sample (the default).

 

I am not a statistician, but I would be surprised if many in the engineering and scientific fields are using the second definition.

If you have a large sample, the difference is minimal.

In the other cases, all bets are off.

In particular try N = 1.
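
To make the two weightings concrete, here is a minimal Python sketch of the documented formula. It is not NI's implementation: the function name `variance` and its `weighting` parameter are made up for illustration, and returning NaN for the 0/0 case is a choice of this sketch (what the VI itself returns for N = 1 is not stated here). Python's standard `statistics` module provides the same two definitions as `pvariance` (W = N) and `variance` (W = N - 1), which serve as a cross-check.

```python
import statistics

def variance(xs, weighting="sample"):
    """Variance per the formula quoted above: sum((x - mean)^2) / W.

    Hypothetical helper for illustration only:
    W = N for "population", W = N - 1 for "sample".
    """
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    w = n if weighting == "population" else n - 1
    if w == 0:
        return float("nan")  # N = 1 with "sample" weighting is 0/0, undefined
    return ss / w

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, sum of squares 32

print(variance(data, "population"))   # 32 / 8
print(variance(data, "sample"))       # 32 / 7
print(statistics.pvariance(data))     # standard-library population variance
print(statistics.variance(data))      # standard-library sample variance

# The N = 1 case: the two weightings disagree completely.
print(variance([3.0], "population"))  # 0.0
print(variance([3.0], "sample"))      # nan in this sketch
```

With eight points the two results differ by a factor of 8/7; with one point, one definition gives zero and the other is undefined.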

 

There is a very verbose mention of it in the new source of all knowledge, Wikipedia, at the end of the article, but it has been proposed to move it to some more specialized article. In other words, nobody cares, unless they are statisticians. In that case, they'll use anything but LabVIEW to analyze data.

 

So either set the default value to "Population" OR make the input required AND the doc much clearer about the consequences of the weighting choice.

 

20 Comments
AristosQueue (NI)
NI Employee (retired)

I checked with the math team. "If one has access to all data w=N is even better. w=N-1 is preferred when one can only sample (some data chosen from many a large pool)." Because this is a statistics function, the default is the definition preferred when doing statistical analysis. The current behavior is intended and is, to the best of our knowledge, the right solution for most of our users. It will not be modified.

X.
Trusted Enthusiast

@AQ: The point is that unsuspecting users can miss the fact that there is a choice, not necessarily the default itself. I should probably ask a moderator to modify the title to reflect this, rather than expressing my personal preference.

I may actually be more general and incorporate this in a new suggestion of not allowing silent default input values in any of the math VIs, since I understand that this Idea will be declined and all my argumentation lost to posterity.

Darren
Proven Zealot
Status changed to: Declined
AristosQueue (NI)
NI Employee (retired)

Your argument is that a user should have to choose rather than silently getting a default. Our argument is that we do not believe this choice is something users should have to know about to use this node, as the default is correct for the majority of cases. If you have the knowledge, great... make the choice. If you don't, fine, the results will be right most of the time and close to right the rest of the time, still fine for a statistical approximation.

 

This node is designed for users to miss the fact that there is a choice most of the time. That is R&D's intent.

AristosQueue (NI)
NI Employee (retired)

My last post was on this specific issue. There's a more general commentary to make on APIs. The split in opinion on this node is the same split that can occur on most APIs. When should you set a default that is good enough for most and when should you force a conscious decision? There's a constant tension between ease of use and correctness of use. In this case, we lean toward ease of use. In other APIs, we lean toward correctness. There is no right answer across the board. Each case gets evaluated one-by-one by the API authors, sometimes revised after the API gets the light of day with users. In this case, the authors believe that the balance is correct as it stands.

X.
Trusted Enthusiast

Last thoughts on that function.

It offers two ways of computing the variance and standard deviation (and also returns the mean for free).

The "Sample" version is calculated as described in most sources I have seen, using the "Sample" definition, i.e., dividing the sum by n - 1, where n is the size of the input array. This results in an UNBIASED variance.

However, I would argue that the "Population" version used is a hybrid: it does NOT use the standard formula for the "Population" definition, which requires knowledge of the "Population Mean". Instead, it uses the mean computed from the input array, so the "Population" variance returned by this function is, as expected, BIASED.

A correct implementation of the "Population Variance" would require a "Mean" input and use it in the calculation, dividing the sum by n. This quantity would be UNBIASED. Note that this is the definition used by the GNU Scientific Library's gsl_stats_variance_with_fixed_mean function.

 

The VI linked to below illustrates this point, where "True Pop" uses the correct "Population" definition.

https://decibel.ni.com/content/docs/DOC-41032
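
The bias claim can also be checked without LabVIEW. The following is a minimal Python sketch, not the linked VI: it assumes a toy population (a fair six-sided die) and a sample size of 3, both chosen purely for illustration, and enumerates every possible sample so that the expected value of each estimator is exact rather than simulated. The names `e_sample`, `e_pop`, and `e_true_pop` are hypothetical labels for the three calculations discussed above.

```python
from fractions import Fraction
from itertools import product

# Assumed toy population: a fair six-sided die (illustration only).
population = [Fraction(v) for v in range(1, 7)]
n = 3  # assumed sample size

mu = sum(population) / len(population)  # true population mean
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # true variance

def sum_sq(xs, center):
    """Sum of squared deviations of xs from the given center."""
    return sum((x - center) ** 2 for x in xs)

# Every ordered sample of size n drawn with replacement, all equally likely.
samples = list(product(population, repeat=n))

def expected(estimator):
    """Exact expected value of an estimator over all possible samples."""
    return sum(estimator(s) for s in samples) / len(samples)

# "Sample": deviations from the sample mean, divided by n - 1.
e_sample = expected(lambda s: sum_sq(s, sum(s) / n) / (n - 1))
# "Population" as described above: deviations from the sample mean, divided by n.
e_pop = expected(lambda s: sum_sq(s, sum(s) / n) / n)
# "True Pop": deviations from the known population mean, divided by n.
e_true_pop = expected(lambda s: sum_sq(s, mu) / n)

print(sigma2, e_sample, e_pop, e_true_pop)  # 35/12 35/12 35/18 35/12
```

Under these assumptions the n - 1 version and the fixed-mean version both average out to the true variance, while dividing by n around the sample mean comes out low by a factor of (n - 1)/n.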

dthor
Active Participant

>> A correct implementation of the "Population Variance" would require a "Mean" input

 

Why do you need an input for "Mean" when the node already calculates the mean from the data? If the dataset that you're entering *is* the population, then you don't need an input for mean - the node calculates it for you.

 

There are only two reasons that I could think of for needing a "Mean" input:

  1. When you're explicitly setting the mean - for example, if you want to see what the StdDev or Variance looks like after a mean shift. But by changing the mean, you're implicitly saying that you no longer have the population as a dataset - you have a sample. After all, if you have *all* the data, how could your mean be any different from the population mean?
  2. When you know the population mean before calculating the variance or StdDev. This is purely for computational speed - if you're working on a population data set and you enter a mean that's not the population mean, your variance and StdDev will be wrong.

My experience is that one *never* actually has the entire population data set (I'm in the semiconductor field). After all, the majority use of statistics is to predict how things will happen in the future (example: statistical process control) and you obviously can't add those data points because they haven't happened yet!

Heck, even when the US Census Bureau takes a census, they *still* don't have the entire population - there's always some recluse out in the woods of Maine that refuses to let the gov't acknowledge his existence.

X.
Trusted Enthusiast

@dthor: Try the VI I provide in the link, read the last part of the Wikipedia article on variance, educate yourself (like I did) and you'll see the light (like I did).

What I am saying is that providing the correct mean to compute the variance would eliminate the bias that you get otherwise. You might not know it (the true mean), but then you should be using the sample calculation.

dthor
Active Participant

Ah, I see. I think.

 

So instead of trying to change the current function, as everyone appears to be OK with it, perhaps you'd be OK with another subVI that computes the corrected (unbiased) sample variance and one that computes the uncorrected (biased) sample variance (which, if I'm understanding you and the wiki correctly, is what the normal "Std Deviation and Variance.vi" is doing).

X.
Trusted Enthusiast

Correct. I have added a comment to the document on the NI community site, to make things hopefully clearer.