
LabVIEW Idea Exchange

X.

Standard Deviation and Variance VI "Weighting" input Default should be "Population", not "Sample"

The "Std Deviation and Variance.vi" of the Probability & Statistics" Palette:

 

[Screenshot: Std Deviation and Variance.vi (Screen Shot 2014-12-30 at 12.06.14.png)]

 

does exactly what the description says.

Wait!

What is that "Weighting (Sample)" input THAT IS NOT A REQUIRED INPUT? And more importantly, what is its Default Value "Sample" doing?

Let's take a look at the doc:

 

It computes variance = (1/W) * sum((X_i - mean)^2), where the X_i are the array elements (N in total), mean is their average, and:

 

W = N when weighting = Population

W = N - 1 when weighting = Sample (the default).

 

I am not a statistician, but I would be surprised if many in the engineering and scientific fields are using the second definition.

If you have a large sample, the difference is minimal (the two results differ only by a factor of N/(N-1)).

In the other cases, all bets are off.

In particular, try N = 1: the "Sample" formula divides by W = N - 1 = 0.
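
To make the difference concrete, here is a minimal Python sketch of the formula above (the function name and signature are mine, not NI's):

    def variance(x, weighting="Sample"):
        """variance = (1/W) * sum((x_i - mean)^2), with W = N or N - 1."""
        n = len(x)
        mean = sum(x) / n
        w = n if weighting == "Population" else n - 1
        return sum((xi - mean) ** 2 for xi in x) / w

    print(variance([2.0, 4.0, 6.0], "Population"))  # 8/3 = 2.666...
    print(variance([2.0, 4.0, 6.0], "Sample"))      # 8/2 = 4.0 (the VI's default)
    print(variance([5.0], "Population"))            # 0.0: a singleton has zero spread
    # variance([5.0], "Sample")                     # ZeroDivisionError: W = N - 1 = 0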

 

There is a very verbose mention of it in the new source of all knowledge, Wikipedia, at the end of the article, but it has been proposed to move it to some more specialized article. In other words, nobody cares unless they are statisticians, and in that case they'll use anything but LabVIEW to analyze data.

 

So either set the default value to "Population", OR make the input required AND make the doc much clearer about the consequences of the weighting choice.

 

20 Comments
X.
Trusted Enthusiast

Just to clarify: I have been using this VI for a while, and it was only when I needed the sample variance for the first time (rather than the "usual" population one) that I realized the default setting was inappropriate and, in fact, a serious potential source of headaches.

 

The "sample" definition is of value for small sample size, where indeed, with a "population" estimate, the variance may well decrease (and in fact reach zero for a sample size of 1), while intuitively, you'd expect the "uncertainty" of your mean value to be getting larger as your sample size decreases.

But I'd argue that unless you are studying the effect of sample size, this is not the definition you expect a function called "Variance" to use.

JimChretz
Active Participant

Are you saying that LabVIEW should force me to spend hours making my old, perfectly working VIs compatible with LabVIEW 2015?

 

I'll have to search all instances of the "Std Deviation and Variance.vi" and change the enum constant "Population:1" value to "Population:0"

 

How would you make it backward compatible?

X.
Trusted Enthusiast

If they force the input to be required, then it is just a matter of finding all those unconnected instances where, by connecting nothing, you thought you were using the "standard" definition of the variance when in fact you were not.

As a side note, they could also slightly optimize the code, as there is at least one unnecessary array subtraction...

wevanarsdale
Member

The "population" variance defined above only makes sense for very large N.  The sample variance has exactly the right limit (indeterminant) as N goes to one.  How can you estimate deviation from the mean with only one sample?  The sample statistic is an unbiased estimator of the variance for an infinite population.  The "population" statistic is a biased estimator that typically gives low values.  Why would you use a biased estimator for finite sample sets associated with the input array?  I would retain the default input values. 

X.
Trusted Enthusiast

@wevanarsdale: you did read my first comment, right?

 

My point is not about which formula is recommended for computing the variance. That is the user's choice.

My point is that, IN MY EXPERIENCE, most people I work with are accustomed to the "population" definition. When they use a function called "Variance", I bet they do not suspect that there could be other definitions that would be force-fed into their code unknowingly. I am just raising this potential issue (which I have faced myself). If people want to vote this down because the most active LabVIEW forum users turn out to be statisticians, then let them vote it down (or rather, ignore this thread).

Most likely, nobody cares, and this post, which is intended for NI to read and ponder before they release future VIs with "default" values that are not "required", will get very few kudos.

 

Now if you'll allow me to go slightly off topic, one of the problems with using LabVIEW in science or engineering is that it is very difficult to release one's code in the open. Not everyone has a LabVIEW license, the code is not legible without LabVIEW installed, and even then, math is hardly legible to the uninitiated (and, I'd hazard, even to the author of the math code) due to the specificities of the graphical language. In a nutshell, it cannot be checked by your peers. Therefore, any result "computed using LabVIEW" is essentially a trust-me result, unless you separately provide the algorithm implemented. This is where not really knowing what a Math or Statistics VI truly does makes things even worse in terms of scientific reproducibility. And the example I have brought up is among the least problematic, because you can discover what the default setting is and what it means (it is even documented with math formulas!), and furthermore the code is in G, not hidden in a C DLL...

So if you do not use the intended definition, your colleagues using C, Python, R, MATLAB, Mathematica (*) or any other language are going to come back to you and tell you that they don't obtain the same result. And you will hopefully be able to figure out where the discrepancy comes from.

 

(*) As a side note, all the languages/environments I cited have their own idiosyncrasies.

 

- The GNU Scientific Library (GSL, for C) has two separate functions: gsl_stats_variance (sample definition) and gsl_stats_variance_with_fixed_mean (population definition), where you need to provide the mean value. It also has a hybrid, gsl_stats_variance_m, where you provide the mean but the sample definition is used...

 

- Python NumPy's var uses the population definition by default, but you can pass a delta-degrees-of-freedom parameter, ddof=1, to get the sample definition (see the snippet after this list).

 

- R, as you would expect, computes the sample variance and has no option for the population definition. N = 1 returns NA.

 

- MATLAB also uses the "sample" variance, var(x), as its default. For N = 1, it uses a weight of 1 rather than 0, so the variance of a singleton is defined (and equal to the population variance as well). To get the "standard" (population) variance, you need to use var(x,1).

 

- Mathematica also uses the sample variance and returns an error for N = 1. There is no option for the population definition, although that is the one used pretty much throughout the whole MathWorld page on variance (where they call it the... sample variance!). The "sample" variance IS discussed there, but it is called the "bias-corrected" sample variance.
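
To illustrate the NumPy entry above (a short sketch of the ddof parameter mentioned there; the toy values are mine):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0])
    print(np.var(x))          # 2.666...: population definition (ddof=0 is the default)
    print(np.var(x, ddof=1))  # 4.0: sample definition

    print(np.var(np.array([5.0])))          # 0.0
    print(np.var(np.array([5.0]), ddof=1))  # nan, plus a "Degrees of freedom <= 0" RuntimeWarning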

 

All in all, this sample of definitions shows that things are not as obvious as they may look, and I would encourage NI (and the other language vendors) to force the user to think before blindly using the default settings.

 

tst
Knight of NI

I have absolutely no idea about the math (I fall into X.'s "nobody cares" camp), but a couple of relevant comments:

 

  1. If all those other languages chose to use the other option, presumably there was some reason. Mathematica, not exactly a math slouch, apparently doesn't even have the option you want.
  2. NI can relatively easily change the default value of the input, and it won't require anything of the users: they have a mutation process in place and could simply add this case to it (if the VI was called and nothing was wired in, create a constant with the old default value). I'm guessing that this process doesn't handle things like calls by reference, but I doubt many people are calling the VI that way.

___________________
Try to take over the world!
X.
Trusted Enthusiast

@tst: regarding your point 2, you are also ignoring the possibility that users have files (configuration or another type) with the current enum definitions hard-coded. Surreptitiously changing these definitions would potentially cause problems. This is why I accept that this is not the best option, and making the input required seems preferable.

Another option is to slowly deprecate the current version and replace it with a new one (as has been done in the past for many VIs). The older version would remain, buried in the vi.lib folder.

 

BTW, there are a lot of other Math VIs that have "recommended" (read: optional) inputs with a major impact on the results. For instance, fitting VIs have a "method" input, which changes the results significantly in some cases. Worse, NONE of the inputs are required, not even the data to be fitted...

tst
Knight of NI

Why would the enum change? When writing the mutation for something like this, the enum stays untouched; the only change is that VI calls which had nothing wired into the input now get a constant wired in, holding the old default value. Again, this is something NI already does: it is applied automatically to any VI saved in a version < X the first time it is opened.


___________________
Try to take over the world!
Tom_Hawkins
Member

Your list of examples shows that R, MATLAB and Mathematica follow the same default behaviour as LabVIEW, and that's a pretty good indication to me that LabVIEW is probably doing it right.

 

With respect, if you don't understand the difference between sample and population variance, or realise that you need to specify which one you want, then you probably need to sit down and learn a bit more stats before diving in and using these functions in a calculation. For what it's worth, my two elementary stats textbooks both show the 'sample' definition of variance first and clearly flag up the difference between the sample and population definitions.

 

If the documentation can be improved to help users who don't realise this, then by all means let's improve it, but I don't see any need to change the VI.

X.
Trusted Enthusiast

@nekomatic: I guess my point has been missed altogether... The fact that three other languages use the same definition only proves that they, too, are most likely misused by unsuspecting users.