
MathScript version of "riffle"

This is a follow-on to:

http://forums.ni.com/t5/LabVIEW/discrete-index-sampling-without-replacement/m-p/3635654

 

What is the MathScript command that does the same as the R function "sample"? sample is similar to the Riffle VI, but it also lets you specify how many values to draw and per-bin weights.

 

The R code I am trying to port is:

y <- sample(x = my_bins, size = N_samp, prob = bin_freq, replace = TRUE)

 

The "randpermutation" isn't going to work.  The bin probability "bin_freq" changes sample frequencies for each of the bins.  N_samp is at least 35x larger than my_bins.

Message 1 of 9

Forget MathScript, already (or, alternatively, forget LabVIEW and do it all in MATLAB or R).

 

I'm uncertain what "bin-weight" means, but it sounds like "I have 5 red marbles, 3 blue marbles, 8 white marbles, and 20 black marbles.  Help me to draw out 6, one at a time, and tell me their color(s)".

 

The solution is simple -- create a "pool" of 36 colored marbles, assign numbers 1..36 (or 0..35) to each of them, "riffle" them, and take the first six.  Do you know how to "assign numbers to them"?  A method I use is to (temporarily) put them in a Cluster whose first element is their "number", riffle the Cluster, then throw away the numbers.  Incidentally, this also works when you want to sort your marbles (just don't lose your marbles).
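
In R terms, just to sketch the idea (the color counts are the hypothetical ones from the paragraph above, and pairing with random keys is one way to do the "riffle"):

# Expand each color by its count, "riffle" the pool of 36, take the first six.
pool <- rep(c("red", "blue", "white", "black"), times = c(5, 3, 8, 20))
shuffled <- pool[order(runif(length(pool)))]  # pair with random keys, sort, discard keys
drawn <- shuffled[1:6]                        # six draws without replacement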

 

Bob Schor

Message 2 of 9

I'm not even sure you need to assign numbers to them; you don't care what position they were in originally, only the value within them, right?

Riffle.png

Message 3 of 9

@Bob - think of it as a different form of the data. If you have a histogram, then you have 2D data: a list of bins on the x-axis and, on the y-axis, values you can express either as counts or as probabilities.

 

The histogram is also known as the "empirical probability density function" or ePDF. There is another form of the data, one that maps random-number generation from the 0-1 domain (everyone has a rand function there): the eCDF, or empirical cumulative distribution function. It is the cumulative sum of the ePDF, usually normalized by the total in case the y-data are counts rather than probability densities. There are minor variations that adjust bin edges to improve quality, or use different interpolants if the "physics" calls for it.

 

You can do the same resampling with either form of the data. But let's say you have a million rows and only 10 buckets on the x-axis of the histogram. Your resampling has two options: you can pull from the million rows, or you can look at 10 buckets + 10 weights. The weights make a pretty dense form for random-number generation from arbitrary distributions.
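
Here is a rough R sketch of that dense form (the raw data is just a stand-in, and the interval search is the same trick as the sketch in my first post):

y_raw <- rnorm(1e6)                           # stand-in for the million rows
h <- hist(y_raw, breaks = 10, plot = FALSE)   # compress to ~10 buckets
ecdf_w <- cumsum(h$counts) / sum(h$counts)    # eCDF: normalized cumulative weights
u <- runif(1e6)                               # uniform draws on (0, 1)
y_new <- h$mids[findInterval(u, ecdf_w) + 1]  # resample from 10 centers + 10 weights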

 

It takes forever in non-compiled MATLAB. We are talking days. I need to accelerate it, so I am moving to MathScript/plain-G.

Message 4 of 9

@ogk.nz wrote:

I'm not even sure you need to assign numbers to them; you don't care what position they were in originally, only the value within them, right?


Right!

 

BS

Message 5 of 9

EngrStudent -- I wasn't quite sure of your point, but have a "reverse" question -- are you trying to create a Random Number Generator for some other probability density function, say Gaussian, Poisson, something "tabulated"?  If so, this is a reasonably well-studied question, and you should be able to find some good literature on this (I remember reading something in one of Knuth's books).

 

Bob Schor

Message 6 of 9

@EngrStudent wrote:

 

You can do the same resampling with either form of the data. But let's say you have a million rows and only 10 buckets on the x-axis of the histogram.

I am still not sure what you mean by "rows" and "buckets". Can you explain graphically or with an example?

 

An example of generating random numbers with a defined probability distribution can be found in my old post here.

Message 7 of 9

I have to use 'R' here, because I am still much faster and more fluent in it than in LabVIEW.

 

Here is code for the graph:

library(car)  # for qqPlot
N_samp <- 1e6
# Weighted sum of two normal draws, then rounded to the nearest 0.2.
y <- 0.5*rnorm(n = N_samp, mean = 0, sd = 1) + 0.5*rnorm(n = N_samp, mean = 4, sd = 1.5)
y <- round(5*y)/5
qqPlot(y)  # normal quantile (Q-Q) plot
grid()

 

Here is the normal quantile plot (aka qqplot):

Rplot.png

This plot shows

  • the rounding-induced discretization
  • the range of y
  • good linearity because, unsurprisingly, this is normal data on a normal quantile plot

Without the rounding it would have been a nearly straight line with 1e6 unique values, but that might have made the plot less informative. The big power of such a plot is that you can pretty cleanly identify the distribution family by eye.

 

So let's make a textbook histogram of this data.

Code:

hist(y)

Plot:

Rplot01.png

In this column plot, each x-axis position is a bin. There were 42 unique values of the rounded "y", but this plot has only 17 unique bin locations.

 

Here is the list of bin edges and counts.

Capture.PNG

 

So what? Statistics was invented to get one or a few numbers that capture the characteristic behavior of some larger population of data. This is where the mean, median, mode, variance, and IQR come from, right? What does this tabular formulation give that the others don't?

 

The true best density compression would be to know that this is a weighted sum of two normal distributions. The weights, means, and variances capture 100% of the physics ... for this data. The real world is rarely so clean. Most of the time we have no idea what the underlying distributions are. George Box, a giant in statistics, said "All models are wrong, but some are useful." Most of the time we get close, but not exact.

 

So there is a way to "reconstitute" a distribution of data, such as the original data, using a uniform random number generator and an eCDF plot. For each uniformly generated point, we map it through the eCDF to the x-axis. With the plot below, a random value of 0.2 would give an x-axis value of about 1.0.

 

Rplot02.png

I know a statistician might call it simple, but I like simple, and I like useful. We can remove more points, and with spline interpolation we can "reconstitute" the original distribution. We can make the millions of points as "good as new" after having boiled them down to a few tens of numbers. It isn't the exact weights/means/variances, but it is pretty close, and it gives equivalent results for a wide range of distributions. Where the real world isn't exact, it uses what the real world provides without presuming anything about the data.
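
Here is a rough R sketch of that reconstitution, using linear interpolation via approxfun (splinefun would give the spline variant; y is the rounded data from earlier):

h <- hist(y, plot = FALSE)                  # boil the data down to bins + counts
ecdf_w <- cumsum(h$counts) / sum(h$counts)  # eCDF evaluated at the bin centers
inv <- approxfun(ecdf_w, h$mids, rule = 2)  # interpolated inverse eCDF
y_new <- inv(runif(1e6))                    # a million "good as new" points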

 

There is one (big) requirement here, and it is a cousin of Nyquist sampling: we have to make the bin width meaningfully smaller than the smallest important "characteristic" scale in the distribution. If, for example, one of the contributing elements were a normal distribution with a standard deviation of 0.1, then we would want our bin width small enough to adequately represent a normal distribution with that spread, likely about half or a third of it.
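
For example, with the hypothetical 0.1 standard deviation from above:

bw <- 0.1 / 3  # a third of the smallest contributing standard deviation
h <- hist(y, breaks = seq(min(y), max(y) + bw, by = bw), plot = FALSE)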

 

I can do this in R, which is relatively fast to program, open-source, and makes pretty plots. R is also interpreted, so, like MATLAB, it can be achingly slow. R also doesn't handle more than one core well, and it doesn't handle genuinely big/huge jobs with nearly the aplomb that a decently built LabVIEW virtual instrument does.

Message 8 of 9

iSAX is a high-tech version of this. It allows handling terabyte-scale data on a single workstation.

Please note the colors of the header on the page.

Capture.PNG

 

I personally suspect this is the "magic smoke" behind Google. If you read Google patents, they are zealous about "Markov chain Monte Carlo", which is a relatively solid way to handle iSAX-stored, and then reconstituted, objects.

 

This is my personal opinion.

 

Message 9 of 9