LabVIEW Idea Exchange

Peti · ‎08-07-2013

Dear LabVIEW fans,

Motivation:

I'm a physics student who uses LabVIEW for measurement and also for data evaluation. I've been a fan since version 6i (around 2005).

My typical experimental set-up looks like this: a lot of different wires going to every corner of the lab, left to collect gigabytes of measurement data overnight. Sometimes I do physics simulations in LabVIEW, too. So I really depend on gigaflops.

I know that there is already an idea for adding CUDA support. But not all of us have an NVIDIA GPU. Typically, at least in our lab, we have Intel i5 CPUs, and some machines have a minimal AMD graphics card (others just have integrated graphics).

So, as I was interested in getting more flops, I wrote an OpenCL DLL wrapper, and (doing a naive Mandelbrot set calculation for testing) I measured a 10x speed-up on the CPU and a 100x speed-up on the gaming GPU of my home PC (compared to the simple, multi-threaded LabVIEW implementation using parallel for loops). Now I'm using this for my projects.

What's my idea:

-Give an option to those who don't have a CUDA-capable device and/or want their app to run on any class of computing device.

-It has to be really easy to use (I have been struggling with C++ syntax and the Khronos OpenCL specification for almost 2 years in my free time to get my DLL working...)

-It has to be easy to debug (for example, it has to give human-readable, meaningful error messages instead of crashing LabVIEW or causing a BSOD)

Implemented so far, by me, for testing the idea:

-Get information on the DLL (e.g. "compiled with AMD's APP SDK on 7th August, 2013, 64 bits", or similar)

-Initialize OpenCL:

1. Select the preferred OpenCL platform and device (Fall back to any platform & CL_DEVICE_TYPE_ALL if not found)

2. Get all properties of the device (clGetDeviceInfo)

3. Create a context & a command queue

4. Compile and build OpenCL kernel source code

5. Give all details back to the user as a string (even if everything succeeded...)

-Read and write memory buffers (like GPU memory)

For now, only blocking reads and blocking writes are implemented; I had some bugs with non-blocking calls.

(again, report details to the user as a string)

-Execute a kernel on the selected arrays of data

(again, report details to the user as a string)

-Close OpenCL:

Release everything, free up memory, etc. (again, report details to the user as a string)
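The call sequence listed above (initialize with a fallback device, write, execute, read, close, each step reporting back as a string) can be sketched as a small mock. This is only an illustration under loud assumptions: `MockSession` and all its method names are hypothetical, nothing here comes from the actual DLL, and no real OpenCL calls are made. The "kernel" is an ordinary Python function and the "device" is just a name in a list.

```python
# Hypothetical mock of the call sequence. No real OpenCL calls are made and
# none of these names come from the actual DLL.

class MockSession:
    """Every method returns a human-readable report string, success or not."""

    def __init__(self):
        self.device = None
        self.kernel = None
        self.buffers = {}

    def initialize(self, available, preferred, kernel):
        # Pick the preferred device, else fall back to any available one.
        if preferred in available:
            self.device, note = preferred, "preferred device found"
        elif available:
            self.device, note = available[0], "preferred device not found, using fallback"
        else:
            return "ERROR: no device available"
        # Context creation and kernel build are only imitated here.
        self.kernel = kernel
        return f"OK: initialized on {self.device} ({note})"

    def write_buffer(self, name, data):
        if self.device is None:
            return "ERROR: write before initialize"
        self.buffers[name] = list(data)
        return f"OK: wrote {len(data)} elements to buffer '{name}'"

    def execute(self, source, target):
        if source not in self.buffers:
            return f"ERROR: buffer '{source}' does not exist"
        # The 'kernel' is an ordinary Python function applied per element.
        self.buffers[target] = [self.kernel(x) for x in self.buffers[source]]
        return f"OK: executed kernel on {len(self.buffers[source])} elements"

    def read_buffer(self, name):
        if name not in self.buffers:
            return [], f"ERROR: buffer '{name}' does not exist"
        data = list(self.buffers[name])
        return data, f"OK: read {len(data)} elements from buffer '{name}'"

    def close(self):
        count = len(self.buffers)
        self.buffers.clear()
        self.device = self.kernel = None
        return f"OK: released {count} buffers"


session = MockSession()
reports = [session.initialize(["CPU"], "GPU", lambda x: x * x)]
reports.append(session.write_buffer("input", [1.0, 2.0, 3.0]))
reports.append(session.execute("input", "output"))
values, report = session.read_buffer("output")
reports.append(report)
reports.append(session.close())
print("\n".join(reports))
```

The point of the design is that a misuse (reading a buffer that was never written, writing before initializing) comes back as a readable string rather than a crash.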

Approximate results, for your motivation (Mandelbrot set test, single precision only so far):

10 GFLOPS on a Core 2 Duo (my office PC)

16 GFLOPS on a 6-core AMD X6 1055T

typ. 50 GFLOPS on an Intel i5

180 GFLOPS on an NVIDIA GTS 450 graphics card

70 GFLOPS on an EVGA SR-2 with two Xeon L5638 CPUs (that's 24 logical cores)

520 GFLOPS on a Tesla C2050

(The numbers above are my own results; the manufacturers' spec sheets may quote far higher theoretical flops. But when selecting your device, take memory bandwidth into account, as well as the kind of parallelism in your code. Some devices dislike conditional branches in the code, and the Mandelbrot set test has conditional branches.)
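For readers who have not seen it, the per-point work of a naive Mandelbrot test can be sketched as below. This is not the kernel used for the numbers above (that code is not shown here); it is a plain Python illustration, in double precision rather than single, of where the conditional branch sits. The function name is made up for the example.

```python
def mandelbrot_iterations(cx, cy, max_iter):
    """Count iterations of z -> z*z + c until |z| > 2 or max_iter is reached."""
    zx = zy = 0.0
    for n in range(max_iter):
        # Data-dependent early exit: neighbouring points may leave the loop
        # at very different iteration counts, which some devices handle badly.
        if zx * zx + zy * zy > 4.0:
            return n
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


print(mandelbrot_iterations(0.0, 0.0, 100))
print(mandelbrot_iterations(1.0, 0.0, 100))
```

On a GPU each data point would be one work-item running this loop, so points that escape early sit idle while their neighbours keep iterating.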

Sorry for my bad English, I'm Hungarian.

I'm planning to give my code away, but I still have to clean it up and remove non-English comments...

Intaris · ‎08-07-2013

I have the feeling that we might have to wait for an update to whatever core mathematics library LV is making use of to get this.

However, I agree wholeheartedly with the sentiment and would love to see OpenCL included in LV.

Peti · ‎08-07-2013

"I have the feeling that we might have to wait for an update to whatever core mathematics library LV is making use of to get this."

I am not thinking of the core mathematics of LV. (GPUs offer a huge number of gigaflops, but also a lot of latency and overhead. For example, adding two arrays together is a highly parallelizable task, but it will run faster on a simple multicore CPU. But, let's say, loading the data of a laser spectrum (say 10,000 elements), doing an inverse discrete Fourier transform to get the pulse shape in the time domain, and adding this pulse to itself with different delays to simulate the autocorrelation curve is a lot of calculation. Then the results, also about 10k elements, are read back.)
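To make the spectrum example concrete, here is a minimal sketch of that kind of calculation. It rests on several assumptions that are not in the post: a naive O(N²) inverse DFT instead of an FFT, a circular delay, and a linear (field) autocorrelation where the signal at each delay is the summed intensity of the pulse plus its delayed copy. The function names are made up, and a real run would use around 10,000 elements, not 8.

```python
import cmath


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform: spectrum -> field in time."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def autocorrelation(field):
    """For each delay, sum |E(t) + E(t - delay)|^2 over t (circular delay assumed)."""
    n = len(field)
    return [
        sum(abs(field[t] + field[(t - delay) % n]) ** 2 for t in range(n))
        for delay in range(n)
    ]


spectrum = [1.0] * 8            # flat toy spectrum, gives a single short pulse
field = inverse_dft(spectrum)
trace = autocorrelation(field)
print([round(value, 6) for value in trace])
```

Both steps cost on the order of N² operations for N input values, which is the "much calculation on small data" pattern: every output element is independent and could be one work-item.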

This idea is meant for: MUCH calculation on relatively SMALL data.

(By the way, that's why the Mandelbrot fractal calculation is a good test: >10k iterations per data point, and each iteration has, let's say, 15-20 operations.)
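A rough back-of-the-envelope comparison shows the difference. The byte counts are assumptions for illustration: single-precision values of 4 bytes each, two inputs and one output per element in both cases, and the lower bounds of 10,000 iterations and 15 operations per iteration.

```python
def ops_per_byte(ops_per_element, bytes_per_element):
    """Arithmetic intensity: operations done per byte moved to or from the device."""
    return ops_per_element / bytes_per_element


# Array addition: 1 add per element, 2 reads + 1 write of 4 bytes each (assumed).
array_add = ops_per_byte(1, 3 * 4)

# Mandelbrot: 10,000 iterations x 15 operations per point, with 2 coordinates
# in and 1 iteration count out, 4 bytes each (assumed).
mandelbrot = ops_per_byte(10_000 * 15, 3 * 4)

print(f"array add:  {array_add:.3f} ops/byte")
print(f"mandelbrot: {mandelbrot:.0f} ops/byte")
```

With so little work per byte, array addition is dominated by transfer latency and overhead, while the Mandelbrot test keeps the device busy for a long time per transferred value.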

Use cases: physics simulation, faster off-line evaluation of big measurement data.

Not a use case: if the situation is too complex, or the last percent of OpenCL performance is needed, it may be better to develop your own wrapper.

See the attached image below for the current status of my idea.

Dragis · ‎08-07-2013

What I would really like is for LabVIEW to be able to take a normal G application and efficiently map it to a GPU engine : ) But until that is available, I would like to see this.

Intaris · ‎08-07-2013

@Dragis, that's effectively what I was hinting at. Still, the ability to target OpenCL directly is definitely a good thing to work on.

Peti · ‎08-08-2013

Yes, this would be a good idea later. (After a lot of development.)

Maybe it could look like a parallelized for loop, where "i" would be get_global_id(0) or similar. Of course, not all functions could be allowed in this structure, just mathematical operations, loops, array indexing, and other array functions.
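As a sketch of that mapping, the loop body becomes a function of the index, and a launcher calls it once per index. Everything here is hypothetical: `launch` and `scale_and_offset` are invented names, and the launcher runs serially in plain Python, where a real device would run the work-items in parallel.

```python
def launch(kernel, global_size, *buffers):
    """Call the kernel once per work-item index (serially here, for illustration)."""
    for global_id in range(global_size):
        kernel(global_id, *buffers)


def scale_and_offset(i, source, result):
    # 'i' plays the role of get_global_id(0): each call touches only element i,
    # using nothing but arithmetic and array indexing.
    result[i] = 2.0 * source[i] + 1.0


source = [0.0, 1.0, 2.0, 3.0]
result = [0.0] * len(source)
launch(scale_and_offset, len(source), source, result)
print(result)
```

Because no call depends on the result of another, the iterations can run in any order, which is the same restriction a parallel for loop already imposes.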

Or it could look like a formula node with the kernel code inside. That would be easier to implement.

(In all cases, an error in and an error out are necessary; OpenCL can do crazy things if one doesn't pay enough attention.)

And then LV could automatically take care of compiling the OpenCL code and of initializing and closing resources.

(Like in the LabVIEW FPGA module.) But this (imaginary) OpenCL module should be included in the most basic license, too (because everyone has a CPU and maybe a useful graphics card...)

EDIT: I'm happy to see the 6 kudos on the idea

Yamaeda · ‎08-10-2013

I wholeheartedly agree. It's actually a bit strange that NI has opted for a proprietary standard instead of an open standard that a lot more companies have chosen to follow. Maybe they could do something VISA-like and combine both OpenCL and CUDA in general VIs, but run either depending on the card installed.

/Y

G# - Award winning reference based OOP for LV, for free! - Qestit VIPM GitHub

Qestit Systems

Peti · ‎08-11-2013

Yamaeda: "Maybe they could do something VISA-like and combine both OpenCL and CUDA in general VIs, but run either depending on the card installed."

What does "VISA-like" mean? By the way, NVIDIA's OpenCL compiler for NVIDIA GPUs is said to link to the CUDA stuff as its first step.

And on this kind of system I can get a lot of performance. (Not sure about more complicated situations. I personally prefer to run one big kernel at a time, which typically runs for some seconds up to half a minute. That's scientific simulation, not some small, easy picture transformation or graphics stuff.)

Yamaeda · ‎08-12-2013

VISA being a general collection of VIs for open/read/write/close regardless of the actual device. GPU code should be the same; we shouldn't have one CUDA implementation and one OpenCL implementation, since the commands ought to be similar enough. 🙂

/Y


Intaris · ‎08-12-2013

Having an OpenCL implementation would make CUDA irrelevant. OpenCL IS the device-independent implementation. Putting another layer of abstraction above it won't really help I think. OpenCL can target CPUs, GPUs and whatever may come in the next years as long as there's an OpenCL driver for it.

David_L · ‎08-12-2013

I don't know a whole lot about OpenCL, but have you tried the new OpenCLV toolkit on the LabVIEW Tools Network? This is not full integration with the LabVIEW IDE, but it might be a good stop-gap.


Add OpenCL support