GPU Computing

Is there any plan in the near future to update the package also for 64-bit applications?

Hi MathGuy, I recently started working on the integration of my LabVIEW code with CUDA. I have a quick question for you: I have a Tesla C1060, which comes with 4GB of global memory, and for now I am using WinXP 32-bit, LabVIEW 2009 32-bit, the NVIDIA 32-bit drivers and toolkit (v2.3) for the GPU, and the LabVIEW GPU Computing 32-bit package which you posted. However, using a 64-bit platform would dramatically improve memory usage and overall performance. Is there any plan in the near future to update the package for 64-bit applications as well? I saw many people in the forum pointing out the same issue, and all of them had to downgrade to 32-bit in order to get the GPU working. Thanks.

0 Kudos
Message 1 of 9
(11,592 Views)

There's no doubt that large data sets perform poorly on 32-bit platforms. We've seen it as we've stretched the limits on both linear algebra and PDE solvers. However, in those cases the problem sizes were manufactured and were not practical given the application spaces. They simply overwhelmed the processing power of the CPU and the I/O bandwidth of the GPU(s).

For this reason, we haven't yet decided to support 64-bit, although it has been discussed. So far we've been able to guide users to appropriate solutions using 32-bit. In the case of GPU memory, you can still make use of the memory beyond 3GB because that data does not have to be maintained in its entirety on the CPU. Basic queues or memory blocking schemes work nicely when coupled with the (parallel) processing blocks, as sketched below.
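To make the blocking idea concrete, here is a minimal CUDA C sketch. It is not part of the LabVIEW GPU Computing package; the kernel, block size, and data source are illustrative assumptions. The host only ever holds one block at a time, so a 32-bit process stays well under its memory limit while the GPU works through the full data set block by block.

// Hypothetical sketch of a memory blocking scheme: the 32-bit host holds
// only one block of the data set at a time while the GPU processes the
// whole set block by block. Kernel and sizes are assumptions, not part of
// the LabVIEW GPU Computing package.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void process_block(float *d_data, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] = d_data[i] * 2.0f;   // placeholder computation
}

int main(void)
{
    const size_t block_elems = 8u << 20;   // 8M floats = 32MB per block
    const size_t num_blocks  = 64;         // 2GB of data overall, never all
                                           // resident on the host at once
    float *h_block = (float *)malloc(block_elems * sizeof(float));
    float *d_block;
    cudaMalloc(&d_block, block_elems * sizeof(float));

    for (size_t b = 0; b < num_blocks; ++b) {
        // A real application would fill h_block here from the acquisition
        // source (digitizer, file, queue) for block b.
        cudaMemcpy(d_block, h_block, block_elems * sizeof(float),
                   cudaMemcpyHostToDevice);
        process_block<<<(block_elems + 255) / 256, 256>>>(d_block, block_elems);
        cudaMemcpy(h_block, d_block, block_elems * sizeof(float),
                   cudaMemcpyDeviceToHost);
        // ... hand the processed block downstream, then reuse h_block ...
    }

    cudaFree(d_block);
    free(h_block);
    return 0;
}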

If your application does require that most GPU data be mirrored on the CPU simultaneously, then this approach obviously won't work. Streaming to hard disk will be horribly slow, and most ramdisks don't behave well above 3GB (if at all) under 32-bit Windows versions.

I would be interested in learning more about your application. Like I said, we have discussed support of 64-bit and the more applications we see that require it the more likely it is to come to fruition.

0 Kudos
Message 2 of 9
(6,530 Views)

Dear MathGuy

I am working with computed tomography data, i.e. I have to create, handle and display 3D images with sizes ranging between 2 and 48 GByte. I am wondering how this should be possible in a 32-bit environment. (And the commercial imaging software packages all use a 64-bit environment.)

Best regards

Quint

0 Kudos
Message 3 of 9
(6,530 Views)

When dealing with large data sets, 64-bit is a natural choice. We have defined solutions in 32-bit that address certain classes of 64-bit applications (even those dealing with tomography).

The first question that needs to be answered is:

What is the largest data size your application needs to manipulate on the CPU?

In most cases, we find that the chunks of data managed at a given time by the CPU work fine under 32-bit constraints. That doesn't mean the application design is as readable or concise as one built for 64-bit; it just means the application can work on a 32-bit OS.

Next, we ask:

In the lifetime of one set of data that doesn't fit into CPU memory on a 32-bit OS, can that data live on the GPU (or GPUs) used for processing?

This targets whether file I/O is a necessary component when implementing on a 32-bit platform. We get mixed responses on this, but when rendering is the final goal, the odds that the answer is YES - the data can live on the GPU(s) - go up.
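For readers who want to see that pattern in code, below is a minimal CUDA C sketch of the "data lives on the GPU" idea. The volume size, chunk size, and processing step are chosen purely for illustration and are not taken from any actual toolkit example: a large volume stays resident in device memory while the 32-bit host only ever touches a small staging chunk.

// Sketch: keep a large volume resident in GPU memory while the 32-bit host
// streams it in through a small staging buffer. Sizes are assumptions.
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t volume_bytes = 3ull << 30;   // 3GB volume, resident on GPU
    const size_t chunk_bytes  = 64u << 20;    // 64MB host staging buffer

    unsigned char *d_volume;
    if (cudaMalloc(&d_volume, volume_bytes) != cudaSuccess)
        return 1;                              // not enough GPU memory

    unsigned char *h_chunk = (unsigned char *)malloc(chunk_bytes);

    // Stream the volume onto the device piece by piece; the full data set
    // never exists in host memory, so the 32-bit per-process limit does not
    // come into play.
    for (size_t offset = 0; offset < volume_bytes; offset += chunk_bytes) {
        size_t n = (volume_bytes - offset < chunk_bytes)
                     ? volume_bytes - offset : chunk_bytes;
        // ... fill h_chunk from acquisition or disk for this offset ...
        cudaMemcpy(d_volume + offset, h_chunk, n, cudaMemcpyHostToDevice);
    }

    // ... launch reconstruction / rendering kernels against d_volume here ...

    free(h_chunk);
    cudaFree(d_volume);
    return 0;
}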

While this doesn't answer your direct question - how to design such a 32-bit application - I hope it offers hints as to whether it's possible in your particular case.

Darren

0 Kudos
Message 4 of 9
(6,530 Views)

My project is moving from Windows XP 32-bit to Windows 7 64-bit, so the idea that our DELL R5400s can be taken from their current 4GB up to 32GB is attractive.  We currently can get two 200MB/s streams of digitizer data into the LabVIEW domain, so I'm pushing the system at 2.63GB out of the total 4GB.  Our system plans to run three separate threads: record streams to RAID, check spectrum, and isolate signals (DDC, demodulate, decode, etc.).  Because multiple threads cannot access the data at the same time and the refresh occurs in 1 msec, we plan to make a copy for each thread, and that requires a lot of memory.  With just the basic digitizer thread and LabVIEW consuming 1.43GB, I am hoping we can achieve our goal of 3 copies.

Another advantage of 64-bit is the possibility of doubling or quadrupling I/O across the bus.  We are dealing with 16-bit I&Q data, so theoretically we could quadruple our transfer rate by sending two sets of I&Q data over a 64-bit bus.  This would be nice to implement in the LabVIEW VIs associated with GPU data transfer.  We've already played with the Black-Scholes example on a Quadro FX-1800 with its 64 processors, so I purchased a TESLA C1060 to see what it can do.  The TESLA will get one of the threads, so it can do the heavy-duty lifting that the Intel cores cannot finish.  I set up the Black-Scholes DLL with my Visual Studio 2008, and I'm not sure if it supports Windows 7 or 64-bit, so I may be getting a more current version.  I'll be monitoring this forum to see if anyone else has any success in the 64-bit Win7 domain along with what steps they took, and I'll do the same.  Good luck.
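To illustrate the packing idea in code (a plain C sketch only, with an assumed sample layout rather than the digitizer's actual format): two 16-bit I&Q pairs fit into one 64-bit word, so a single 64-bit transfer carries four 16-bit values.

// Pack/unpack two 16-bit I&Q sample pairs into one 64-bit word.
// The layout (I0 | Q0 | I1 | Q1, low to high) is an assumption.
#include <stdint.h>
#include <stdio.h>

static uint64_t pack_iq2(int16_t i0, int16_t q0, int16_t i1, int16_t q1)
{
    return  ((uint64_t)(uint16_t)i0)
          | ((uint64_t)(uint16_t)q0 << 16)
          | ((uint64_t)(uint16_t)i1 << 32)
          | ((uint64_t)(uint16_t)q1 << 48);
}

static void unpack_iq2(uint64_t w, int16_t out[4])
{
    out[0] = (int16_t)(w        & 0xFFFF);
    out[1] = (int16_t)((w >> 16) & 0xFFFF);
    out[2] = (int16_t)((w >> 32) & 0xFFFF);
    out[3] = (int16_t)((w >> 48) & 0xFFFF);
}

int main(void)
{
    int16_t out[4];
    uint64_t w = pack_iq2(-1234, 5678, 91, -23);
    unpack_iq2(w, out);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // -1234 5678 91 -23
    return 0;
}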

SDR Kid

0 Kudos
Message 5 of 9
(6,530 Views)

Re: In most cases, we find that the chunks of data managed at a given time by the CPU work fine under 32-bit constraints.

That is a trivial statement. At any given time the CPU, or any other processor (GPU...), only works with 32 or 64 bits times the number of threads/channels that run in parallel. Taking your argument further, you would only need a few kB of memory.

RAM is used because its access time is much faster than, e.g., a hard disk. This means program performance suffers when memory is low.

Re: In the lifetime of one set of data that doesn't fit into CPU memory on a 32-bit OS, can that data live on the GPU (or GPUs) used for processing?

Are you assuming that the GPU memory is bigger than the CPU memory? CPU memory is still cheaper, isn't it? And if you are using the GPU just for storing data, you might as well run a 64-bit OS, simulate a 32-bit environment in which LabVIEW runs, and write a C program for shuffling data around.

Re: While this doesn't answer your direct question - how to design such a 32-bit application

I didn't ask how to design it. I made the statement that it isn't practical. This means it is more efficient to create my own LabVIEW 64-bit GPU solution than to use a 32-bit solution that makes everything more complicated every time I have to program something. Or it is more efficient just to abandon LabVIEW altogether and use C++ instead.

The task of a programming language vendor, IMO, is to make life easier for programmers, not more complicated. Hence my question whether NI is planning to create a 64-bit GPU solution.

Best regards

quint

0 Kudos
Message 6 of 9
(6,530 Views)

Your points are well taken, and many 64-bit applications are not efficiently modeled on a 32-bit OS. The questions posted earlier are far more relevant to the set of NI customers who are reluctant to move to a 64-bit OS, or who cannot.

For example, we have a customer using LabVIEW FPGA to do volumetric reconstruction for Optical Coherence Tomography (OCT). LabVIEW FPGA only works from 32-bit versions of LabVIEW. For this reason, they used the 32-bit version of the GPU Analysis Toolkit beta to compare performance between FPGA- and GPU-based volumetric reconstruction. Although a 64-bit toolkit is available, they couldn't use it.

In this OCT application, there are some valuable takeaways regarding the system architecture and its data:

  1. Data transfers were in 4MB block sizes corresponding to 1K laser scans of depth 2K each.
  2. The FPGA & GPU reconstruction computations consume only a fraction of the processing power of either platform.
  3. Although the GPU stored the entire volume for rendering results, such a buffer on the host PC resulted in sluggish performance.

Item #3 may catch some readers off guard. The following provides insight into why this happens under a 32-bit OS:

  • While the amount of GPU memory is limited by the 32-bit OS, it is only used by processes deployed on the device.
  • While the amount of CPU memory is limited by the 32-bit OS, it gets consumed by all processes running on the host. This can easily approach 0.5-1GB when considering background processes and services.
  • On the CPU, data from large buffers must be read (in smaller chunks) into one or more processor core caches. Since these caches are shared with other processing, competition degrades performance. The same is not true on GPUs, as memory-to-stream-processor transfers are protected by the execution context.

When looking at a 64-bit OS, working with larger data sizes is no longer limited by system memory, but block size is still an important factor. Using an arbitrarily large block size can be just as detrimental to performance as one that is excessively small. Because many block sizes can deliver reasonable performance, modeling in a 64-bit OS offers more flexibility than in a 32-bit OS, as the timing sketch below illustrates.
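As an illustration of the block-size trade-off (a sketch only; the candidate sizes and the use of pinned host memory are assumptions, not toolkit behavior), the following CUDA C snippet times host-to-device transfers for a few block sizes so a reasonable one can be chosen empirically.

// Time host-to-device copies for a few candidate block sizes.
// Sizes and pinned-memory usage are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t sizes[] = { 256u << 10, 4u << 20, 64u << 20 }; // 256KB, 4MB, 64MB
    const size_t max_bytes = 64u << 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, max_bytes);   // pinned host memory
    cudaMalloc(&d_buf, max_bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 3; ++i) {
        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, sizes[i], cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %zu bytes: %.3f ms (%.1f MB/s)\n",
               sizes[i], ms, (sizes[i] / (1024.0 * 1024.0)) / (ms / 1000.0));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}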

0 Kudos
Message 7 of 9
(6,530 Views)

Are there any examples associated with the new toolkit?  If not, can you post one?  I need a little help getting started.

Randall Pursley
0 Kudos
Message 8 of 9
(6,530 Views)

The toolkit ships with a multi-channel 1D FFT example. It covers device selection and resource management in addition to FFT functionality, and is a good representative example of how to incorporate GPU computing using the toolkit.

The example is documented in detail in the online help and can be found in the LabVIEW examples directory: <labview_dir>\examples\lvgpu.
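For anyone who wants a plain CUDA C analogue of the steps that example walks through (this is not the toolkit's VI example; the FFT size, batch count, and the use of cuFFT directly are assumptions for illustration), a minimal sketch looks like this:

// Sketch of the same steps: select a device, allocate and manage resources,
// run a batched 1D FFT, and clean up. Link with -lcufft.
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int fft_size = 1024;   // samples per channel (assumption)
    const int channels = 8;      // multi-channel batch (assumption)

    cudaSetDevice(0);            // device selection

    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * fft_size * channels);
    // ... copy the acquired channel data into d_data here ...

    // Resource management: one plan reused for every channel in the batch.
    cufftHandle plan;
    cufftPlan1d(&plan, fft_size, CUFFT_C2C, channels);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward FFT
    cudaDeviceSynchronize();

    // ... copy results back or feed them to further GPU processing ...

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}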

0 Kudos
Message 9 of 9
(6,530 Views)