09-23-2012 04:04 PM
I tested the 1D FFT of a 512*800 array using
1. LVGPU on an OEM NVIDIA GT 440 (144 CUDA cores, 1.5 GB 192-bit GDDR3, 594 MHz core clock, 1189 MHz shader clock); the processing time was 60 ms.
and
2. FFT.vi in a parallel for loop on an Intel i5-3570K CPU (4 cores, 3.4 GHz); the processing time was 30 ms.
So is it that the GT 440 is a low-end GPU that is not fast enough, is there some overhead in LVGPU, or am I doing it the wrong way?
I need to improve the processing time to 10 ms. Any suggestions? Maybe a GTX GPU?
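For reference, a minimal CUDA/cuFFT sketch of the kind of batched 1D FFT I am timing on the GPU side (assuming single-precision complex data; the real work goes through the LVGPU VIs, this standalone code is only an illustration):

```cuda
/* Batched 1D FFT comparable to the benchmark: 800 transforms of length 512,
 * single-precision complex data, executed in place on device memory. */
#include <cuda_runtime.h>
#include <cufft.h>

#define FFT_LEN 512
#define BATCH   800

int main(void)
{
    cufftComplex *d_data;
    size_t bytes = sizeof(cufftComplex) * FFT_LEN * BATCH;
    cudaMalloc((void **)&d_data, bytes);

    /* Plan is created once, outside any timed region. */
    cufftHandle plan;
    cufftPlan1d(&plan, FFT_LEN, CUFFT_C2C, BATCH);

    /* ... copy the input into d_data with cudaMemcpy ... */

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```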
Best,
Miao
09-24-2012 02:47 PM
I can't speak to your CPU benchmarks and use of the parallel for loop. However, I can give you insight into your GPU performance if your example is using CSG data.
Let me try to summarize the issues with your current comparison:
What should you do?
Your goal of 10 ms is reasonable, so I would keep that requirement and continue refining your solution.
10-12-2012 11:21 AM
Thank you so much for your detailed instructions. I have asked my advisor to get a better GPU, and I will post here when I get the final test results.
There is another question that confuses me; I wonder if you might help.
Can the GPU toolkit handle memory allocated by "cudaMalloc", or should all GPU memory be allocated using the VIs the GPU toolkit provides?
Is it possible to use shared memory or texture memory in my custom function?
Miao
10-15-2012 11:08 AM
Hi, MathGuy,
I found that the actual FFT speed is much faster than I previously posted. In my last timing, I mistakenly included the time to set the device and the time to make the FFT plan. So even though my current GPU is low end, it is still better than the CPU for doing the FFT. The GPU Analysis Toolkit is just fantastic!
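For anyone repeating the measurement, here is a rough sketch (assuming a cuFFT plan and device buffer created beforehand, as in my earlier post) of timing only the FFT execution with CUDA events, so that set-device and plan creation stay outside the timed region; the helper name is illustrative:

```cuda
/* Time only the FFT execution with CUDA events; device selection and
 * cufftPlan1d() are assumed to have been done beforehand. */
#include <cuda_runtime.h>
#include <cufft.h>

float time_fft_ms(cufftHandle plan, cufftComplex *d_data)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```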
But there is still some time spent transferring data from device memory to the host for display. Is there any way I can display the data in device memory directly, using LabVIEW?
Miao
10-15-2012 05:45 PM
You'll find that transferring data back to the host has to overcome two hurdles:
Rendering data on the device without transferring it back requires external code using an API such as OpenGL. If memory serves, examples that share data between CUDA and OpenGL ship with the CUDA SDK (which is now part of the toolkit installation). If not, you can find coding examples posted online by searching for 'CUDA and OpenGL'.
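As a rough idea of what those interop examples boil down to, here is a minimal sketch (assuming an existing OpenGL buffer object and a current GL context; this is illustrative CUDA runtime code, not a toolkit API):

```cuda
/* Sketch: let CUDA write directly into an OpenGL buffer object so the data
 * can be rendered without copying it back to the host. 'vbo' is assumed to
 * have been created elsewhere with glGenBuffers()/glBufferData(). */
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void fill_gl_buffer_from_cuda(unsigned int vbo)
{
    cudaGraphicsResource_t res = NULL;
    float *d_ptr = NULL;
    size_t bytes = 0;

    /* Register the GL buffer with CUDA once per buffer. */
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsMapFlagsWriteDiscard);

    /* Map before CUDA writes; unmap before OpenGL draws from it. */
    cudaGraphicsMapResources(1, &res, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&d_ptr, &bytes, res);

    /* ... launch a kernel or do a device-to-device cudaMemcpy into d_ptr ... */

    cudaGraphicsUnmapResources(1, &res, 0);
    cudaGraphicsUnregisterResource(res);
}
```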
Unfortunately, the GPU toolkit doesn't help create these OpenGL implementations. However, if you wanted to invoke one or more custom render functions based on OpenGL as part of GPU computing, that could be done from LabVIEW using the toolkit (and probably some custom components based on the toolkit SDK).
You mentioned support for textures. Performing rendering via OpenGL is a separate but related issue. The toolkit does not ship with support for the texture data type in CUDA. The primary reason is that textures (and CUDA arrays) do not support double precision data - the most common numeric type used in LabVIEW.
While the texture type is not present, the toolkit SDK is capable of supporting it. Even though I have not created an example yet, I architected the SDK so that it could.
You may find that textures aren't required. It's possible that some OpenGL functions may consume or copy data from a CUDA data buffer as-is. The examples you find should address this out of necessity.
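If a texture did turn out to be useful inside a custom function, a bare-bones sketch using the texture reference API of that CUDA generation (single precision only, which is exactly the limitation noted above; all names are illustrative) would look something like this:

```cuda
/* Sketch: bind a single-precision device buffer to a 1D texture reference
 * for cached, read-only access inside a custom kernel. Double-precision
 * data cannot be bound this way. */
#include <cuda_runtime.h>

texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;

__global__ void scale_from_texture(float *out, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = gain * tex1Dfetch(texIn, i);   /* read through the texture cache */
}

void run_scale(const float *d_in, float *d_out, int n, float gain)
{
    cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
    scale_from_texture<<<(n + 255) / 256, 256>>>(d_out, n, gain);
    cudaUnbindTexture(texIn);
}
```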
Lastly, you asked about what memory the toolkit handles. The toolkit works natively with memory allocated by cudaMalloc(); in the Driver API's C interface, this pointer is of type CUdeviceptr. According to the documentation, this type is consumable by any CUDA function based on the Runtime or Driver APIs and is used internally by the matrix and vector types exported by CUBLAS.
Functions exist to 'convert' this type to other special CUDA types such as CUDA arrays and textures, but there are certain limitations to each conversion. The documentation does a good job of explaining the trade-offs.
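To make that concrete, here is a small standalone sketch showing one cudaMalloc'd buffer passed as-is to both cuFFT and CUBLAS (illustrative only, not toolkit internals):

```cuda
/* One cudaMalloc allocation consumed by the Runtime API, cuFFT and CUBLAS
 * alike; no conversion is needed for plain device buffers. */
#include <cuda_runtime.h>
#include <cufft.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 512 * 800;
    cufftComplex *d_buf;                      /* cufftComplex is cuComplex */
    cudaMalloc((void **)&d_buf, n * sizeof(cufftComplex));

    /* Same pointer used by cuFFT ... */
    cufftHandle plan;
    cufftPlan1d(&plan, 512, CUFFT_C2C, 800);
    cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD);

    /* ... and by CUBLAS, which treats it as a dense complex vector. */
    cublasHandle_t blas;
    cublasCreate(&blas);
    cuComplex alpha = make_cuComplex(0.5f, 0.0f);
    cublasCscal(blas, n, &alpha, d_buf, 1);   /* scale the spectrum in place */

    cublasDestroy(blas);
    cufftDestroy(plan);
    cudaFree(d_buf);
    return 0;
}
```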