10-10-2017 10:41 PM
If I find some time later, I can put together a benchmarking VI, CPU vs GPU. For sure, up to a certain size you are better off with the CPU, since data copies between the video card and the host RAM also take time.
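Just to make that copy-overhead point concrete, here is a rough CUDA C sketch (not LabVIEW; the toolkit hides this, but it is more or less what happens underneath) that times only the host-to-device upload with CUDA events. The array size and contents are arbitrary placeholders:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 100000;                 /* 100k elements, arbitrary */
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (size_t i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Time only the host->device copy, no math at all. */
    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("upload of %zu floats: %.3f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return 0;
}
```

Any fair CPU vs GPU comparison has to include this transfer time in the GPU total, not just the kernel time.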
Besides, modern CPUs are multicore, so you could even parallelize your matrix multiplications on the CPU; I guess LabVIEW can even do this automatically in some cases.
So I will try to find time to put together some benchmarks; I got interested 🙂 I have a massive gaming NVIDIA card and an Intel i7 CPU, so I just need the time to install the CUDA drivers...
10-10-2017 11:59 PM
Ok, I have installed CUDA on my home PC. If I find time in the evening, I will put together some test code...
10-11-2017 01:15 PM
By the way, do you use the 32-bit or the 64-bit version of LabVIEW?
I just realized that this issue with the GPU Toolkit still exists:
https://forums.ni.com/t5/LabVIEW/CUDA-Matrix-Multiplication-Fails/m-p/3262686/highlight/true#M951836
Since I have LV 32-bit installed on a 64-bit Windows 10 OS, along with CUDA 9.0 (which has 64-bit support only), I cannot use the CUDA libraries: no error is shown, but the CUBLAS version comes back as 0.0, indicating LabVIEW cannot load the x64 cuBLAS DLL. So if the VI you posted does not work (the result is an array of zeroes), this explains it. If you use LV 64-bit, the CUDA GPU Toolkit should function fine. So sorry, I cannot do the benchmarking yet; first I will need to install the LV 64-bit version when I find more time... (I do not want to follow the "hack" mentioned in the link above, manually copying 32-bit DLLs from an older CUDA Toolkit version.)
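In case anyone wants to check the same thing outside LabVIEW: a minimal C sketch against the cuBLAS C API (cublasCreate / cublasGetVersion, which I assume is roughly what the toolkit's version query wraps) that verifies the DLL actually loads and reports a nonzero version:

```c
#include <cublas_v2.h>
#include <stdio.h>

int main(void)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        printf("cublasCreate failed -- DLL or bitness problem?\n");
        return 1;
    }

    int version = 0;
    cublasGetVersion(handle, &version);  /* should be nonzero on a working install */
    printf("cuBLAS version: %d\n", version);

    cublasDestroy(handle);
    return 0;
}
```

A 32-bit build of this against the 64-bit cuBLAS DLL would fail right at load time, which is the same mismatch LabVIEW 32-bit runs into.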
10-17-2017 01:13 PM
I installed LV 2017 64-bit, and now it works fine with CUDA 9.0. I got cuBLAS working fine too.
So to recall our discussion: the point with GPU calculations is that it takes time to upload/download data to/from the GPU. I just made two simple test VIs, but I am not even close to considering myself a skilled benchmarker, so do not take these values too seriously 🙂
The first snippet uses just the CPU, the second the GPU. As you can see, for 100k vectors the matrix multiplication is actually faster on the CPU if we compare total operation times. I played with 10M vectors too (100M killed my GPU VI for some reason, overload?); in that case the total execution times are in the same range.
However, if you can write smart code which keeps as many operations on the GPU as possible and minimizes the frequency of data copies between the host and the GPU, you can gain a lot of speed! A sketch of the idea follows below. But it all depends on your algorithm...
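Here is roughly what that looks like in plain CUDA/cuBLAS C (a sketch of the idea, not the toolkit's actual implementation; the helper name chained_gemm is made up): the inputs are uploaded once, two multiplications are chained on the device, and only the final result comes back to the host.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Computes D = (A*B)*C for n x n single-precision matrices.
   Note: cuBLAS assumes column-major storage. Error checking omitted. */
void chained_gemm(const float *A, const float *B, const float *C,
                  float *D, int n)
{
    cublasHandle_t h;
    cublasCreate(&h);

    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC, *dT, *dD;
    cudaMalloc((void **)&dA, bytes); cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes); cudaMalloc((void **)&dT, bytes);
    cudaMalloc((void **)&dD, bytes);

    /* One upload of the inputs... */
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, bytes, cudaMemcpyHostToDevice);

    const float one = 1.0f, zero = 0.0f;
    /* ...both multiplications stay on the device (T = A*B, D = T*C)... */
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dT, n);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dT, n, dC, n, &zero, dD, n);

    /* ...and one download of the final result. */
    cudaMemcpy(D, dD, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dT); cudaFree(dD);
    cublasDestroy(h);
}
```

Done naively, the same computation would copy the intermediate A*B back to the host and re-upload it, paying the transfer cost twice for no reason.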
10-17-2017 02:21 PM
But on the GPU you can (maybe?) use SGLs if you don't require the accuracy. On the CPU that is a pain, since the matrix VIs are all made for DBLs. And since uploading/downloading seems to be the bottleneck, the break-even point could be at a much lower number. It's comparing apples with pears, but if speed is important, it might help.
10-17-2017 02:30 PM
wiebe@CARYA wrote:
But on the GPU you can (maybe?) use SGLs if you don't require the accuracy. On the CPU that is a pain, since the matrix VIs are all made for DBLs. And since uploading/downloading seems to be the bottleneck, the break-even point could be at a much lower number. It's comparing apples with pears, but if speed is important, it might help.
Yep, actually older GPUs only supported the SGL data type. The CUDA VIs support SGLs, and the memory upload/download and allocation VIs support lots of data types:
The cuBLAS matrix multiplication VI supports matrices with the following data types: SGL, DBL, CSG, CDB.
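Those four presumably map onto the standard S/D/C/Z naming of the cuBLAS C API: cublasSgemm (SGL), cublasDgemm (DBL), cublasCgemm (CSG), cublasZgemm (CDB). For example, the double-precision call differs from the single-precision one above only in the scalar and pointer types (sketch, assuming the buffers are already resident on the device):

```c
#include <cublas_v2.h>

/* C = A*B for n x n doubles; all buffers already on the device (column-major). */
void gemm_dbl(cublasHandle_t h, const double *dA, const double *dB,
              double *dC, int n)
{
    const double one = 1.0, zero = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);
}
```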