
GPU Computing


cuda pointer

Hello,

I am trying to write a DLL function that allocates CUDA memory and returns a pointer to the CUDA (device) memory.

A second function should accept this pointer and do the calculation.

I want these operations to be separate because I need to do many calculations on the same data, and I am trying to avoid repeatedly copying the same data to GPU memory (it takes a lot of time).

Q1: What should I change in my DLL to make this work? I am not able to get the pointer to "i_d" out of the "cuda_Malloc" function in LabVIEW.

Q2: How should I configure the Call Library Node when passing the i_d pointer to "kernel_cuda_calculation"?

My DLL:

   main.cpp:

      extern "C" __declspec(dllexport) int cuda_Malloc ( float *i, void *i_d, int N ){

         for( float x=0; x<N; x++ )

            i=x;


         kernel_cuda_Malloc( i, i_d, N );

         return 0;

      }

      extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ) {

         kernel_cuda_calculation( (float *)i_d, result, N );

         return 0;

      }

   simple.cu:

      __global__ void kernelTest( float *i, int N ){

        unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;

        if ( tid<N )

           i[tid] += 10;

      }

      int kernel_cuda_Malloc( float *i, void *i_d, int N ){

         cudaMalloc( (void**)&i_d, N*sizeof( float ) );

         cudaMemcpy( i_d, i, N*sizeof( float ), cudaMemcpyHostToDevice );

         return 0;  

      }

      void kernel_cuda_calculation( float *i_d, float *result, int N ){

         // 240 threads per block, enough blocks to cover all N elements
         dim3 threads; threads.x = 240;

         dim3 blocks; blocks.x = ( N/threads.x ) + 1;

         kernelTest<<< blocks, threads >>>( i_d, N );

         cudaMemcpy( result, i_d, N*sizeof( float ), cudaMemcpyDeviceToHost );

      }

The code is a modification of https://decibel.ni.com/content/docs/DOC-20353

Here is a similar question, but the poster managed to solve it himself, so there is no posted answer: http://forums.ni.com/t5/LabVIEW/CUDA-device-memory-allocation-and-returns-the-device-ptr/m-p/1963837...

Message 1 of 5

You're running into multiple issues. Let's address an architectural issue first before answering your specific questions.

Let's suppose you fix the issues with your DLL so that you can retrieve the device pointer and pass it to your kernel. The kernel call will periodically fail anyway, because the CUDA runtime has execution requirements that LabVIEW's call library node does not meet without special handling.

The problem lies in CUDA's requirement that a function's parameters must have been allocated in the same CUDA context (in v4.0 and later). To satisfy this, you have to create a context (or set a device, which implicitly creates a context) and then 'push' or 'set' that context on the current host thread before you call cudaMalloc() and your kernel(s).

This is not easy to do because LabVIEW uses any available thread to call functions from your DLL. If you don't set the context (or use the exact same host thread for each call), you'll get spurious runtime errors from CUDA.
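
To make that ordering concrete, here is a rough sketch of what a DLL would have to do if it managed the context itself. This is not toolkit code; the exported function names and the single global context are just for illustration:

   #include <cuda.h>            // CUDA driver API (contexts)
   #include <cuda_runtime.h>    // CUDA runtime API (cudaMalloc, cudaMemcpy)

   static CUcontext g_ctx = NULL;                       // created once at startup

   extern "C" __declspec(dllexport) int cuda_InitContext( int device ){
      CUdevice dev;
      cuInit( 0 );
      cuDeviceGet( &dev, device );
      return (int)cuCtxCreate( &g_ctx, 0, dev );        // create context and make it current
   }

   extern "C" __declspec(dllexport) int cuda_Malloc_InContext( float *i, void **i_d, int N ){
      CUcontext dummy;
      cuCtxPushCurrent( g_ctx );                        // bind context to whatever thread LabVIEW used
      cudaMalloc( i_d, N*sizeof( float ) );
      cudaMemcpy( *i_d, i, N*sizeof( float ), cudaMemcpyHostToDevice );
      cuCtxPopCurrent( &dummy );                        // unbind before returning to LabVIEW
      return 0;
   }

Every exported function that touches the GPU would need the same push/pop bracketing, which is exactly the bookkeeping the toolkit handles for you.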

The example code from DOC-20353 works around this by configuring the call library node to run in the UI thread. That's the easiest way to force the calls to happen from the same host thread. However, it has an unexpected consequence: data transfers to the GPU perform erratically. This is only seen after several iterations (25-100 at minimum). Although the exact cause is not known, it is known that using the UI thread is the reason for the excessive jitter.

What are your options?

The NI Labs LabVIEW GPU Computing library was originally created to help work around issues like these, but it is out of date. We're in the process of releasing the GPU Analysis Toolkit, which is designed to address what you're doing. You can request the beta at www.ni.com/beta. It includes CUDA function wrappers for device management, resource allocation, FFT and level-3 BLAS functions, as well as a GPU SDK for calling custom GPU kernels.

Now let's tackle your questions:

Q1: I don't see anything wrong with your code, so it's possible the configuration of the call library node is the culprit. The CUDA pointer parameter should be configured as an Unsigned Pointer-sized Integer and passed (a) by Value when you intend to use the raw pointer value in a function, or (b) by Pointer to Value if you are setting its value.

By using pointer-sized integers, you maintain compatibility with 32- and 64-bit versions of LabVIEW.
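
For reference, here is a rough sketch of prototypes that line up with that configuration (illustrative only, not tested against your VI; note the void** parameter, which is what lets the address produced by cudaMalloc travel back through the by-Pointer-to-Value terminal):

   #include <cuda_runtime.h>

   // wrapper from simple.cu in your post
   void kernel_cuda_calculation( float *i_d, float *result, int N );

   // i_d: Unsigned Pointer-sized Integer, passed by Pointer to Value.
   // The DLL receives the address of LabVIEW's integer and writes the
   // device address allocated by cudaMalloc back to the wire.
   extern "C" __declspec(dllexport) int cuda_Malloc( float *i, void **i_d, int N ){
      cudaMalloc( i_d, N*sizeof( float ) );
      cudaMemcpy( *i_d, i, N*sizeof( float ), cudaMemcpyHostToDevice );
      return 0;
   }

   // i_d: Unsigned Pointer-sized Integer, passed by Value.
   // The raw device pointer value is used as-is.
   extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ){
      kernel_cuda_calculation( (float *)i_d, result, N );
      return 0;
   }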

Q2: This is answered as part of Q1 above.

Additional Comments:

  • I noticed that you are setting the threads and blocks configuration inside the kernel_cuda_calculation function. These parameters can also be passed in from LabVIEW, allowing you to alter them at run time. This helps you test performance under different configurations and lets you change the parameters from the VI in case they don't work for a particular GPU (see the sketch after this list).

  • Presumably, this prototype is working towards a hybrid G/GPU-based computing model. Have you thought about how you are going to 'deallocate' this memory if the VI gets aborted or LabVIEW exits before all GPU computations are finished?

    Unlike typical LabVIEW VI behavior, most CUDA functions and custom kernels execute asynchronously. When you call a CUDA function from a DLL (or a VI), the function returns almost immediately even though the work has not yet finished running on the GPU. It is possible to 'synchronize' the context using another CUDA function, but that almost always degrades performance, sometimes significantly.
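
For the first point, here is a rough sketch of what the wrapper could look like with the launch configuration passed in from LabVIEW (parameter names are arbitrary; kernelTest is the kernel from your simple.cu):

   // kernelTest as defined in simple.cu
   __global__ void kernelTest( float *i, int N );

   // launch configuration supplied by the caller (e.g. wired in from the VI)
   void kernel_cuda_calculation( float *i_d, float *result, int N,
                                 int threadsPerBlock, int blocksPerGrid ){
      dim3 threads; threads.x = threadsPerBlock;
      dim3 blocks;  blocks.x  = ( blocksPerGrid > 0 ) ? blocksPerGrid
                                                      : ( N / threadsPerBlock ) + 1;
      kernelTest<<< blocks, threads >>>( i_d, N );
      cudaMemcpy( result, i_d, N*sizeof( float ), cudaMemcpyDeviceToHost );
   }
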
Message 2 of 5

Hello MathGuy,

thanks for the reply.

I have downloaded the GPU Analysis Toolkit beta as you suggested.

Following your advice, this is my idea:

1.  use "Initialize Device.vi" to get CUDA Context

2.  wire CUDA Context to "Allocate Memory.vi"

3.  set number of elements ( N ) in "Allocate Memory.vi"

4.  choose a type of allocated data in "Allocate Memory.vi" ( possibilities: U8, U16, U32, I8, I16, I32, SGL, DBL, CSG, CDB)

5.  wire CUDA U8 Device Ptr to call library node (no.1) as Unsigned Pointer-size Integer and pass it by Value as i_d

6.  wire array of my data to call library node (no.1) as Array Data Pointer

7.  DLL number one:

     extern "C" __declspec(dllexport) int cuda_Copy ( float *i, void i_d, int N ){                   // I am not sure if use "void i_d" or "void *i_d"

        kernel_cuda_Copy( i, i_d, N );

        return 0;

     }

     int kernel_cuda_Copy( float *i, void i_d, int N ){                                                         // same problem I am not sure if use "void i_d" or "void *i_d"

        cudaMemcpy( i_d, i, N*sizeof( float ), cudaMemcpyHostToDevice );

        return 0;  

     }

8.  wire the i_d output terminal from call library node no. 1 (where "DLL number one" is loaded) to another call library node (no. 2) where "DLL number two" is loaded

9.  wire the initialized array for results to call library node no. 2

10. configuration of call library node no. 2, where "DLL number two" is loaded:

       i_d ... Unsigned Pointer-sized Integer, passed by Value

       results ... Array Data Pointer

11. DLL number two:

    extern "C" __declspec(dllexport) int cuda_Calculation( void *i_d, float *result, int N ) {    // same question: "void i_d" or "void *i_d"?

       kernel_cuda_calculation( i_d, result, N );

       return 0;

    }

      __global__ void kernelTest( float *i, int N ){

        unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;

        if ( tid<N )

           i[tid] += 10;

      }

      void kernel_cuda_calculation( void *i_d, float *result, int N ){            // same question: "void i_d" or "void *i_d"?

         dim3 threads; threads.x = 240;

         dim3 blocks; blocks.x = ( N/threads.x ) + 1;

         kernelTest<<< blocks, threads >>>( (float *)i_d, N );

         cudaMemcpy( result, i_d, N*sizeof( float ), cudaMemcpyDeviceToHost );

      }

12. for deallocation, should I use a third DLL with "cudaFree(i_d);" (roughly the sketch below) or rather use "Free Memory.vi"?
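
For point 12, something like this minimal sketch is what I have in mind (the cuda_Free name is just an example; the cudaDeviceSynchronize call is there because of your note that kernels execute asynchronously):

    #include <cuda_runtime.h>

    extern "C" __declspec(dllexport) int cuda_Free( void *i_d ){
       cudaDeviceSynchronize();        // wait until queued GPU work has finished
       return (int)cudaFree( i_d );    // release the device buffer
    }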

Of course, all of the functions can be in the same DLL; I just divided them to keep the text clear.

I added my block diagram, but I failed because the CUDA U8 Device Ptr from "Allocate Memory.vi" is not compatible with the Unsigned Pointer-sized Integer in the call library node.

Am I at least close to the right solution?

Message 3 of 5

Well, you have the right idea, but you're mixing two very different solutions. When working with the GPU Analysis Toolkit, GPU resources like a device pointer are encapsulated in LabVIEW classes (.lvclass). This packaging of the GPU data adds the protections necessary to use the raw GPU device pointer from LabVIEW.

Getting at the U8 device pointer inside the class and then using it safely in a call to a GPU kernel requires a specific programming pattern in G. You can look at Initialize Memory in the toolkit to see how the device pointer is accessed. In fact, all of the toolkit VIs are examples of how to get, set and use GPU resources.

It's difficult to detail the specifics in an email, but here's a high level description of what you'll find in Initialize Memory:

  1. The top-level VI is responsible for 'locking' each GPU resource so that it cannot be deleted by LabVIEW before it is successfully passed on to the GPU kernel. This is done in two stages: preparing the data for locking, and then locking it by passing it to the In Place Element structure.
  2. Inside the In Place Element structure, the raw GPU data is retrieved.

  3. Once the raw data is available, it is passed to a subVI, Initialize Memory (Set Context - cudaMemset). This VI is special because it (a) sets the correct context for execution and (b) calls the GPU function. This VI must also have its execution priority set to Subroutine. This is required so that the set-context operation and the function call execute in the same host thread.

    NOTE: This is a requirement from CUDA, and it is what forces the use of the UI thread for the call library node when you do not use the toolkit to call GPU functions.

  4. In the case of Initialize Memory, the call to the GPU function is packaged in its own VI, Initialize Memory (cudaMemset). This isn't a requirement, but it helps separate the code responsible for calling the external GPU function from everything else. You'll notice that this VI must also have its execution priority set to Subroutine. However, the call library node here does not have to use the UI thread to call the function. That's because the context is properly set on the execution thread prior to invoking the external GPU function.

This is rather involved but is straightforward once you know the steps and become familiar with the SDK components. I am in the process of developing a detailed step-by-step guide for calling a custom GPU kernel. The document will be online by the time the toolkit ships but I hope to have an earlier version available before then.

Until it's available, the best approach is to find function wrappers that already exist in the toolkit that have similar inputs and outputs and use that code as a template to create your own custom wrappers. I'll do my best to help you if/when you get stuck.

You may also find it helpful to start from the FFT example, remove the additional data you don't need, and replace the FFT computation in the middle. This will also simplify your external library because functions like cudaMemcpy are already provided (i.e., Download Data and Upload Data). You won't need to call these functions from your own cuda_* functions.
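
In that setup, the external function can shrink to little more than the kernel launch itself. A rough sketch (names are illustrative; kernelTest is your kernel from simple.cu):

    // host<->device transfers are handled by the toolkit's Download Data / Upload Data VIs,
    // so the wrapper only launches the kernel on a buffer that is already on the device
    __global__ void kernelTest( float *i, int N );

    extern "C" __declspec(dllexport) int cuda_Calculation_DeviceOnly( void *i_d, int N, int threadsPerBlock ){
       dim3 threads( threadsPerBlock );
       dim3 blocks( ( N + threadsPerBlock - 1 ) / threadsPerBlock );
       kernelTest<<< blocks, threads >>>( (float *)i_d, N );
       return (int)cudaGetLastError();   // report any launch error back to LabVIEW
    }
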

Message 4 of 5

Thank you for your comprehensive answer. I really appreciate it.

With your advice I managed to run my code without problems.

I am really looking forward to the detailed step-by-step guide for calling a custom GPU kernel.

Is a release date for the guide or the toolkit known?

Regards,

Brano.

Message 5 of 5