FPGA DMA performance Bitfile or VI

But FXP is NOT always 64 bit on FPGA.

 

Why should 64 bits be transferred when on the other side only 8 are being used?

 

The FIFO IMHO should NOT default to 64 bits until it absolutely needs to (Which should not include the transport layer).

 

As to the network stack, this was indeed my question.  I would like a clever explanation as to why an FPGA reference opened with one type of descriptor runs faster or slower than another.

 

Shane.

Message 11 of 21

@Intaris wrote:

But FXP is NOT always 64 bit on FPGA.

 

Why should 64 bits be transferred when on the other side only 8 are being used?


If I'm correctly understanding MattN's comments, it's always transferring 64 bits because otherwise the processor would have to get involved, which somewhat defeats the purpose of DMA (direct memory transfer between peripherals with minimal coordination from the CPU).  That's a bunch of extra transfers over the bus and a scheduling hit which would decrease throughput further.  Since a typical CPU can't handle data types of arbitrary widths, it is probably easiest to store all fixed-point values in 64 bits on the host side, regardless of the actual width.
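
Roughly what I have in mind, sketched in C (purely illustrative - the function names and the 13-bit packing are my own invention, not anything NI documents): with fixed 64-bit elements the host buffer can move as one block, whereas an arbitrary width would force the CPU to unpack every element.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: with fixed 64-bit elements, the whole buffer can be
 * moved as a single block (in reality the DMA engine does even this copy). */
void receive_block_64(const uint64_t *dma_buffer, uint64_t *dest, size_t n)
{
    memcpy(dest, dma_buffer, n * sizeof(uint64_t));
}

/* With a hypothetical 13-bit packed layout, the CPU must unpack every
 * element itself - exactly the per-element work DMA is meant to avoid.
 * Assumes the packed buffer is padded so the 3-byte read never runs
 * past its end.                                                        */
void unpack_13bit(const uint8_t *packed, uint64_t *dest, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t bit     = i * 13;        /* starting bit offset of element i */
        size_t byte    = bit / 8;
        unsigned shift = bit % 8;
        uint32_t raw = (uint32_t)packed[byte]
                     | ((uint32_t)packed[byte + 1] << 8)
                     | ((uint32_t)packed[byte + 2] << 16);
        dest[i] = (raw >> shift) & 0x1FFF;   /* keep the low 13 bits */
    }
}
```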

 

Message 12 of 21

That just doesn't fly, sorry.

 

The CPU only needs to take the bits necessary, discard the others, and really transfer only 8 bits if necessary.  There's essentially no overhead involved, it's simply truncating the data since the rest is unused anyway.

 

Either way, I would seriously question the idea of representing all FXPs internally as 64-bit.  Whose brainchild was that?

 

Shane

 

PS: If that were true, then sending a U64 over a U8 FIFO would have worse performance than the FXP solution.  I'll check that tomorrow.

Message 13 of 21

@Intaris wrote:

That just doesn't fly, sorry.

 

The CPU only needs to take the bits necessary, discard the others, and really transfer only 8 bits if necessary.  There's essentially no overhead involved, it's simply truncating the data since the rest is unused anyway.


Discarding the bits IS the overhead - it keeps the CPU busy doing work while the transfer is occurring.  Look at how DMA works: the CPU initiates a transfer between two peripherals, and then is free to do other work until an interrupt indicates that the transfer completed.  If the CPU needs to process every single value before it can be transferred, then you can't do DMA anymore because the memory can't be transferred as a block.
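
To sketch the difference in C (the DMA driver calls here are invented stand-ins for this illustration, not a real NI or OS API):

```c
#include <stdint.h>
#include <stddef.h>

/* Stubs standing in for a platform-specific DMA driver; the names are
 * invented for this sketch. A real driver would program the DMA
 * controller and service its completion interrupt here.              */
static void dma_start(const void *src, void *dst, size_t bytes)
{
    (void)src; (void)dst; (void)bytes;
}
static void dma_wait_for_completion(void) { }
static void do_other_useful_work(void)   { }

/* DMA-style transfer: the CPU only sets it up, then is free until the
 * completion interrupt fires.                                          */
void transfer_with_dma(const uint64_t *src, uint64_t *dst, size_t n)
{
    dma_start(src, dst, n * sizeof(uint64_t));
    do_other_useful_work();
    dma_wait_for_completion();
}

/* If every element has to be truncated first, the CPU must touch each one,
 * so the copy can no longer run in the background as a single block.      */
void transfer_with_cpu_truncation(const uint64_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)src[i];      /* per-element work on the CPU */
}
```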


Intaris wrote:

Either way, I would seriously question the idea of representing all FXPs internally as 64-bit.  Whose brainchild was that?


Makes sense to me.  Many operations (multiplication in particular) can extend the number of bits used to represent a fixed-point.  In a byte-based processor, it would be difficult to switch from a 1-byte to 2-byte representation every time this occurs (never mind trying to do a 3-byte value - there's a reason 32-bit processors don't natively support a 24-bit integer).  It's much easier to keep everything in 64-bits and only use the bits that are needed.  On the FPGA it's all just individual lines, so it's easy to extend or reduce precision.
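
A small C illustration of that bit growth (my own example, not how LabVIEW implements FXP): two 8-bit values with 4 fractional bits each multiply into a result that needs 16 bits and 8 fractional bits, so a generously sized container avoids re-sizing after every operation.

```c
#include <stdint.h>

/* 8-bit signed fixed point with 4 fractional bits (value = raw / 16). */
typedef int8_t fxp8_4_t;

/* The raw product of two such values has 8 fractional bits and needs
 * 16 bits of storage - the word length grows with every multiply.    */
static int16_t fxp_mul(fxp8_4_t a, fxp8_4_t b)
{
    return (int16_t)a * (int16_t)b;    /* result scale: value = raw / 256 */
}

int main(void)
{
    fxp8_4_t x = 3 << 4;               /* 3.0  (raw 48)               */
    fxp8_4_t y = (1 << 4) | (1 << 3);  /* 1.5  (raw 24)               */
    int16_t  p = fxp_mul(x, y);        /* 4.5  (raw 1152 = 4.5 * 256) */
    return p == 1152 ? 0 : 1;
}
```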


Intaris wrote:

PS: If that were true, then sending a U64 over a U8 FIFO would have worse performance than the FXP solution.  I'll check that tomorrow.


I'm not sure if that's true.  Splitting the U64 into 8xU8 happens at the point at which you write to the FIFO so no CPU processing is necessary when the transfer between the FPGA and host buffers occurs.  The bandwidth available for the transfer should be the same (as demonstrated by your results, if I've understood correctly that your throughput is the rate of elements transferred, not bits, and the bit rate is independent of data type).  I don't know if pushing data into the FIFO is close to a constant-time operation regardless of the data size, or if it scales with the number of bits, but in either case the DMA transfer rate should be the same and it's just a question of whether splitting the U64 into 8 U8s limits the rate at which you can fill the FIFO.
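
In C terms, the split I had in mind looks something like this (fifo_write_u8 is just a made-up stand-in for the FPGA-side FIFO write, not a real function):

```c
#include <stdint.h>

/* Stand-in for the FPGA-side FIFO write; invented for this sketch. */
void fifo_write_u8(uint8_t element);

/* Split one U64 into eight U8 FIFO elements, least-significant byte first,
 * at the point of writing - the host side then reassembles the eight bytes. */
void write_u64_as_bytes(uint64_t value)
{
    for (int i = 0; i < 8; i++)
        fifo_write_u8((uint8_t)(value >> (8 * i)));
}
```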

Message 14 of 21

@nathand wrote:

Discarding the bits IS the overhead - it keeps the CPU busy doing work while the transfer is occurring.  Look at how DMA works: the CPU initiates a transfer between two peripherals, and then is free to do other work until an interrupt indicates that the transfer completed.  If the CPU needs to process every single value before it can be transferred, then you can't do DMA anymore because the memory can't be transferred as a block.

---

Makes sense to me.  Many operations (multiplication in particular) can extend the number of bits used to represent a fixed-point.  In a byte-based processor, it would be difficult to switch from a 1-byte to 2-byte representation every time this occurs (never mind trying to do a 3-byte value - there's a reason 32-bit processors don't natively support a 24-bit integer).  It's much easier to keep everything in 64-bits and only use the bits that are needed.  On the FPGA it's all just individual lines, so it's easy to extend or reduce precision.


 

I don't understand your answer, sorry.  The big advantage of FXP on FPGA is that it CAN have exactly 24 bits or 13 bits or 5 bits if required, thus saving space and processing power.  FPGA is NOT a byte-based architecture.  Therefore I see no logic in artificially forcing a strictly-typed DMA transfer to a much wider data type than necessary.

 

Your last statement confuses me.  If you would expect sending a U64 over a U8 FIFO (and thus discarding 56 bits of each U64) would not be slower than simply sending U8s, then I completely fail to understand how you can simultaneously think that the opposite is true for FXPs.  Sorry, I don't get it.  They're both 64 bits.

 

I don't mean to be rude, but have you done much FPGA programming?  Bit width, and bit expansion and contraction, are important optimisation techniques in FPGA, quite the opposite of the generous "A few bits can't hurt" which you seem to be proposing here.  Maybe I'm wrong, but it doesn't seem to gel with my own personal experience.

 

Shane.

Message 15 of 21

EDIT:  Posted poor info, doing some digging

Cheers!

TJ G
Message 16 of 21

@Intaris wrote:

I don't understand your answer, sorry.  The big advantage of FXP on FPGA is that it CAN have exactly 24 bits or 13 bits or 5 bits if required, thus saving space and processing power.  FPGA is NOT a byte-based architecture.  Therefore I see no logic in artificially forcing a strictly-typed DMA transfer to a much wider data type than necessary.


The FPGA isn't limited to byte-sized types, but the host side is.  On the FPGA, the fixed-point values use exactly as much space as is allocated for them.  They're only extended to 64 bits when they're put into a DMA FIFO, which allows the DMA engine to work efficiently.  The DMA engine, which handles the transfer so the CPU doesn't have to deal with it, operates in multiples of bytes.  The DMA transfer cannot work on arbitrary-width data; that would require the CPU and introduce overhead.


@Intaris wrote:

Your last statement confuses me.  If you would expect sending a U64 over a U8 FIFO (and thus discarding 56 bits of each U64) would not be slower than simply sending U8s, then I completely fail to understand how you can simultaneously think that the opposite is true for FXPs.  Sorry, I don't get it.  They're both 64 bits.


Perhaps you could clarify what you meant by sending a U64 over a U8 FIFO.  I took this to mean that on the FPGA, you have a U64 value, which you intended to split into 8 U8 values and feed into a U8 FIFO, so that the host side would receive all 64 bits, and that you wanted to compare this to writing the U64 value directly into a U64 FIFO.  It seems that wasn't what you intended.

 

If you have a U64 value and you only write one byte of it to the FIFO, discarding the other 7 bytes, then those discarded bytes are never written to the buffer and never sent over the DMA channel, so they don't affect the bit rate.  However, if you write an 8-bit FXP value, you're still sending 64 bits over the DMA channel, meaning there are 56 unused bits.  All those unused bits slow down the transfer rate because they're still going through the DMA.  This is analogous to sending a U8 over a U64 FIFO - the inverse of the situation you described - and is obviously somewhat inefficient.
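
A quick back-of-the-envelope check of that waste (the 100 MB/s figure is just an assumed raw rate for illustration):

```c
#include <stdio.h>

int main(void)
{
    const double raw_rate_MBps = 100.0;  /* assumed raw DMA rate, illustrative only */
    const double element_bits  = 64.0;   /* each FIFO element moves 64 bits         */
    const double payload_bits  = 8.0;    /* only 8 of them carry the FXP value      */

    /* Useful throughput is the raw rate scaled by the fraction of useful bits. */
    printf("useful data rate: %.1f MB/s of %.1f MB/s\n",
           raw_rate_MBps * payload_bits / element_bits, raw_rate_MBps);
    return 0;
}
```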


@Intaris wrote:

I don't mean to be rude, but have you done much FPGA programming?  Bit width, and bit expansion and contraction, are important optimisation techniques in FPGA, quite the opposite of the generous "A few bits can't hurt" which you seem to be proposing here.  Maybe I'm wrong, but it doesn't seem to gel with my own personal experience.


Yes, I have done quite a bit of FPGA programming.  I think you've misunderstood my position - maybe I wasn't clear about which operations are happening on the host, and which are happening on the FPGA.  Of course fixed-point bit width is important in optimizing on the FPGA.  As I mentioned above, the 64-bit FXP representation is only on the host, where, in fact, a few bits are more likely to help than hurt, by rounding to a byte-sized unit.

Message 17 of 21

OK, so if the DMA is hardware-limited to 64 bits or 32 bits, whichever, how come an Array of U8 can be efficiently packed before sending whereas an 8-bit FXP cannot?

 

I still think that (if it is indeed true) having ALL FXP representations on the non-FPGA side as 64 bits is a serious design flaw.  The logical approach would be coercing up to the next byte-aligned width (8, 16, 32 or 64 bits).  LabVIEW is a strictly-typed language after all.

 

By having an 8-bit FXP represented as 64 bits, an 8-bit FXP gets only a quarter of the DMA throughput of a U8.  This is a serious difference which needs to be clearly communicated.

 

Shane.

 

PS: I wasn't trying to dismiss your information with my question about whether you have programmed FPGA.  I find it's an important piece of information regarding how well you know the area.  I haven't been at it too long myself.

Message 18 of 21

It's possible that someone from NI will come back with information contradicting what I've posted - I hope I haven't extrapolated too far from the comment that all fixed-point values are stored as 64 bits.  This makes sense to me, as does limiting the units of DMA transfer.  While I'm not familiar with the details of DMA on the Intel platform, I have used it at a low level on a simpler processor (Microchip PIC).


@Intaris wrote:

OK, so if the DMA is hardware-limited to 64 bits or 32 bits, whichever, how come an Array of U8 can be efficiently packed before sending whereas an 8-bit FXP cannot?

 

I still think that (if it is indeed true) having ALL FXP representations on the non-FPGA side as 64 bits is a serious design flaw.  The logical approach would be coercing up to the next byte-aligned width (8, 16, 32 or 64 bits).  LabVIEW is a strictly-typed language after all.


Not sure what you're asking here about an array of U8 - I don't see any difference between the handling of an individual byte versus an array.

Back in older versions of LabVIEW, a boolean used to be a byte but an array of booleans was individual bits.  This isn't like that.  It sounds like each 8-bit FXP is expanded to 64 bits before being put into the DMA buffer, regardless of whether it's in an array or an individual element, so that's where the inefficiency is.  The array of U8 is just bytes, no need to expand the size.

 

Operations and memory are relatively cheap on a modern CPU.  Not that you're likely to be doing fixed-point math on the host side, but if you are and you need to extend a fixed-point value, why deal with the complexities of changing bit representation when you could just store everything as 64 bits?  Yes, LabVIEW is strictly-typed - and Fixed-Point is a 64-bit data type.  See the "How LabVIEW Stores Data in Memory" document, which says "Fixed-point numbers have a 64-bit format, signed and unsigned."
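
In C terms, that host-side container might look roughly like this (my own sketch of the idea, not NI's documented layout):

```c
#include <stdint.h>

/* Sketch only: an 8-bit signed FXP value carried in a 64-bit host
 * container by sign extension. Whatever the configured word length,
 * the host always stores (and the DMA moves) a full 64-bit element. */
uint64_t fxp8_to_host_element(int8_t fxp)
{
    return (uint64_t)(int64_t)fxp;     /* sign-extend 8 -> 64 bits */
}

int8_t host_element_to_fxp8(uint64_t element)
{
    return (int8_t)(element & 0xFF);   /* only the low 8 bits are meaningful */
}
```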


Intaris wrote:

PS: I wasn't trying to dismiss your information with my question about whether you have programmed FPGA.  I find it's an important piece of information regarding how well you know the area.  I haven't been at it too long myself.


Don't worry, I'm not offended.  I started using LabVIEW FPGA around 2005, and it was a lot more limited then.  I've never needed it for such high-speed acquisition that I noticed the difference you've discovered here, though - I've mostly used the FPGA for high-speed control.

Message 19 of 21

nathand wrote:

Not sure what you're asking here about an array of U8 - I don't see any difference between the handling of an individual byte versus an array.

Back in older versions of LabVIEW, a boolean used to be a byte but an array of booleans was individual bits.  This isn't like that.  It sounds like each 8-bit FXP is expanded to 64 bits before being put into the DMA buffer, regardless of whether it's in an array or an individual element, so that's where the inefficiency is.  The array of U8 is just bytes, no need to expand the size.



A 4M Array of U8 will be transported four times as fast as a 4M Array of U32, meaning that the U8s are being packed into a wider word for transfer.  The DMA does not send one U8 at a time but packs them together and sends several at a time.  I just wish we could do the same with FXP numbers via a FIFO, because all the information we need for reducing the data width BEFORE transfer is there.
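
Roughly what I mean by packing, in C terms (illustrative only - the real packing happens in the DMA engine/driver, not in user code):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Eight U8 elements share one 64-bit word, so the element rate scales with
 * element size even though the bit rate over the bus stays the same.      */
size_t pack_u8_into_u64_words(const uint8_t *src, size_t n, uint64_t *words)
{
    size_t n_words = (n + 7) / 8;
    memset(words, 0, n_words * sizeof(uint64_t));
    for (size_t i = 0; i < n; i++)
        words[i / 8] |= (uint64_t)src[i] << (8 * (i % 8));
    return n_words;
}
```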

 

I see now that FXP is inherently 64-bit on the host.  I didn't know that, cheers.  Whether that's a good idea or not, I'm not so sure.

 

I just did a test to see how expensive it is to discard bits BEFORE FIFO transfer.

 

If I utilise a U8 FIFO but wire up a 4M Array of U64 instead of U8 (meaning that 56 bits of each element will be discarded), the execution time rises from 40 ms to 53 ms.  This tells me that the bits are stripped BEFORE transfer.  Sending U64 over a U64 FIFO takes 246 ms, meaning that discarding the bits is a lot faster than stuffing the FIFO with useless data.  In addition, only the time taken to fill the FIFO changes with bit size over a U8 FIFO; the read time stays the same.

 

Shane.

 

Message 20 of 21