Thoughts on Stream-to-Disk Application and Memory Fragmentation

I've been working on a LabVIEW 8.2 app on Windows NT that performs high-speed streaming to disk of data acquired by PXI modules.  I'm running a PXI-8186 controller with 1 GB of RAM and a Seagate 5400.2 120 GB HD.  My current implementation creates a separate DAQmx task for each DAQ module in the 8-slot chassis.  I initially tried semaphore-protected Write to Binary File access to a single log file to record the data from each module, but I had problems with this once I reached the upper sampling rates of my 6120s, which is 1 MS/s, 16-bit, 4 channels per board.  At the higher sampling rates, I was not able to start off the file streaming without the DAQmx input buffers reaching their limit.  I think this might have to do with the larger initial memory allocations that are required: I have the distinct impression that making an initial request for a bunch of large memory blocks causes a large initial delay, which doesn't work well with a real-time streaming app.
 
In an effort to improve performance, I tried replacing my reentrant file-writing VI with a reentrant VI that flattens each module's data record to a string and adds it to a named queue.  In a parallel loop on the main VI, I extract the elements from that queue and write the flattened strings to the binary file.  This approach gives me better throughput than doing the semaphore-controlled write from each module's data-acquisition task, which makes sense, because each task is able to get back to acquiring data more quickly.
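The design described above (each acquisition task enqueues its record, one loop dequeues and writes) is the classic producer-consumer pattern. Since LabVIEW diagrams can't be shown here, below is a minimal sketch of the same idea in Python; the record contents, file name, and thread layout are placeholders, not the actual application:

```python
import queue
import threading

log_queue = queue.Queue()  # analogous to the LabVIEW named queue
SENTINEL = None            # tells the writer loop to shut down

def acquisition_task(board_id, records):
    # Stand-in for a DAQmx read loop: enqueue each record and
    # immediately return to acquiring, instead of blocking on the file.
    for rec in records:
        log_queue.put((board_id, rec))

def writer_loop(path):
    # Single consumer: the only place the file is touched,
    # so no semaphore is needed around the write.
    with open(path, "wb") as f:
        while True:
            item = log_queue.get()
            if item is SENTINEL:
                break
            board_id, rec = item
            f.write(rec)

writer = threading.Thread(target=writer_loop, args=("stream.bin",))
writer.start()
producers = [
    threading.Thread(target=acquisition_task,
                     args=(i, [b"\x00" * 1024] * 4))  # 4 fake 1 KB records each
    for i in range(3)
]
for p in producers:
    p.start()
for p in producers:
    p.join()
log_queue.put(SENTINEL)  # all producers done; let the writer drain and exit
writer.join()
```

The key property, as noted above, is that the producers never wait on the disk; the queue absorbs the jitter. The flip side, which matters later in this thread, is that the queue grows without bound if the writer cannot keep up.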
 
I am able to achieve a streaming rate of about 25 MB/s, running three 6120s at 1 MS/s and two 4472s at 1 kS/s.  I have the program set up so that I can run multiple data collections in sequence, i.e. acquire for 5 minutes, stop, restart, acquire for 5 minutes, etc.  This keeps the file sizes within a reasonable limit.  When I run in this mode, I can perform a couple of runs, but at some point the memory shown in Task Manager starts running away.  I have monitored the memory use of the VIs in the profiler, and do not see any of my VIs increasing their memory requirements.  What I do see is that the number of elements in the queue starts creeping up, which is probably what eventually causes the failure.
 
Because this works for multiple iterations before the memory starts to increase, I am left with only theories as to why it happens, and am looking for suggestions for improvement.
 
Here are my theories:
 
1) As the streaming process continues, the disk writes are occurring on the inner portion of the disk, resulting in less throughput. If this is what is happening, there is no solution other than a HW upgrade.  But how to tell if this is the reason?
 
2) As the program continues to run, lots of memory is being allocated/reallocated/deallocated.  The streaming queue, for instance, is shrinking and growing.  Perhaps memory is becoming too fragmented, and it's taking longer to handle the large block sizes.  My block size is 1 second of data, which can be up to a 1M x 4 x 16-bit array from each 6120's DAQmx task.  I tried adding a Request Deallocation VI when each DAQmx VI finishes, and this seemed to help between successive collections.  Before I added the VI, Task Manager would show about 7 MB more memory usage than after the previous data collection.  Now it is running about the same each time (until it starts blowing up).  To complicate matters, each flattened string can be a different size, because I am able to acquire data from each DAQ board at a different rate, so I'm not sure preallocating the queue would even matter.
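One way to make preallocation viable despite variable record sizes (an approach not from this thread, purely an illustration) is to reserve one maximum-size buffer per board and reuse it, prefixing each record with its real length. Sketched in Python with hypothetical sizes:

```python
import struct

def pack_record(buf: bytearray, payload: bytes) -> memoryview:
    """Build a record in a preallocated buffer: 4-byte little-endian
    length prefix followed by the payload.  Reusing `buf` avoids a
    fresh allocation for every record."""
    n = len(payload)
    struct.pack_into("<I", buf, 0, n)   # write the length prefix
    buf[4:4 + n] = payload              # copy payload after the prefix
    return memoryview(buf)[:4 + n]      # view of just the used portion

buf = bytearray(4 + 1024)               # sized for the largest expected record
view = pack_record(buf, b"\x01" * 100)  # a 100-byte record this time
```

Note that a consumer must copy the view's bytes out before the buffer is reused for the next record; the point of the sketch is only that variable-size records don't by themselves rule out fixed, reusable allocations.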
 
3) There is a memory leak in part of the system that I cannot monitor (such as DAQmx).  I would think this would manifest itself from the very first collection, though.
 
4) There is some threading/threadlocking relationship that changes over time.
 
Does anyone have any other theories, or comments about one of the above theories?  If memory fragmentation appears to be the culprit, how can I collect the garbage in a predictable way?
 
 
Message 1 of 21
It could be that you think you have higher throughput because the DAQmx tasks are running faster, but is the file-writing process actually able to keep up with that speed?

Do you wait for the queue to be empty before starting a new iteration?

For garbage collection: Flush the queue. It will return all remaining queue items.

Regards,
André (CLA, CLED)
Message 2 of 21
It sounds like the write is not keeping up with the read, as you suspect.  Your queues can grow in an unbounded fashion, which will eventually fail.  The root cause is that your disk is not keeping up.  At 24MBytes/sec, you may be pushing the hardware performance line.  However, you are not far off, so there are some things you can do to help.
  1. Fastest disk performance is achieved if the size of the chunks you write to disk is 65,000 bytes.  This may require you to add some double buffering code.  Note that fastest performance may also mean a 300kbyte chunk size from your data acquisition devices.  You will need to optimize and double buffer as necessary.
  2. Defragment your disk's free space before running.  Unfortunately, the native Windows disk defragmenter only defragments the files, leaving the free space scattered all over the disk.  Norton's disk utilities do a good job of defragmenting the free space as well.  There are probably other utilities that also handle this well.
  3. Put a monitor on your queues to check the size and alarm if they get too big.  Use the queue status primitive to get this information.  This can tell you how the queues are growing with time.
  4. Do you really need to flatten to string?  Unless your data acquisition types are different, use the native data array as the queue element.  You can also use multiple queues for multiple data types.  A flatten to string causes an extra memory copy and costs processing time.
  5. You can use a single-element queue as a semaphore.  The semaphore VIs are implemented with an old technology which causes a switch to the UI thread every time they are invoked.  This makes them somewhat slow.  A single-element queue does not have this problem.  Only use this if you need to go back to a semaphore model.
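To make suggestion 1 concrete: the sketch below (Python, with the 65,000-byte chunk size suggested above as a tunable constant) splits one large record into fixed-size writes instead of issuing a single large write. Whether this wins on a given disk and filesystem is something to measure rather than assume:

```python
CHUNK = 65_000  # chunk size suggested above; tune per system

def write_chunked(f, data: bytes, chunk: int = CHUNK) -> int:
    """Write `data` to an open binary file in fixed-size chunks.
    Returns the number of bytes written."""
    written = 0
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]   # last piece may be shorter
        f.write(piece)
        written += len(piece)
    return written

with open("chunked.bin", "wb") as f:
    n = write_chunked(f, b"\xab" * 4_000_000)  # e.g. a 4 MB record
```

The loop adds a little call overhead per chunk, but the total data written is identical, so it is cheap to benchmark both ways on the target machine.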
Good luck.  Let us know if we can help more.
Message 3 of 21
Thanks for the advice.  With respect to the 65K chunk size, are you saying that if I take my flattened string and make multiple calls to Write Binary File with 65K chunks, rather than a single write of, say, 4 MB, it would actually be faster?  It seems like the additional overhead required to do that would offset the benefits, but I'd be willing to try it if you think it will make a difference.  I do have multiple types of data records being written to a single file, so I either need to stick with a single queue or create one for each record type.  For now, I think I'll stick with a single queue until I can get a handle on the memory problem.  I am running a lower-speed acquisition (one that I know the file-write process can keep up with) over a longer period of time to see what happens with the queue and memory.  If the memory showing in Task Manager is slowly growing, yet the VI profiler is not showing any VIs with increasing memory, what does this mean?  Can it just be that the memory is fragmented, and the same number of allocated bytes is taking up more space?  Also, where would a named queue show up in the VI profiler?  In the VI that creates it?
Message 4 of 21
  • If you take your 4MByte chunk and write it in 65,000 byte chunks to disk it will be faster than writing a single 4MByte chunk to disk.  I don't know why, but that has been a constant in Windows for as long as I have been checking.  It is true for FAT16, FAT32, and NTFS file systems.  It is true for Windows 98, NT, and XP.  All bets are off on Linux or Mac.
  • I would believe the Task Manager before I believed the profiler.  Sometimes the profiler is incorrect.  Sometimes the memory use is in something the profiler does not look at.  Are you reusing your queue reference or creating a new one every iteration?  If you are creating a new one, do you close the old one first?  If you are creating a new one and not closing the old one, you will get unbounded, slow memory growth (and performance degradation).  Standard practice is to create the queue outside the loop, use a shift register to pass it in and use it, and close the queue when the loop exits.  I am not sure where queue memory shows up in the profiler.  My guess would be the top-level VI, but I really don't know.  You can find out by creating a queue and giving it a 4 MByte object, then looking for where the 4 MBytes shows up.
  • Once your application is up and running, you should not be allocating much memory, unless your queues are growing over time.  LabVIEW is pretty good about reusing memory in loops and queues, so this should not fragment your memory space.  LabVIEW leaves some fairly large memory holes open for use, the biggest of which is 800 MBytes to 1.1 GBytes, depending on which version of LabVIEW you use.  4 MByte chunks should easily fit.
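Combining two of the suggestions above (create the queue once and pass it in, and monitor the queue depth with the Queue Status idea from earlier in the thread), the analogous shape in a textual language might look like this sketch; `MAX_DEPTH` and the alarm are illustrative, not from the thread:

```python
import queue

MAX_DEPTH = 100  # alarm threshold; pick based on your memory budget

def consumer(log_queue, out):
    # The queue is created once by the caller and passed in,
    # rather than obtained and released on every operation.
    while True:
        depth = log_queue.qsize()  # analogous to Get Queue Status
        if depth > MAX_DEPTH:
            print(f"warning: writer falling behind, {depth} blocks queued")
        item = log_queue.get()
        if item is None:           # shutdown sentinel
            break
        out.append(item)

q = queue.Queue()                  # created once, outside the loop
for i in range(5):
    q.put(i)
q.put(None)
results = []
consumer(q, results)
```

Logging the depth over time, as suggested, shows directly whether the disk is falling behind long before Task Manager does.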

Please post some screenshots or code if you continue to have problems so we can help you better.  Most of my above comments are pure speculation.
Message 5 of 21
DFGray,
 
Thanks for your comments - I will try incorporating a loop into my file-writing routine.  The named queue that I use is obtained in the initialization portion of my code.  Once I start filling/emptying it, I do an Obtain/Enqueue (or Dequeue)/Release, which occurs in various task loops.
 
I've run some more tests at lower sample rates with longer acquisitions, and got some interesting results.  For instance, I ran three boards at 500 kHz and two DSAs at 1 kHz for two hours and 15 minutes, which almost filled up my hard drive.  Task Manager started at about 350 MB, then slowly crept up to 387 MB, which is where it remained for the duration of the acquisition.
 
I repeated the test at 800 kHz, running for 75 minutes, and the memory grew from 364 MB to 512 MB.  It took a big jump at one point in the acquisition, going from 410 MB at T+32 minutes to 493 MB at T+40 minutes.  When the acquisition completed, the memory dropped down to 367 MB.  One interesting thing to note is that I left the program running after the acquisition, went to Windows Explorer, and deleted my 90 GB file.  Task Manager memory dropped to 314 MB.  Does this mean anything to you?
 
I am running some more long-term tests while I am working on another project.  If there is a particular test you think I should run, please let me know and I'll set it up.
 
Thanks!
 
  
Message 6 of 21

Try "pre-writing" your files.

One of the slowest parts of writing to a disk, is allocating the space.

File allocation in brief:

Move heads to read index file (may already be cached).

Find free sectors and update index file.

Move heads and update directory info to reflect additional sectors.

Move heads to where data is to be written, write data.

If file is not contiguous, move heads to new location and continue writing file.

So the slow stuff is moving the heads around on the disk.

If you "pre-write" an excessively large file AND defrag the disk so that the "pre-written" file occupies contiguous sectors...

Then you will eliminate most of the disk latency (rotational latency is still present when incrementing cylinders).

So I suggest you concentrate on optimizing your disk write speed first.
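Ben's pre-writing suggestion can be sketched like this (Python; the file name and size are placeholders): write zeros up front so the filesystem allocates the space before the time-critical streaming begins, then reopen the file and overwrite it in place.

```python
def preallocate(path: str, size: int, block: int = 1 << 20) -> None:
    """Pre-write `size` bytes of zeros, one block at a time, so disk
    space is allocated before the acquisition starts."""
    zeros = b"\x00" * block
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(block, remaining)
            f.write(zeros[:n])
            remaining -= n

preallocate("prealloc.bin", 10 * (1 << 20))  # hypothetical 10 MB file

# During acquisition, open in read/write mode and overwrite in place,
# so no allocation (head movement to the index) happens mid-stream.
with open("prealloc.bin", "r+b") as f:
    f.write(b"\xff" * 1024)  # first real data block replaces the zeros
```

As Ben notes, this only pays off if the pre-written file ends up in contiguous sectors, which is why defragmenting the free space first matters.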

Just trying to help,

Ben

Retired Senior Automation Systems Architect with Data Science Automation, LabVIEW Champion, Knight of NI and Prepper
Message 7 of 21
From your description, it really sounds like your disk just is not keeping up.  The memory issue is the queue buffering your data for write.  If you are still having problems after you implement the 65,000 byte chunking and Ben's suggestion, you may need a faster disk or a RAID array.  Good luck.
Message 8 of 21

Wired,

Could you please update us on your findings?

This way we can help others in similar situations in the future.

thank you,

Ben

Message 9 of 21

Update so far: I tried the chunked writes, and did not see any discernible difference in throughput.  I first tried 65,536 bytes, because I thought maybe that was what DFGray meant, then I tried 65,000 bytes, which is what he actually wrote 🙂

I ran a low-speed test overnight (20 kHz sample rate), and when I came in in the morning, the memory in Task Manager was OK, but my timed loop that I use for pulling the data out of the log queue and writing it to disk was running much slower than it should, even after I stopped the data acquisition.

I've been running tests all day to support NI on another problem, so I didn't get to do much with this today.  I will keep this thread updated with my findings, though.  Thanks for the interest and advice thus far.

Message 10 of 21