LabVIEW


Concurrent access to TDMS file with interleaved data and only a single header (no file fragmentation)

Solved!

Hi,

 

I'd like to write interleaved data to a TDMS file and, at the same time (in a separate loop), read that data back. This is similar to what the example "Concurrent Access to TDMS File" shows; however, the main difference is that the data is interleaved. Data is written a single value at a time, to multiple channels across multiple groups. This is similar to how a CSV file would be written, with a column per channel and data written to the file one row at a time.

 

The problem with this approach is that using the standard TDMS functions causes the TDMS file to become very large, with an index file of a similar size to the TDMS file itself, i.e. the file is fragmented. I'm aware of this issue from previous projects, and it is discussed in more depth here: https://forums.ni.com/t5/forums/v3_1/forumtopicpage/board-id/170/thread-id/811943/page/1 . Despite the "one header only" feature being enabled by default, it only works properly in a very narrow set of circumstances. The solution I've used in past projects is to use the advanced TDMS functions to create the file, write the header information once, then write the interleaved data using the advanced synchronous write function. This has worked very well, resulting in very efficiently packed files that are fast to write to and can use different data types across multiple channels and multiple groups.
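To make the fragmentation mechanism concrete, here's a minimal sketch in Python using the third-party npTDMS package (an illustrative stand-in only, not the LabVIEW primitives; file names and channel counts are made up). Every write call appends a new TDMS segment carrying its own metadata, so writing one row per call accumulates far more overhead than writing the same data as one block:

# Illustration: row-at-a-time segments vs. one large block (Python + npTDMS).
import os
import numpy as np
from nptdms import TdmsWriter, ChannelObject

rows, n_channels = 10_000, 8
data = np.random.rand(rows, n_channels)

# Row-at-a-time: every write_segment() appends fresh metadata for all channels.
with TdmsWriter("fragmented.tdms") as w:
    for row in data:
        w.write_segment([ChannelObject("group", f"ch{i}", np.array([v]))
                         for i, v in enumerate(row)])

# One block per channel: the metadata is written once.
with TdmsWriter("packed.tdms") as w:
    w.write_segment([ChannelObject("group", f"ch{i}", data[:, i])
                     for i in range(n_channels)])

print(os.path.getsize("fragmented.tdms"), "vs", os.path.getsize("packed.tdms"))

The advanced approach described above (header written once, then raw interleaved appends) avoids the per-write metadata in a similar way, which is why it packs so well.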

 

However, when trying to perform concurrent TDMS access, the read loop must also use the advanced TDMS functions. The advanced TDMS synchronous read function returns the data, but it appears to be raw/interleaved, i.e. the read function is essentially useless unless you are prepared to re-implement the standard TDMS read code in a subVI to decode the file yourself... More information about that is in this post: https://forums.ni.com/t5/LabVIEW/How-to-use-TDMS-Advanced-functions-properly/m-p/2053854#M669302 .
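For what it's worth, if the channel layout of a block is known up front, the de-interleave/type-cast step itself is not a lot of code. Here is a hedged sketch of the idea in Python/numpy (assuming a simple layout of one value per channel per row, little-endian, and ignoring TDMS timestamp channels and segment headers entirely); it is not a replacement for the standard read, just the split-and-cast step that would need reimplementing:

# Sketch: de-interleaving one raw block when the channel layout is known a priori.
import numpy as np

# (name, dtype) per channel, in the order the values were written across each row
layout = [("time", "<f8"), ("temp", "<f8"), ("pressure", "<f4"), ("status", "<u4")]

def deinterleave(raw_bytes, layout):
    row_dtype = np.dtype(layout)                      # one structured record per row
    rows = np.frombuffer(raw_bytes, dtype=row_dtype)
    return {name: rows[name] for name, _ in layout}   # one column per channel

# Fabricate two interleaved rows and decode them again
demo = np.zeros(2, dtype=np.dtype(layout))
demo["temp"] = [20.5, 20.7]
print(deinterleave(demo.tobytes(), layout)["temp"])   # -> [20.5 20.7]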

 

To play with these concepts, I've modified the concurrent TDMS example (see attached and below):

 

[Image: Test - Concurrent Access to TDMS File.png]

 

Observations:

  • An advanced TDMS write can only be read by a standard TDMS read after the file has been closed, not concurrently.
  • The advanced TDMS read always returns the data raw (underlying file format), regardless of whether it is concurrent or performed after the file was closed.
  • The standard TDMS write can be read by the standard TDMS read, but:
    • The write must be set to interleaved; otherwise the data read back is still interleaved (channels mixed together).
    • With a write mode of decimated:
      • The read loop only returns correct data if no channel properties were written (no header) and the previous run of the VI used interleaved mode (or decimated mode that returned correct values). Perhaps this is a bug in the TDMS functions?
      • Reading after the file was closed always returns interleaved data.
  • Standard interleaved writes always work; however, the file is fragmented and the index file is very large.
    • For example, 10,000 values per channel in the attached VI produces a 1,944 kB file that defragments to 313 kB...

 

I can't see any way around this problem, other than using the standard interleaved writes and defragmenting the file after closing it. I'm only writing a data point every second or so, to approximately 100 to 1,000 channels spread over 5 to 10 groups, so write speed isn't an issue and the temporarily fragmented TDMS file shouldn't cause problems. It's just not ideal.

 

Does anyone have any better suggestions?

 

Thoughts/ideas regarding TDMS:

  1. Is there a way to deinterleave the data as it is written, producing a file with sequential channel values?
    1. See the double array shift registers in the write loop of the attached VI: if I was doing this in memory, I'd preallocate an array per channel, then replace each value as it comes in (see the sketch after this list). Is there a TDMS equivalent? If not, is it a feature that could be added?
    2. For a TDMS file, I imagine that would require either preallocating the file on disk (TDMS reserve functions), or having the TDMS file add fixed-length blocks (where the channel length is known) as the file expands. I'm aware that TDMS can support this, and it does so when writing large waveforms to disk.
  2. TDMS as a format:
    1. One advantage I see for using a TDMS file is that it can directly replace common formats like CSV, with the benefit of more efficient binary storage (in some cases the underlying TDMS file structure can be made to look like a binary version of a CSV when using the advanced TDMS functions). This is good overall.
    2. Compared to other formats like HDF5, TDMS lacks flexibility in its structure, i.e. everything is a 1D channel of data with limited hierarchy. The trade-off I can see for this reduced structure is efficient high-speed streaming and minimal file fragmentation while doing so, i.e. TDMS works really well as a living file with large volumes of data being constantly written to it over time.
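For reference, the in-memory pattern mentioned in point 1.1 is roughly the following (plain Python/numpy, with made-up channel names and sizes); the question is whether TDMS could offer the on-disk equivalent, e.g. via the reserve functions or the fixed-length blocks from point 1.2:

# Preallocate an array per channel, then replace one element per channel per row.
import numpy as np

channel_names = ["ch0", "ch1", "ch2"]
expected_rows = 5000                        # must be known up front (or grown in blocks)
buffers = {name: np.full(expected_rows, np.nan) for name in channel_names}

def append_row(row_index, row_values):
    """row_values: dict of channel name -> single value for this row."""
    for name, value in row_values.items():
        buffers[name][row_index] = value    # in-place replace, no reallocation

append_row(0, {"ch0": 1.0, "ch1": 2.0, "ch2": 3.0})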

I'm a little disappointed that this use case of TDMS (mimicking a CSV file with columns of data written row by row) cannot work with concurrent reads while keeping the file structure clean.

 

Gabriel

 

Message 1 of 21

Have to admit I didn't read your question completely...

 

I'd try to avoid the concurrent access completely.

 

Make a TDMS writer loop/process/actor (running in parallel). Then send the data from the two processes to it. Let the loop do the interleaving (and optional buffering, etc.) and write to the file.
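Something like this, sketched in Python as a stand-in for a G writer loop (the queue, names and channel list are all illustrative): producers only enqueue rows, and the single consumer owns the file reference, so the TDMS file itself never sees concurrent access:

# One writer owns the file; producers just enqueue rows.
import queue, threading
import numpy as np
from nptdms import TdmsWriter, ChannelObject

row_queue = queue.Queue()
STOP = object()                              # sentinel to shut the loop down

def writer_loop(path, channel_names):
    with TdmsWriter(path) as w:
        while True:
            row = row_queue.get()            # dict: channel name -> value
            if row is STOP:
                break
            w.write_segment([ChannelObject("data", name, np.array([row[name]]))
                             for name in channel_names])

names = ["ch0", "ch1"]
t = threading.Thread(target=writer_loop, args=("log.tdms", names))
t.start()
row_queue.put({"ch0": 1.0, "ch1": 2.0})      # any producer can do this
row_queue.put(STOP)
t.join()

In practice you'd also batch a number of rows per write instead of writing one at a time.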

 

As an alternative, you might be able to make a functional global (aka smart buffer, LV2 style global, etc.) to perform the task. That would also shift the problem from TDMS to VI level.

 

Instead of a functional global, I'd prefer an object (class), but it wouldn't be a beginner's class.

 

Why do you need a parallel loop in the first place? You could simply write to the file in the top loop and add the data to the graph there as well.

 

OT: All solutions would require some (highly needed) modularization in your code. How are you ever going to extend, maintain or reuse this one-liner?

Message 2 of 21

This application is actually for a plug-in actor, where all the other actors have their data aggregated and sent to this single data logging actor that takes care of writing everything to a file.

 

I already have a version of the actor that writes to CSV, and it works well. It actually does what you suggest: when receiving data, it is written to file (channels interleaved into each row) and also appended to a set of arrays in memory that a GUI event loop can access, allowing the user to perform basic X-Y plots of the data.

 

I'm currently building a clone of the CSV data logging actor/class that uses TDMS as the file format. Instead of storing the data twice, I was looking to keep it just in the TDMS file and use concurrent access when plotting/visualising. However, all the implementations I can think of have drawbacks (a fragmented TDMS file with slow reads, TDMS advanced reads returning raw data that needs to be de-interleaved, or storing a duplicate data set that consumes more memory than is required). In this application, I'm not pushing any limits and the data set is very unlikely to ever grow past 1000 channels by 2000 to 5000 values per channel, i.e. it'll likely be less than ~20 MB, so I can live with these compromises.

Message 3 of 21

Yeah I agree, the right approach would be to have another independent code module (possibly an actor or functional global) that handles file I/O. This also allows other useful things to be handled, like buffering, and only invoking the write primitive once enough values are ready to be written. This too helps prevent fragmentation.
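As a rough illustration of that buffering (Python with the npTDMS package as a stand-in; the block size, group and channel names are made up): accumulate rows and only call the write once a block is full, so each segment's metadata is shared by many values:

# Buffer rows and flush a whole block per TDMS segment.
import numpy as np
from nptdms import TdmsWriter, ChannelObject

BLOCK = 100                                  # rows per flush (tune to taste)
channel_names = ["ch0", "ch1", "ch2"]

def log_rows(path, row_iter):
    """row_iter yields dicts of channel name -> value."""
    buffer = []
    with TdmsWriter(path) as w:
        def flush():
            if not buffer:
                return
            block = {n: np.array([r[n] for r in buffer]) for n in channel_names}
            w.write_segment([ChannelObject("data", n, block[n])
                             for n in channel_names])
            buffer.clear()
        for row in row_iter:
            buffer.append(row)
            if len(buffer) >= BLOCK:
                flush()
        flush()                              # write whatever is left at the end

log_rows("buffered.tdms",
         ({"ch0": 0.1 * i, "ch1": 0.2 * i, "ch2": 0.3 * i} for i in range(250)))

The trade-off is that a crash loses at most one partial block, so the block size becomes a data-loss vs. fragmentation knob.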

Message 4 of 21

Is TDMS the right tool here?  You're not streaming high-speed data.  I use SQLite for applications like this.
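For example, a minimal sketch of that route using Python's built-in sqlite3 module (not the LabVIEW SQLite library; the schema and file name are just illustrative): one row per sample, WAL journal mode so a reader connection can query while the logger keeps inserting:

# Row-per-sample logging in SQLite; WAL lets a reader query during writes.
import sqlite3

writer = sqlite3.connect("testlog.db")
writer.execute("PRAGMA journal_mode=WAL")       # readers don't block the writer
writer.execute("""CREATE TABLE IF NOT EXISTS samples (
                    t REAL, grp TEXT, channel TEXT, value REAL)""")

def log_sample(t, group, channel, value):
    with writer:                                # commits the insert (batch these if needed)
        writer.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                       (t, group, channel, value))

log_sample(0.0, "group1", "temperature", 20.5)

# A concurrent reader (e.g. the plotting loop) uses its own connection:
reader = sqlite3.connect("testlog.db")
print(reader.execute("SELECT t, value FROM samples WHERE channel = ?",
                     ("temperature",)).fetchall())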

Message 5 of 21

@Hooovahh wrote:

Yeah I agree, the right approach would be to have another independent code module (possibly an actor or functional global) that handles file I/O. This also allows other useful things to be handled, like buffering, and only invoking the write primitive once enough values are ready to be written. This too helps prevent fragmentation.


I agree with the approach, and I've already built the solution that way (actor, single place for all file access). Once data is sent to the actor using its message queue, it gets logged to a file and never needs to leave the data logging actor. The application performs testing that lasts a few hours, with the primary goal of producing a measurement file that can later be analysed by a different application. Viewing the recorded data during the test is a secondary feature, but it helps the user monitor the test.

 

Ideally, it would be great if NI could improve/fix the TDMS API to allow:

  • The "one time header" feature to actually work properly with the standard TDMS API, when using multiple groups that each have a timestamp channel and multiple numeric channels.
  • Concurrent access between advanced writing and standard reading, i.e. allowing more control over the file being written (advanced write) while still letting the read operation decode the file structure (standard read).
    • Trying this gives error -68000 (see below), for both advanced synchronous and asynchronous writes.
    • This is something that can be done with a CSV file (specify the file structure, with a little effort to build the read/write subVIs), and it would be good if the TDMS API could replicate it.


[Image: image.png — error -68000 dialog]

 

In the meantime, I'll stick to the standard TDMS API and live with the temporarily fragmented file.

Message 6 of 21

What I don't get is why you'd keep the data in memory, write it to TDMS, and then read it from TDMS? It makes more sense to me to let all the "file creator actors" read from memory.

Message 7 of 21

wiebe@CARYA wrote:

What I don't get is why you'd keep the data in memory, write it to TDMS, and then read it from TDMS? It makes more sense to me to let all the "file creator actors" read from memory.


I think the OP is planning on NOT keeping data in memory, but instead writing to file from one component and viewing it with another.  That's a good architecture.

Message 8 of 21

@drjdpowell wrote:

wiebe@CARYA wrote:

What I don't get is why you'd keep the data in memory, write it to TDMS, and then read it from TDMS? It makes more sense to me to let all the "file creator actors" read from memory.


I think the OP is planning on NOT keeping data in memory, but instead writing to file from one component and viewing it with another.  That's a good architecture.


The measurement files may have 100 to 1000 channels, but only 2 to 5 will ever be viewed during testing (while the files are open and being written) to produce basic plots that visualise how the testing is progressing. I didn't want a duplicate set of data sitting in memory as well as on disk when only a small portion of the in-memory copy will ever be used; that solution may not scale well once the data set becomes large. Keeping the data only in memory would mean that application crashes or power failures could cause all of it to be lost. Hence, writing it to disk immediately and then reading it back appears to be the safest approach for keeping the data intact and minimising memory usage.

 

A past project I developed used the advanced TDMS functions (synchronous write) for a 24/7 data logger, which could produce several files per day, each 1-10 GB (TBs of data per year). The advanced TDMS functions worked well to reimplement the "one time header" feature, allowing a scalable solution for recording huge quantities of interleaved data. That part of TDMS works very well.

 

I was looking for a similar solution, but with concurrent reading. That's where TDMS falls apart in terms of being a scalable solution (i.e. standard TDMS fragmentation gets out of control, producing roughly 6x larger files full of metadata describing the scattered file structure). I'm not sure if there are any other file types that would also work and scale well, because the data needs to be appropriately ordered on disk to avoid fragmentation when a file is written over a long time. It's easy to solve once a file is complete: I can defragment a TDMS file, or if the data is known (final length of channels) I can write it efficiently to something like HDF5.

 

My summary is that there isn't an easy, scalable solution for this use case. I've experimented with a lot of variations, but the only thing that looks like it might work would be advanced TDMS reads (in a number of smaller blocks) coupled with knowledge of the file structure (groups, channels, data types per channel). It looks like I'd need to wrap the TDMS functions inside a class that stores a duplicate of the file structure and uses that during reads to deinterleave and correctly type cast the TDMS advanced read output. That looks a lot like reinventing the wheel, and not worth my time considering the TDMS API is meant to do this already.
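A rough sketch of what that wrapper might look like (Python/numpy only, purely illustrative; a real version would also have to handle TDMS timestamps, string channels and block boundaries): the class stores the (group, channel, data type) order that was written and uses it to split and cast each raw block read back:

# Layout-aware decoder for raw interleaved blocks.
import numpy as np

class InterleavedLayout:
    def __init__(self, channels):
        """channels: ordered list of ((group, channel), numpy dtype string)."""
        self.keys = [key for key, _ in channels]
        self.row_dtype = np.dtype([(f"f{i}", dt) for i, (_, dt) in enumerate(channels)])

    def decode(self, raw_bytes):
        """Split one raw interleaved block into a dict of (group, channel) -> array."""
        rows = np.frombuffer(raw_bytes, dtype=self.row_dtype)
        return {key: rows[f"f{i}"] for i, key in enumerate(self.keys)}

layout = InterleavedLayout([(("group1", "time"), "<f8"),
                            (("group1", "temp"), "<f8"),
                            (("group2", "count"), "<u4")])
fake_raw = np.zeros(3, dtype=layout.row_dtype).tobytes()   # stand-in for an advanced read
print(layout.decode(fake_raw)[("group1", "temp")])          # -> [0. 0. 0.]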

 

I'd really like to see NI make an example of TDMS advanced reads of interleaved data from multiple channels of mixed types. So far, the only LabVIEW examples I've seen do this with channels of the same type.

 

Message 9 of 21

@drjdpowell wrote:

wiebe@CARYA wrote:

What I don't get is why you'd keep the data in memory, write it to TDMS, and then read it from TDMS? It makes more sense to me to let all the "file creator actors" read from memory.


I think the OP is planning on NOT keeping data in memory, but instead writing to file from one component and viewing it with another.  That's a good architecture.


What I understood is supposed to happen:

+ For each new chunk of data:
  + Display
  + Write to TDMS
+ In parallel:
  + Read TDMS + write CSV

For me:

+ For each new chunk of data:
  + Display
  + Write to TDMS
  + Write to CSV

would make more sense. At least it would be easier to make.

Message 10 of 21