LabVIEW


Why does reading XML take so long?

Solved!

@TailOfGon wrote:

Hi, inspired by smercurio_fc, I made a VI which makes use of the LabVIEW XML Parser.

 

Here is the result:

 

"20 folders" took 14 ms

"3927 folders" took 1.6 sec

"38000 folders" took 130 sec

 

My Environment:

Windows 7 32 bit, 2GB RAM, Intel Core 2 Quad CPU 2.66GHz / LabVIEW 2011 SP1


Thanks for replicating my results.  When you go up a factor of 10 in size (from 3927 to 38000 -- I'm being "sloppy" in my math), the time goes up by a factor of roughly 100 (1.6 sec to 130 sec is a factor of about 81, close to 10 squared), i.e. time grows as the square of the size, a quadratic behavior.  Generating the XML and writing it to the file grow linearly with the size.

 

I think I understand the reason for the quadratic behavior.  The LabVIEW XML parser does lots of string manipulations using the LabVIEW String functions.  These functions generally take a string and split it into several pieces.  I can certainly imagine that each time you do this, you need to "mess with memory" (allocate, garbage collect, etc).  In general, each time you do this, one string becomes 3 "pieces" (not all of which are carried forward).  I don't quite see where the quadratic behavior would arise, but could probably (given the right "inducement") come up with a mathematical model.
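
One candidate mechanism, sketched as a toy model (Python standing in for the LabVIEW string functions, and grossly simplified -- each "element" here is just a tag, and well-formed input is assumed): if the parser repeatedly splits off a piece and carries the remainder forward as a new string, the remainder copies alone are quadratic.

    # Toy model of split-heavy parsing.  The slice that rebuilds 'rest'
    # copies everything not yet parsed; done once per element, those
    # copies sum to O(n^2) characters over n elements.
    def parse_by_splitting(xml_text):
        elements = []
        rest = xml_text
        while "<" in rest:
            start = rest.index("<")
            stop = rest.index(">", start) + 1
            elements.append(rest[start:stop])   # one small piece
            rest = rest[stop:]                  # copy of the whole remainder
        return elements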

 

As it happens, this has intrigued me enough that I'm spending a few hours this weekend seeing if I can't "do a whole lot better" at parsing XML by avoiding, as much as possible, splitting strings.  I'll let you know what I find (although it might just be another embarrassment ...)

 

BS

0 Kudos
Message 21 of 37
(7,099 Views)

To see if the behavior is linear or quadratic, we need two "points".  Can you run your code on the "small" file and post its time, as well?  The question is whether the ratio of times is similar to the ratio of sizes ("folders") or to the square of that ratio.
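
For concreteness, here is the arithmetic as a quick Python check, solving time = C * size^k for k using TailOfGon's two larger timings from above:

    import math
    # Estimate k in time ~ size**k from two measurements (figures are
    # TailOfGon's timings quoted earlier in the thread).
    n1, t1 = 3927, 1.6      # folders, seconds
    n2, t2 = 38000, 130.0
    k = math.log(t2 / t1) / math.log(n2 / n1)
    print(f"time scales roughly as size^{k:.2f}")   # ~1.94, nearly quadratic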

 

Bob Schor

0 Kudos
Message 22 of 37
(7,099 Views)

@Bob_Schor wrote:

To see if the behavior is linear or quadratic, we need two "points".  Can you run your code on the "small" file and post its time, as well?  The question is whether the ratio of times is similar to the ratio of sizes ("folders") or to the square of that ratio.

 

Bob Schor


Oops, I forgot to quote -- the above request is addressed to smercurio_fc, who provided a time point for 3800 records.  Can you run your code on the same system, but use a smaller (or larger) data set, posting the results?

 

BS

0 Kudos
Message 23 of 37
(7,078 Views)

I used MSXML for much better (more nearly linear) performance.

 

20 folders / 20 ms
3927 folders / 578 ms
38160 folders / 8,864 ms

 

The program is very similar to the one using the LabVIEW XML Parser.

 

Thanks

TailOfGon
Certified LabVIEW Architect 2013
Message 24 of 37
(7,064 Views)

@TailOfGon wrote:

I used MSXML for much better (more nearly linear) performance.

 


Very nice.  This does seem to indicate that there is nothing inherent in parsing XML that should take quadratic time.  So the answer is that the particular algorithm that NI is using has, shall we say, a "design flaw".

 

BS

0 Kudos
Message 25 of 37
(7,030 Views)
Solution
Accepted by topic author Bob_Schor

Well, I now have an answer to my original question.  Stated in terms of implementing XML reading in LabVIEW (which is what NI has done with their Read from XML File VI): the algorithm is flawed, written in such a way that it exhibits quadratic, instead of linear, scaling with the size of the problem.

 

I was sufficiently intrigued by this issue that I started playing around with writing my own Read from XML File VI.  It is still "in development", meaning I'm not sure it will parse everything that the NI routine can handle, but as the attached sheet shows, it beats the pants off NI's routine when run against the same dataset.  With a slightly different data set and on a different Windows 7 (x64) PC running LabVIEW 2011 SP1 (I upgraded today to SP1 because the nice folks at NI fixed a "connector" issue I recently reported), NI's routine scales as the 1.8 power of the number of records (almost quadratic), whereas my algorithm scales as the 0.8 power (slightly better than linear).
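
For the curious, those exponents are just the slope of log(time) against log(size); a minimal least-squares sketch (the sample numbers here are made up for illustration, not my actual measurements):

    import math

    def scaling_exponent(sizes, times):
        # Least-squares slope of log(time) vs. log(size): the k in
        # time ~ size**k (k near 1 is linear, k near 2 is quadratic).
        xs = [math.log(n) for n in sizes]
        ys = [math.log(t) for t in times]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return num / sum((x - mx) ** 2 for x in xs)

    # Illustrative numbers only -- substitute measured (records, seconds) pairs.
    print(scaling_exponent([1000, 4000, 16000], [0.1, 1.4, 21.0]))   # ~1.9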

 

Both NI's code and my code work in similar ways.  We read in the XML file as text, then proceed to transform the text into a set of strings that can be passed to the Unflatten from XML VI to yield the original data.  The trick is in generating this new set of strings.

 

NI basically reads the entire XML file (which consists of multiple lines of text) into a single string, then passes that string to another VI which looks for Start and End tags, splits the string up into pieces (Start Tag, Stuff in Between, End Tag, Rest of String), assembles a new XML Element by concatenating the first three pieces (Start Tag, Stuff in Between, End Tag) along with an End-of-Line, then concatenates this onto an originally empty "final output string".  While I haven't done the rigorous analysis, I'm pretty sure the quadratic behavior comes from having a "growing string" and the need to allocate ever-longer chunks of memory to accommodate it.
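
In Python-flavored pseudocode (a stand-in for the shape of the loop, not NI's actual code; flat elements and well-formed input assumed):

    # Sketch of the growing-string pattern described above.
    def split_next_element(rest):
        # Returns (Start Tag, Stuff in Between, End Tag, Rest of String).
        a = rest.index("<")
        b = rest.index(">", a) + 1
        start_tag = rest[a:b]                        # e.g. "<Name>"
        end_tag = start_tag.replace("<", "</", 1)    # e.g. "</Name>"
        c = rest.index(end_tag, b)
        return start_tag, rest[b:c], end_tag, rest[c + len(end_tag):]

    def rebuild_by_concatenation(xml_text):
        output = ""
        rest = xml_text
        while "<" in rest:
            start_tag, middle, end_tag, rest = split_next_element(rest)
            # Concatenating onto an ever-growing string re-copies all of
            # 'output' every pass: n elements => O(n^2) characters copied.
            output = output + start_tag + middle + end_tag + "\n"
        return output

The standard linear-time fix for this shape is to collect the elements in an array and do a single concatenation at the end.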

 

My routine differs slightly.  Guessing that a lot of the quadratic behavior came from splitting strings and re-assembling them, I tried to minimize string splitting and simplify string assembly.  I first read the XML file into an "array of strings" that I call "Lines", where each "Line" is a "line" of the XML file (minus the End-of-Line).  I now leave this Lines array intact, and manipulate indices into the array.  At any one time, I have a Begin Index, an End Index, and a Length, where "Begin Index" is the index of the first "Line" I want to process, "End Index" is the index of the last line, and Length is one more than their difference (just for convenience).  I start with Begin Index = 0 (the beginning of the Array), End Index = Array Length - 1, and quit when Begin Index is >= length of the entire Array.

 

Like the NI routine, I start by looking for a Start Tag.  Like the NI routine, my search involves splitting strings, but the string being searched is not the entire XML file, only the string corresponding to the current "Line" (if I don't find a Start Tag in the current line, I go to the next line).  Once I find the Start Tag, I set the Begin Index (because I've found the Line that has the Start Tag), generate the corresponding End Tag (replace "<" with "</"), and search for it to set the End Index; then, with these two indices, I pull out the corresponding "Lines" and concatenate them.  Each of these strings now represents a single XML Element, used to build the output Array of XML Elements.
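
In rough Python-flavored pseudocode (simplified: no tag attributes, well-formed input, and names of my choosing; my actual VI handles more cases), the bookkeeping looks like this:

    # Sketch of the index-based scan: walk indices over the Lines array
    # instead of splitting the file string.
    def parse_elements(lines):
        elements = []
        begin = 0
        while begin < len(lines):
            lt = lines[begin].find("<")
            if lt < 0:
                begin += 1                             # no Start Tag; next Line
                continue
            gt = lines[begin].find(">", lt)
            start_tag = lines[begin][lt:gt + 1]        # e.g. "<Cluster>"
            end_tag = start_tag.replace("<", "</", 1)  # e.g. "</Cluster>"
            end = begin
            while end_tag not in lines[end]:
                end += 1                               # advance the End Index
            # Only this element's Lines get copied, once, into its string.
            elements.append("".join(lines[begin:end + 1]))
            begin = end + 1
        return elements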

 

Now, I do "cheat" a little.  When I was originally developing my code, and concatenating the lines of XML from <Start Tag> to <End Tag>, I forgot to put an End-of-Line at the end of every line, so my intermediate XML Elements array consisted of single-line (very long) strings, whereas the NI version had the same "XML-like" multiline appearance for each element.  This makes my file a tad smaller, meaning fewer accesses to memory.

 

Three comments.  First, if I take the NI output and replace each ">" plus End-of-Line with just ">", I get, byte for byte, the same file that my algorithm produces.  Second, both versions "Unflatten from XML" in the same way, so the extra EOLs are not necessary.  Third, I don't think (but now realize I'm not 100% certain) that either removing the EOLs from the NI code or putting them back into mine will make much of a difference (I'm prepared to be embarrassed by this assertion).

 

I want to do some more testing, and ensure that my VIs handle all of the XML "varieties" as does the NI code, but once I'm fully confident with it, I'll figure out how to release it to the Community.  [Along those lines, suggestions are welcome ...].

 

With this post, I'm going to consider that my original question is Solved, and mark this as the Solution.  I'm attaching a revised timing diagram that has both NI's and my XML reading routines so you can see the difference (provided I remember to attach it ...)

 

BS

0 Kudos
Message 26 of 37
(7,023 Views)

Mmm.  Humble Pie is just so delicious ...

 

While discussing the "slow" reading of XML files with a colleague, I realized that there are two ways to write arrays of clusters as XML.  The way I chose to do it is to (inside a For loop) flatten each cluster into XML, then pass the array of XML strings to the Write XML to File function (which is polymorphic, and accepts either a single string or an array of strings).  When I subsequently read the XML file with Read XML from File, the function (correctly) returned me an array of XML, which I (again in a For loop) unflattened to get the original array of clusters.  This latter process of reading an array of XML strings, as I showed, takes time proportional to the square of the size of the file.

 

The other way to do it is to simply present the entire array to "Flatten to XML", get a single string (rather than an array of strings), and pass this to the Write XML function.  Similarly, when the file is read, the single string is unflattened to the corresponding array of clusters.  If one examines the two XML files created by these two methods, they are identical except that the second file has <Array> and </Array> tags near the beginning and end of the file.  The other, much more significant, difference is that reading now takes about as much time as writing, and is now linear with the size of the file!
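
Schematically (illustrative tags, not the exact schema Flatten to XML produces), the two files differ only by the wrapper, but the read paths differ enormously:

    clusters = ["<Cluster>...</Cluster>"] * 3   # stand-ins for flattened records

    # Method 1: flatten each cluster in a For loop, write the array of
    # strings.  Reading back means splitting the text into n elements again.
    file_1 = "\n".join(clusters)

    # Method 2: flatten the whole array at once.  Same content, wrapped;
    # one Unflatten call consumes the file and the read stays linear.
    file_2 = "<Array>\n" + "\n".join(clusters) + "\n</Array>"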

 

A few years ago, I was thinking about generating data files for an experiment, and wanting to save (along with analog samples and time-stamped events) a description of the experiment itself, including "global" information (like the name of the subject and birthdate) and trial-by-trial information (like trial number, reaction time, details of the stimulus).  I wanted a format that was extensible, human-readable, and, if possible, "portable".  I was thinking about .ini files, but then I heard about XML.  I also heard about JKI's EasyXML, and thought "This is for me, it's easy!" (and it was).  The nice thing was that as I generated the data (the header at the beginning of the experiment, the trial information at the beginning of each trial, and summary information at the end), I could convert them to an XML string and write them to a text file, reversing the process when I wanted to read back the data.

 

Of course, the file created this way was not, truly, a legitimate XML file, as it lacked the <?xml> tag at the beginning (but I didn't know enough to realize this).  So when this latest project came up, I said to myself "Why don't I try NI's XML routines?", which were also simple to use (though I was puzzled by having a single "Write" function, implying "you only write once").  Having in mind my previous "create the XML as you generate the data", I put my "flatten to XML" inside the For loop generating the array of clusters, and the rest is history.

 

This illustrates several points (besides humility).  One is the need to have a better way to "Open XML" and "Close XML", i.e. to have a LabVIEW routine that you can use to create the "wrapper" that makes a collection of separate XML Elements a cohesive XML file.  With this, you could do what I wanted to do with my data acquisition program -- open an XML file, create headers and other data elements as needed and write them to the file after flattening them to XML (using a simple Write to Text File), then use Close XML at the end.  The existing LabVIEW Read from XML would then (correctly) read this file, giving you back an array of XML Elements that you could "pre-parse" to see just what kind of data they contained -- if a header, you would Unflatten using the Header typedef; if a Trial, Unflatten with the Trial typedef, and so on.  The "improvements" I am working on for Read from XML, which permit reading an array of elements in linear, rather than quadratic, time, would make this approach feasible, and flexible.
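
To make the idea concrete, here is a minimal sketch of such a wrapper (in Python; the names are mine, not anything NI ships -- a LabVIEW version would be three small VIs around a file refnum):

    class XmlLog:
        # Open the file once, append flattened elements as they are
        # produced, and close the wrapper at the end of the experiment.
        def __init__(self, path, root="Data"):
            self.root = root
            self.f = open(path, "w")
            self.f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            self.f.write(f"<{root}>\n")          # "Open XML"

        def write_element(self, flattened):
            self.f.write(flattened + "\n")       # e.g. a Header or Trial cluster

        def close(self):
            self.f.write(f"</{self.root}>\n")    # "Close XML"
            self.f.close()

    log = XmlLog("experiment.xml")
    log.write_element("<Header>...</Header>")    # once, at the start of the run
    log.write_element("<Trial>...</Trial>")      # once per trial
    log.close()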

 

Mmm, pie.  A very "educational" (and delicious!) experience.

 

Bob Schor

Message 27 of 37
(7,001 Views)
Have you looked at TDMS?  Maybe you need XML, but the one thing about TDMS is that it is extremely fast.
=====================
LabVIEW 2012


0 Kudos
Message 28 of 37
(6,997 Views)

You may be interested in revisiting the XML read speed topic.

I've done some work with libXML and it is at least 20x faster than any other method I've tried.

 

Have a look at this NI communities document here: Parsing XML is too slow, or is it?

 

Using your attached example, I adapted it to use libXML and could read 40960 clusters/folders in 4 seconds.
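
For a text-language comparison, Python's lxml wraps the same libxml2 C library, so the whole parse happens in compiled code (the file and tag names below are placeholders):

    from lxml import etree

    tree = etree.parse("folders.xml")        # placeholder file name
    clusters = tree.findall(".//Cluster")    # placeholder tag name
    print(len(clusters), "clusters parsed")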

Troy - CLD "If a hammer is the only tool you have, everything starts to look like a nail." ~ Maslow/Kaplan - Law of the instrument
Message 29 of 37
(6,829 Views)

Yes, since my original post, I've learned more about XML, including the fact that I might have been "pushing the boundaries" of what it was supposed to do (and thus I might have been reading and writing a "near-XML" file).

 

However, if you don't "take a chance of making a fool of yourself", you are not likely to learn (as fast) from your mistakes!

 

Bob Schor

0 Kudos
Message 30 of 37
(6,816 Views)