Why does reading XML take so long?

Bob_Schor · ‎07-05-2012

The "answer" may just be that "XML is not intended for this use ...", but here's the situation:

I'm parsing a directory tree and creating an array of clusters, one for each folder in the tree. To save it in "user-friendly" format on disk, I chose XML, in part because "it is there", part because it is a "standard", and part because it "embeds" in the file enough of the LabVIEW data structure that it is easy, in principle, to recover the data in the same form as I originally wrote it (i.e. as an array of clusters with a specified TypeDef).

I pointed my routine at a directory tree with about 2700 folders. It took 14 seconds to parse the tree, about 0.1 second to write out the XML file (1.7 MB), but more than a minute and a half to read it back in and get back the original data. Incidentally, I did a comparison of the "data written" and "data read", and all 25,000 elements (the cluster had 9 elements) were identical, so at least reading and writing XML "works".

Still I'm surprised at the 800-fold discrepancy between reading and writing. I suspect this time difference is not linear with the "size of the problem", since I initially tried doing this with a larger array of 4900 elements that took 0.25 seconds to write, but I "gave up" after what seemed far too long to wait (sorry, that's so unscientific, but I think it was 3-4 minutes, at least). However, now that I see there's an 800-fold difference for my earlier example, I'll wait at least five minutes (800 * 0.25 sec = 200 sec, 5 min = 300 sec) ... (time is ticking away ...)

Patience rewarded -- it took 322 seconds, or about 1200 times slower (don't criticize my math -- I'm rounding when I report times, but use the millisecond values when I compute ratios ...). So the More You Do, the Slower It Gets.

Hmm -- let's prove this by doing a smaller folder. How about one with only 85 folders? That takes 21 msec to write, and 102 msec to read, a factor of only 5! Wow, this certainly is not a linear growth.

What is going on here? Is there an inherent problem in parsing large XML files? [I should note that I'm using a single Read from XML File(array) to get the Array (of what will turn out to be clusters), and then in a FOR loop am doing an "Unflatten from XML" to get back the clusters. I'm guessing that the FOR loop is behaving linearly, since all of the clusters are (except for their content) identical, so the "polynomial-time" part must be "Read from XML File(array)".]

Not sure why this should be the case. It would seem, to me, that since LabVIEW arrays are always "of identical elements", reading a file with 1000 "elements" and turning it into a 1000-element array should take about 10 times longer than doing this for an array of 100 elements, unless there's something being done extremely inefficiently. Is this something that can bear being examined and possibly optimized?

Bob Schor

P.S. -- I was curious enough to do more testing. I started with a "Master Data Set" of 4906 folders, then processed "nested sub-sets" (i.e. a sub-set of the master, a sub-set of the sub-set, etc.) to try to compare "apples with apples" (on a PC, of course). My sample sizes were 20, 210, 2736, and 4906 folders. The speed of writing folders was between 10K and 24K folders/sec over this range, i.e. write time was roughly linear with size. However, a similar measure for reading folders ranged from 3K folders/sec (for the 20-element set) to 15 folders/sec (for the full set), decreasing as the size increased. I plotted the data on a log-log plot and got a slope of 0.1 for writing (a slope of 0 means speed is constant, i.e. time is linear with the number of folders), but a slope of -0.95 for reading (

TailOfGon · ‎07-05-2012

I looked at the VI, Read From XML File.vi >> ParseXMLFragments.vi. This subVI does a string manipulation iteration by iteration within a loop (the shift register implies the string is changing every iteration), instead of the faster approach of just walking through the file.
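To make the difference concrete, here is a minimal Python sketch (LabVIEW block diagrams cannot be shown as text). It is not the actual ParseXMLFragments.vi logic, only an illustration of the two strategies; the `<item>` tag and both function names are made up for the example. The first version rebuilds the remaining string on every iteration, so the total amount of copying grows roughly with the square of the file size. The second only advances an offset.

```python
def split_by_chopping(text, open_tag="<item>", close_tag="</item>"):
    """Extract fragments by cutting the consumed part off the string each time.

    Every slice `remaining[...]` copies the rest of the document, so n
    fragments cost on the order of n * len(text) character copies.
    """
    items = []
    remaining = text
    while True:
        start = remaining.find(open_tag)
        if start < 0:
            break
        end = remaining.find(close_tag, start)
        if end < 0:
            break
        items.append(remaining[start + len(open_tag):end])
        remaining = remaining[end + len(close_tag):]  # copies the whole tail
    return items


def split_by_walking(text, open_tag="<item>", close_tag="</item>"):
    """Extract the same fragments by moving an offset through one string."""
    items = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)
        if start < 0:
            break
        end = text.find(close_tag, start)
        if end < 0:
            break
        items.append(text[start + len(open_tag):end])
        pos = end + len(close_tag)  # no copy of the tail, just a new offset
    return items
```

Both functions return the same list; timing them on documents of increasing size should show the first one falling behind as the input grows.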

What you could do is to develop a subVI that is a faster version of Read From XML File.vi. Looking at the behavior of the VI, it seems simple to accomplish.

TailOfGon
Certified LabVIEW Architect 2013

Norbert_B · ‎07-06-2012

Obviously, those LV functions are not designed for a large number of tags within the XML file.

I am curious about the performance of the .NET XML parser supplied by Microsoft. Does it outperform those LV functions? If yes, what is the difference?

Norbert

----------------------------------------------------------------------------------------------------
CEO: What exactly is stopping us from doing this?
Expert: Geometry
Marketing Manager: Just ignore it.

Bob_Schor · ‎07-06-2012

Norbert,

Can you "point me" to the .NET routines from Microsoft? Is there enough documentation for me to figure out how to do a "Write to File" and "Read from File" with these? [I figure it's a fairer test to use the same set of routines for reading and writing, in case the XML implementation is subtly different.]

Norbert_B · ‎07-06-2012

I don't know what is required to have MS XML available (e.g. maybe Visual Studio), but if it is available on your machine, you can use the LV .NET interface to integrate it into LV. The .NET constructor node must be configured to Microsoft.MSXML and, depending on your task, most probably to "DOMDocumentClass".

But you could also test it outside of LV; I would suggest C# itself.
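As a rough idea of what such an outside-of-LabVIEW test could look like, here is a sketch in Python using the standard library's xml.etree.ElementTree as a stand-in. This is not the .NET/MSXML parser, and the element names (`Folders`, `Folder`, `Name`, `Files`) are invented for the example rather than taken from the LabVIEW-generated file. It builds a document with n folder entries and times how long a general-purpose parser needs to read it back.

```python
import time
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_xml(n):
    """Build a test document with n hypothetical <Folder> entries."""
    rows = [
        f"<Folder><Name>{escape(f'dir_{i}')}</Name><Files>{i % 7}</Files></Folder>"
        for i in range(n)
    ]
    return "<Folders>" + "".join(rows) + "</Folders>"


def count_folders(xml_text):
    """Parse the document and return the number of <Folder> children."""
    root = ET.fromstring(xml_text)
    return len(root.findall("Folder"))


def time_parse(n):
    """Return the seconds needed to parse a document with n folders."""
    xml_text = build_xml(n)
    t0 = time.perf_counter()
    count = count_folders(xml_text)
    elapsed = time.perf_counter() - t0
    assert count == n  # sanity check: nothing was lost on the round trip
    return elapsed


if __name__ == "__main__":
    # Sizes roughly matching the folder counts mentioned in this thread.
    for n in (85, 2700, 4900):
        print(n, f"{time_parse(n):.4f} s")
```

If the times grow about in proportion to n, the quadratic behavior is specific to the LabVIEW VIs and not inherent in XML parsing.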

Norbert


Bob_Schor · ‎07-06-2012

Two more "pieces of information". First, my "thinking" was confused when I did my analysis of the XML timings. I should have simply plotted time as a function of number of elements on a log-log plot and seen if the slope was 1 (linear), 2 (quadratic), or "something else". That is, I should have supplied the attached, which shows that creation time for the data structure and for writing it as XML is, indeed, linear (writing is slightly better than linear), while reading is essentially quadratic (or, as I think I may have said, the speed is inversely proportional to the size).
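For anyone who wants to repeat the log-log check, here is a small Python sketch of the slope estimate. The timings are the rounded values quoted earlier in this thread (85, about 2700, and about 4900 folders), with 90 s used as a stand-in for "more than a minute and a half", so they are approximate and are not the data behind the attached plot. The function name is made up for the example.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log10(time) against log10(size).

    A slope near 1 means time grows linearly with size,
    a slope near 2 means it grows quadratically.
    """
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Approximate values from the posts above (assumed, rounded).
folders = [85, 2700, 4900]
write_seconds = [0.021, 0.1, 0.25]
read_seconds = [0.102, 90.0, 322.0]

if __name__ == "__main__":
    print("write slope:", round(loglog_slope(folders, write_seconds), 2))
    print("read slope: ", round(loglog_slope(folders, read_seconds), 2))
```

With these rounded numbers the read slope comes out close to 2, and the write slope stays below 1.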

Second, while I haven't followed up on Norbert's suggestion to look at the .NET XML implementation, I did try the JKI EasyXML code. It exhibits the same basic pattern as the NI code, linear in writing, quadratic in reading.

Norbert_B · ‎07-06-2012

It does not surprise me that the numbers are not linear for writing AND reading.

Reading involves a growing string (array), which requires reallocation and copying, especially as the number of items contained in the XML file increases.

Norbert


Bob_Schor · ‎07-06-2012

Oops, forgot attachment.

SteveChandler · ‎07-06-2012

Out of curiosity can you check the performance of EasyXML?

=====================
LabVIEW 2012

TailOfGon · ‎07-06-2012

Which version of LabVIEW are you using? Depending on the version, I recommend the LabVIEW XML Parser functions. I am pretty experienced in this stuff. If you are using LV 8.6 or earlier, I recommend MSXML, as the XML Parser functions did not yet support XPath at that time. Later versions (maybe not the very next version) do support it, so it's pretty easy to get your selected data.

In LabVIEW it is easier to read XML data using XPath than to write XML using the XML Parser functions. So what I suggest is to write the XML using standard string manipulation (but use the LabVIEW Escape XML String VI to escape strings properly) and to read the XML using XPath.
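The same split (write by string building with proper escaping, read by path query) can be sketched in Python for illustration. This is not LabVIEW or MSXML code: `xml.sax.saxutils.escape` plays the role of the Escape XML String VI, and ElementTree's `findall` supports only a limited XPath subset. The element names and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def write_folders_xml(folders):
    """Build XML by plain string concatenation, escaping the text content.

    `folders` is a list of (name, file_count) pairs; the layout
    <Folders><Folder><Name/><Files/></Folder></Folders> is made up here.
    """
    parts = ["<Folders>"]
    for name, file_count in folders:
        parts.append(
            f"<Folder><Name>{escape(name)}</Name><Files>{file_count}</Files></Folder>"
        )
    parts.append("</Folders>")
    return "".join(parts)


def read_folder_names(xml_text):
    """Pull out every folder name with a single path expression."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall("./Folder/Name")]
```

For example, a name such as `R&D <2012>` is written as `R&amp;D &lt;2012&gt;` and comes back unchanged when read. Without the escaping step the file would not be well-formed XML.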

Because my development environment is LV8.6, I now rely on MSXML to make use of XPath.

If you are shifting to this approach, first define what your XML is going to look like. Your current XML is probably bigger than necessary, as it relies on LabVIEW's generic functions. You can create your own XML design, and it can be much smaller.

TailOfGon
