Here are the slides from today on large datasets; I hope they gave you some useful ideas. You can view the speaker notes using the options button on Google Drive at https://docs.google.com/presentation/d/18aS8gXcMtLelJmKy_FTYs9t-dNq8o1v5LqxylvQegmQ/pub?start=false&..., I've made them user friendly!
www.wiresmithtech.com/blog - I will be putting this there as well and hope to post some information on what I find with MongoDB when I do.
http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/ - a beginner's guide to big O notation.
http://bigocheatsheet.com/ - an example of the scalability of some common CS algorithms.
https://www.coursera.org/course/algs4partI - This is the algorithms course I started online. It's quite involved and requires some Java use! I don't know whether there are similar courses available in other languages.
Hi James, thank you for the presentation. It was an interesting sample case, very similar to a project that PTP had just developed for my old company.
One thing I have since been wondering about is the method for storing the rolling data. The lossy enqueue seems to be the simplest method (in terms of coding). The second method I'd considered was overwriting an array.
In my early days of LabVIEW I wrote something that inserted an element onto an array and then deleted the first element, but this quickly consumed RAM. My assumption was that the RAM storing the deleted elements wasn't being released, so the array was effectively travelling through memory, leaving a trail of unusable memory behind it. I'm a little concerned that the lossy enqueue may do something similar in terms of memory usage?
Hi James, Thanks for the presentation. As the initial proposer of the session I was interested to see what you would present and it was quite interesting.
One of the things I'm looking at, and have an interest in, is an API for storing and retrieving data in my application - essentially having 'data' coming from multiple sources (e.g. DAQ, CAN). This was previously attempted by another developer using variant attributes and DVRs, but it lacked some of the features I needed that would have made it reusable.
This is kind of what I've got so far:
The idea is that my user interface and periodic logging can use this API to access the data.
The way it works under the hood is by storing a variant in a DVR with the DVR reference held in an FGV. My 'data' is stored as variant attributes to allow fast lookups based on the name of the data (e.g. a CAN signal name or sensor/actuator name).
In the implementation I'm basing this on, the actual data itself is stored as a variant in a DVR and the DVR reference is stored in the main variant (so you have a variant with variant attributes that are DVRs to variants - wow that's confusing!). I'm not sure if that is overkill or if it's needed to stop the main variant from growing too large or causing lots of unnecessary memory allocations.
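Since LabVIEW is graphical, here is a rough text-form analogue of that layout in Python (all names are illustrative, and the mapping is loose: a dict stands in for the variant attributes, and a per-entry lock-plus-value cell stands in for a DVR). The point it sketches is why the extra reference layer helps: the lookup table only ever holds small references, so writing one signal neither copies nor locks the whole structure:

```python
import threading

class SignalStore:
    """Rough analogue of a variant-attribute lookup whose entries are
    DVR-like reference cells rather than the values themselves."""

    def __init__(self):
        self._lock = threading.Lock()   # protects the name -> cell table
        self._cells = {}                # name -> [lock, value] "DVR" cell

    def write(self, name, value):
        with self._lock:
            # The table maps names to cells; creating a cell is the only
            # operation that touches the table itself.
            cell = self._cells.setdefault(name, [threading.Lock(), None])
        with cell[0]:                   # only this signal is locked, not the table
            cell[1] = value

    def read(self, name):
        with self._lock:
            cell = self._cells[name]
        with cell[0]:
            return cell[1]
```

In this sketch the table never grows with the data, only with the number of signals, which is the analogue of the main variant staying small because it holds DVRs rather than the data itself.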
My next steps are to investigate typing of the data using classes: a base data class (variant + some basic signal information, with a base 'to string' function for logging) and sub-classes for the data types I'm interested in that convert the variant to the appropriate data type. A non-OO version would store the type information as an enum/string.
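The class idea could be sketched like this (Python standing in for LabVIEW classes; the signal names and formats are made up): the base class carries the common signal information and the 'to string' used by the logger, and dynamic dispatch picks the right formatting per concrete type:

```python
class Signal:
    """Base class: common signal information plus a to-string for logging."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def to_string(self):
        # Default formatting; subclasses override this (dynamic dispatch).
        return str(self.value)

class NumericSignal(Signal):
    def to_string(self):
        return "%.3f" % self.value

class BooleanSignal(Signal):
    def to_string(self):
        return "TRUE" if self.value else "FALSE"

# The logger only ever sees the base Signal type:
for s in (NumericSignal("EngineSpeed", 1500.0), BooleanSignal("IgnitionOn", True)):
    print(s.name, s.to_string())
```

The non-OO equivalent would carry an enum alongside the value and switch on it inside a single formatting function; the class version just moves that case structure into the subclasses.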
I also want to float the idea of being able to attach queues/events to signals for lossless history or for notifying when new data has arrived.
I'm not sure if there are any serious pitfalls that I'm about to fall into as I try to scale this up.
That's a good question. You are somewhat correct about that RAM method. Every time you delete an element from the front you are changing the size of the array, forcing the array data to move in memory. It may be a complete relocation, or at the very least every element has to be copied forward one place. If the array is relocated each time, the memory becomes highly fragmented.
The best way to do it with an array is to pre-allocate the full array and then maintain a pointer/index to the oldest element. For each new data point you overwrite the oldest element and increment the index; then to get the complete buffer you read everything after the pointer, then everything before it, and join the two. This is the way I normally implement these.
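Because LabVIEW is graphical, here is the same idea sketched in Python (the class and names are just illustrative); the allocate-once array, the oldest-element index and the two-part read are exactly the steps described above, the kind of thing you would wire with Replace Array Subset and a shift register:

```python
class RingBuffer:
    """Fixed-size rolling buffer: overwrite the oldest element in place."""

    def __init__(self, size):
        self.data = [None] * size   # allocated once, never resized
        self.index = 0              # points at the oldest element
        self.count = 0              # how many slots are filled so far

    def add(self, value):
        self.data[self.index] = value                   # overwrite oldest in place
        self.index = (self.index + 1) % len(self.data)  # advance the pointer
        self.count = min(self.count + 1, len(self.data))

    def read(self):
        # Oldest-to-newest: everything after the pointer, then everything
        # before it, joined together.
        if self.count < len(self.data):
            return self.data[:self.count]
        return self.data[self.index:] + self.data[:self.index]
```

No allocation happens after construction, so the data never travels through memory the way the insert-and-delete approach does.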
The lossy queue is an interesting question as the queue's memory handling is hidden a little, but this is my understanding: a queue created with a fixed maximum size can keep reusing the same element storage, with the lossy enqueue simply replacing the oldest element rather than shifting or reallocating the data, so internally it should behave much like the circular buffer described above.
For these reasons I suspect the queues will not show as severe a memory impact compared to deleting off the front of an array.
Wow, that is weirdly similar to what we started working on in our coding session! Though my hope for that was primarily to keep it lightweight, so we ruled out subclassing different types (although it is technically possible). So the specific concern is around having all of the system data widely available, is that correct? And is performance a major concern?
I think there are three options I have seen for this sort of problem:
I hope that gives some inspiration but I will be interested to hear what particular concerns are spurring it and that will probably help pick the best solution.
Thanks for the information!
The use case for this is that we have systems that pull data from different sources (e.g. CAN over USB, CAN via cRIO, DAQ) and these signals need to be accessible from multiple places in the software (e.g. logging, various UIs). This is all PC based...
One of the key aims was for it to be flexible - the user can load up a new CAN definition file and all of the data is just...there. Hence the desire to use a name/value dictionary.
As for the performance - this is quite important - but I don't think we're doing anything earth shattering - probably reading in 300 or so CAN elements every 10ms and displaying a lot of these on the UI every 500-1000ms.
There are some other things that were important for us like being able to timestamp data so we can use timeouts.
I wanted the core storage mechanism to be data type agnostic so I can reuse this in other applications or for storing more complex data types - hence thinking about using classes (for each data type so I can use dynamic dispatch to format to string) or a type identifier.
One of the things I wasn't sure on was how to store the extra attributes I'm interested in (e.g. timestamp) - either replace my variant with a cluster of the attributes + the variant data, or store them as variant attributes.
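For what it's worth, the cluster-style option can be sketched as a small record that carries the value and its attributes together, so one read gets a consistent value + timestamp pair for the timeout check (Python again as a stand-in; the field names and the staleness helper are just illustrative):

```python
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Sample:
    """Cluster-style record: the attributes travel with the data, so a
    single read returns a consistent value + timestamp pair."""
    value: Any
    timestamp: float = field(default_factory=time.time)

    def is_stale(self, timeout_s):
        # Timeout check built on the stored timestamp.
        return (time.time() - self.timestamp) > timeout_s
```

The variant-attribute alternative would keep the value and timestamp as separate lookups, which is more flexible but means two reads that could straddle an update.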
I wanted to be able to use this in other applications - if I can crack a nice little API for this then I'm hoping it'll speed up my development of UI/logging functions if all the data is stored in the same core API.
Maybe I'm asking too much - maybe I can't have it all...