LabVIEW

Optimizing a search for duplicate files in a filesystem

We have a large directory that people have been dumping files into for years (~200k files), and I want to go through it and look for duplicate files. The first part of this was just scanning through and checking for duplicate file names. I wrote a simple VI to search for duplicates, but it is very inefficient because of the multiple string comparisons. Even optimized as much as I could think to, I'm only scanning about 1k files / minute. I have attached my code - any help would be appreciated.

Message 1 of 14

Oddly enough, it looks like Gerd answered my question in another thread about 3 seconds after I posted this...

 

http://forums.ni.com/t5/LabVIEW/Poor-search-performance-in-cluster-with-array-of-singles-100k/td-p/3...

 

I'll try that first.

Message 2 of 14

I wasn't able to figure out a way to use variants and attributes to get a performance boost, because I am looking for "n" instances of the file name. Setting the variant attribute to the file name will just overwrite the attribute from instance "n-1" with instance "n", leaving only one instance. Any suggestions?

Message 3 of 14

If I understand you correctly, you can make the file name the key and a numeric (a count of instances) the value. Then, for each file: get, increment, and set it. At the end, look at all the keys (Get with no inputs). Hopefully that's clear.
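
In rough Python (a dict standing in for the variant attribute lookup; purely a sketch of the counting idea, not LabVIEW code):

from collections import defaultdict

# Key = file name, value = how many times that name has been seen.
# The dict plays the role of the variant attribute table.
def count_duplicate_names(file_names):
    counts = defaultdict(int)
    for name in file_names:
        counts[name] += 1                 # "get, increment, and set"
    # "Get with no inputs": every name that appeared more than once
    return {name: n for name, n in counts.items() if n > 1}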


___________________
Try to take over the world!
Message 4 of 14

@tst wrote:

 Hopefully that's clear.


Clear as mud! I don't see how it solves searching for N values without a loop. It would make each individual search faster, but I am still iterating 200k^2 times (if I understand you correctly).

 

I decided to take a slightly less stupid approach and bundle the file name with the relative path, then sort the 1D array based on file name. Still working out the code, but I think it will improve things.
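
The shape of it, sketched in Python rather than G (illustrative only, not the actual VI):

# Sort (name, path) pairs by name, then any entry whose name matches the
# previous one is a duplicate: one O(N log N) sort plus one linear pass.
def duplicate_names_by_sorting(pairs):    # pairs: list of (file_name, relative_path)
    pairs = sorted(pairs, key=lambda p: p[0])
    dupes = []
    for prev, cur in zip(pairs, pairs[1:]):
        if cur[0] == prev[0]:
            dupes.append(cur)
    return dupes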

Message 5 of 14

I didn't look at your original code, but my thought was of a data structure which eventually has <200k keys (one for each actual file name) and then the value of each key tells you how many copies you have of that name. So basically:

 

File 1: 1

File 2: 23

File 3: 4

 

And so on.

 

If you want to track them, you could make the value a 1D array of paths instead of an int; then, for each file, get the value, append the path to the array, and set it.

 

The iteration over something like this is only once per file, and the red/black tree implementation behind the variant attribute lookup is responsible for balancing the tree of attributes and finding the right key for each of the 200k files. A red/black tree lookup is O(log N), so the whole pass should be roughly O(N log N) - much better than O(N^2).
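
As a rough sketch of that path-tracking variation (a Python dict standing in for the variant attribute table; the helper name is mine):

import os
from collections import defaultdict

# Key = file name, value = array of paths: get the value, append the path,
# set it back. Any name with more than one path is a duplicate.
def group_paths_by_name(paths):
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.basename(p)].append(p)
    return {name: ps for name, ps in groups.items() if len(ps) > 1}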


___________________
Try to take over the world!
Message 6 of 14

I think that would work. I may have to try it later for learning purposes. For now, sorting by file name and then just comparing file(i) to file(i-1) is working. Still not fast, but what was ~3 days of processing time is down to 5 minutes, which I can live with.

 

Thanks for your help.

Message 7 of 14

This is trivial using the variant attributes mentioned earlier. Attached is my method, which uses the Variant Repository XNode I developed. Not sure how the performance compares.

 

EDIT: I'm betting the thing that will take the longest is computing the MD5. A CRC32 could be computed faster and is probably just as good in this situation, or even a simple sum of the bytes of the file as a quick way of identifying it.
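
For the hashing step itself, a minimal sketch (Python, not the XNode code; chunked reads keep memory flat on bigger files):

import hashlib

# Content hash for confirming duplicates. Any cheap checksum (CRC32, byte
# sum) could be substituted, as noted above.
def md5_of_file(path, chunk_size=65536):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()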

Message 8 of 14

The files were fairly small, so the MD5 didn't take long.

 

Thanks for the link to the Variant Repository. It looks fairly powerful, though it promptly crashed my LabVIEW :). (I decided to see what would happen if I wired a cluster into the type. It accepted it, but the data out broke it.)

Message 9 of 14

Incidentally, it's also worth remembering the simpler solution - there are multiple programs around for detecting duplicate files and cleaning them up, and you could simply run one of those. I can't recommend any from experience, but I assume at least some are good.


___________________
Try to take over the world!
Message 10 of 14