findimagedupes should be parallelizable
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
findimagedupes (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Binary package hint: findimagedupes
An excellent feature for findimagedupes would be hashing/analyzing multiple images at once, in parallel. Each image can be analyzed independently, and file I/O makes up a minuscule fraction of the runtime: the problem is embarrassingly parallel, so near-linear speedups should be achievable.
And the benefits are real: on large collections, the runtime can stretch to many minutes or even hours. I have 4 cores which are generally not doing much; why can't they all be used to cut the runtime by half or more?
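To illustrate why this parallelizes so cleanly: each image's fingerprint depends only on that image, so the work fans out across cores with no coordination. findimagedupes itself is a Perl script, but here is a minimal, hypothetical Python sketch of the idea, using an md5 digest of the file bytes as a stand-in for the tool's real perceptual fingerprint:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor


def fingerprint(path):
    """Stand-in for the real perceptual hash: digest the file bytes.

    In findimagedupes this step would be the image normalize/resize/
    threshold pipeline, which is where almost all the CPU time goes.
    """
    with open(path, "rb") as f:
        return path, hashlib.md5(f.read()).hexdigest()


def fingerprint_all(paths, workers=None):
    """Fingerprint many files in parallel, one worker per core by default."""
    workers = workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Results are gathered back in the parent process, so the
        # fingerprint database would only ever be written from one
        # place -- no race conditions, unlike a multi-process merge.
        return dict(pool.map(fingerprint, paths))
```

Because the per-image jobs share nothing, the only serial parts are enumerating the files and writing out the collected results, which is why near-linear speedup is plausible.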
I looked into running 4 findimagedupes processes concurrently and then using --merge to combine their results, but this is deeply hacky, and I worry about race conditions and data consistency in the resulting fingerprint database; parallelism is something the application should handle internally.
It's possible that this has already been fixed as of 2.18-3: I regularly see findimagedupes using 200-300% CPU in top, i.e. 2 or 3 of my 4 cores.