Slow exclude pattern matching
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Duplicity | New | Low | Unassigned |
Bug Description
Large exclude files cause duplicity to scan files very slowly. My suspicion is that the time is spent checking whether each file matches the exclude list. The problem is particularly noticeable if the exclude file includes the full prefix of every file. For example:
/home/user/
would be slower than
**/something/
I collected some timing data, and it appears that the time spent increases rapidly, but linearly, with the number of exclude patterns.
For example, with 0 patterns a backup of ~1700 files took 14s; with 40 patterns it took 72s. Based on this and the rest of the data, on this backup set every pattern adds about 1.5s to the time taken. This was tested with prefixes of the attached "excludeFromBac
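A quick back-of-envelope check of the per-pattern overhead, using only the numbers reported above (the variable names are just for illustration):

```python
# Reported timings: ~1700 files, local btrfs target.
baseline_s = 14    # backup time with 0 exclude patterns
with_40_s = 72     # backup time with 40 full-prefix exclude patterns

per_pattern_s = (with_40_s - baseline_s) / 40.0
print(per_pattern_s)  # 1.45, consistent with the ~1.5s/pattern estimate
```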
Removing the long prefixes and replacing them with ** shows a significant improvement, but there is still a noticeable slowdown, in this case about 0.1s per pattern. However, I suspect the difference would grow with more input files. This was tested with the attached "excludeFromBac
I had better performance in 0.7.03. This is a bit of a problem for me since my actual backups use just over 1800 patterns (generated from makefile clean rules among other things).
I have also attached a file "dupl.log" with the first and last 200 lines. The command line for this run was:
time duplicity full -v9 --exclude-filelist excludeFromBack
$PWD/JunkFromKi
All testing was on:
Ubuntu 15.10
duplicity 0.7.07.1 (from PPA)
Python 2.7.10
Target FS: local btrfs
Changed in duplicity:
importance: Undecided → Low
I don't know a lot about Python or how regexes work, but from a glance at globmatch.py, it looks like a glob is converted to a regex and compiled afresh every time we want to match it against something. That doesn't seem right: shouldn't we do the conversion once and precompile all the regexes? Unless there is caching going on that I can't see.
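As a rough sketch of the idea (this is not duplicity's actual code, and `fnmatch` globs don't have duplicity's `**` semantics; the function names here are made up for illustration), the translation and compilation can be memoized so each pattern is compiled once rather than once per file:

```python
import fnmatch
import re

# Cache of glob -> compiled regex, so translation/compilation happens
# once per pattern instead of once per (pattern, path) comparison.
_compiled_cache = {}

def glob_to_regex(glob):
    """Translate a glob to a compiled regex, memoizing the result."""
    try:
        return _compiled_cache[glob]
    except KeyError:
        regex = re.compile(fnmatch.translate(glob))
        _compiled_cache[glob] = regex
        return regex

def matches_any(path, globs):
    """Return True if path matches any of the exclude globs."""
    return any(glob_to_regex(g).match(path) for g in globs)

print(matches_any("/home/user/build/x.o", ["*.o", "/home/user/*"]))  # True
print(matches_any("/etc/passwd", ["*.o"]))                           # False
```

With a cache like this, the per-file cost drops to a dictionary lookup plus the regex match itself; whether that accounts for the full slowdown observed here would need profiling to confirm.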
It would be interesting to profile the code and see where the bulk of the time is being spent.
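One way to get that data without touching duplicity itself is Python's built-in cProfile. The snippet below profiles a synthetic matching loop (a stand-in workload, not duplicity) just to show the mechanics; against a real run you could instead launch `python -m cProfile -s cumtime` on the duplicity entry point:

```python
import cProfile
import pstats
import re

def match_all(paths, patterns):
    # Synthetic stand-in for the suspected hot loop: every path is
    # checked against every pattern.
    return [p for p in paths for pat in patterns if re.compile(pat).match(p)]

paths = ["/home/user/file%d.txt" % i for i in range(1000)]
patterns = [r"/home/user/.*\.o$"] * 20

profiler = cProfile.Profile()
profiler.enable()
match_all(paths, patterns)
profiler.disable()

# Print the five most expensive entries by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```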