Slow exclude pattern matching

Bug #1576389 reported by Arthur Peters
This bug affects 5 people.

Affects: Duplicity
Status: New
Importance: Low
Assigned to: Unassigned

Bug Description

Large exclude files cause duplicity to scan files very slowly. My suspicion is that the time is spent checking each file against the exclude list. The problem is particularly noticeable if the exclude file includes the full prefix of every file. For example:
/home/user/data/something/excluded.dat
would be slower than
**/something/excluded.dat

I collected some timing data, and it appears that the time spent increases rapidly, but linearly, with the number of exclude patterns.
For example, with 0 patterns a backup of ~1700 files took 14s; with 40 patterns it took 72s. Based on this and the rest of the data on this backup set, every pattern adds about 1.5s to the time taken. This was tested with prefixes of the attached "excludeFromBackup-HDD.lst".

Removing the long prefixes and replacing them with ** shows a significant improvement, but there is still a noticeable slowdown, in this case about 0.1s per pattern. However, I suspect the difference would grow with more input files. This was tested with the attached "excludeFromBackup-HDD2.lst".

Performance was better in 0.7.03. This is a bit of a problem for me, since my actual backups use just over 1800 patterns (generated from makefile clean rules, among other things).

I have also attached a file "dupl.log" containing the first and last 200 lines of the log. The command line for this run was:
time duplicity full -v9 --exclude-filelist excludeFromBackup-HDD.lst $PWD/JunkFromKitteh/ file://$HOME/tmp/tempbackup/

$PWD/JunkFromKitteh/ contains the ~1700 files I tested with above.

All testing was on:
Ubuntu 15.10
duplicity 0.7.07.1 (from PPA)
Python 2.7.10
Target FS: local btrfs

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote :

I don't know a lot about Python or how its regexes work, but from a glance at globmatch.py, it looks like a glob is converted to a regex and compiled afresh every time we want to match it against something. That doesn't seem right: shouldn't we just do the conversion once and precompile all the regexes? Unless there is caching going on that I can't see.

It would be interesting to profile the code and see where the bulk of the time is being spent.
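
The difference Nate describes can be sketched as follows. This is not duplicity's actual code (duplicity has its own glob-to-regex translation in globmatch.py); `fnmatch.translate` is used here only as a stand-in converter, and the pattern list and function names are hypothetical:

```python
import fnmatch
import re

patterns = ["**/something/excluded.dat", "**/*.o"]

# Slow approach: translate the glob to a regex and compile it on every
# single match call, once per (pattern, file) pair.
def matches_slow(path):
    return any(re.match(fnmatch.translate(p), path) for p in patterns)

# Fast approach: translate and compile each glob exactly once, up front,
# then reuse the compiled regex objects for every file checked.
compiled = [re.compile(fnmatch.translate(p)) for p in patterns]

def matches_fast(path):
    return any(r.match(path) for r in compiled)
```

Both functions are still O(number of patterns) per file; precompiling only removes the repeated translation and compilation cost, which is what profiling would confirm or rule out.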

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote :

Oh, it looks like maybe 0.8 will do exactly that: create a "selection function" for each include/exclude pattern that consists of a match against a precompiled regex.

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

It takes about 2 seconds to process each file on my system, so it is not usable:

21:15:38.687856 lstat("/home/jkratoch/go/pkg/mod/cache/vcs/f48ede8dfaa35d7c678de93b2f8c2f0b065f00bc45dae12fbe0acf9c4359d518/objects/5b/5013cbedca3d5205478899e978eaca32a0f974", {st_mode=S_IFREG|0444, st_size=1042, ...}) = 0
21:15:41.111780 access("/home/jkratoch/go/pkg/mod/cache/vcs/f48ede8dfaa35d7c678de93b2f8c2f0b065f00bc45dae12fbe0acf9c4359d518/objects/5b/5013cbedca3d5205478899e978eaca32a0f974", R_OK) = 0
...
21:15:41.115985 lstat("/home/jkratoch/go/pkg/mod/cache/vcs/f48ede8dfaa35d7c678de93b2f8c2f0b065f00bc45dae12fbe0acf9c4359d518/objects/63", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
21:15:43.595162 access("/home/jkratoch/go/pkg/mod/cache/vcs/f48ede8dfaa35d7c678de93b2f8c2f0b065f00bc45dae12fbe0acf9c4359d518/objects/63", R_OK) = 0

My exclude list has 549706 lines/entries, generated by the command below, since it makes no sense to back up files that are unmodified from the system installation:

rpm -Vva 2>/dev/null|perl -ne 's/^[.]{7}[.T][.] . // or next;chomp;use Cwd qw(realpath);print "- ".realpath($_)."\n" if -f;' >that-exclude-file

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

Forgot to state version: duplicity-0.7.19-1.fc30.x86_64
I will try: duplicity-0.8.05-1.fc31

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

0.8.05 is only a bit faster but still unusably slow:
     0.000050 lstat("/etc/NetworkManager/system-connections/AMS2.ovpn", {st_mode=S_IFREG|0600, st_size=476, ...}) = 0
     1.466195 access("/etc/NetworkManager/system-connections/AMS2.ovpn", R_OK) = 0
     0.000236 lstat("/etc/NetworkManager/system-connections/BRQ.ovpn", {st_mode=S_IFREG|0600, st_size=429, ...}) = 0
     1.464853 access("/etc/NetworkManager/system-connections/BRQ.ovpn", R_OK) = 0

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

So I wrote a reduction script; it reduced the --exclude-filelist entries on my system from 607481 to 42169.
It still takes 0.12s to evaluate each file, which would mean 11+ days to do the backup.
Still much better than the half a year (with no system restart possible) it would have taken before!
https://git.jankratochvil.net/?p=nethome.git;a=blob_plain;f=bin/rpmsafe
https://git.jankratochvil.net/?p=nethome.git;a=blob_plain;f=bin/rpmsafereduce

Revision history for this message
Nate Eldredge (nate-thatsmathematics) wrote : Re: [Bug 1576389] Re: Slow exclude pattern matching

On Mon, 4 Nov 2019, Jan Kratochvil wrote:

> So I wrote a reduction script; it reduced the --exclude-filelist entries on my system from 607481 to 42169.
> It still takes 0.12s to evaluate each file, which would mean 11+ days to do the backup.
> Still much better than the half a year (with no system restart possible) it would have taken before!
> https://git.jankratochvil.net/?p=nethome.git;a=blob_plain;f=bin/rpmsafe
> https://git.jankratochvil.net/?p=nethome.git;a=blob_plain;f=bin/rpmsafereduce

Your use case seems like something way beyond what the exclude feature was
designed to handle. In particular, I think it's rather inherently O(n*m),
where n is the number of exclude entries and m is the number of files to
be checked.

You can probably achieve something pretty close by just excluding a few
key directories. For example, typically /usr contains only files from the
package manager (with the exception of /usr/local), so you could just have
an exclude filelist containing

     + /usr/local
     /usr

and that should cover 95% of what you want to exclude.

If you really must exclude every single file, then since you don't really
need glob matching, you could hack a separate exclude mechanism into
duplicity; for instance, slurp your whole list into a Python set at
startup, and check each file against the set using `in'. The performance
cost of that should be negligible, and more like O(n+m) asymptotically.
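
The set-based approach Nate suggests can be sketched like this (hypothetical helper names, not duplicity code):

```python
def load_excludes(lines):
    # One pass over the exclude list. Set membership is O(1) on average,
    # so checking m files against n excludes costs O(n + m), not O(n * m).
    return {line.rstrip("\n") for line in lines if line.strip()}

def is_excluded(path, excludes):
    # Exact-path lookup only: no glob semantics, no prefix matching.
    return path in excludes
```

The trade-off is that this handles only exact paths, which is all the rpm-generated list needs; any real glob patterns would still have to go through the normal matching path.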

--
Nate Eldredge
<email address hidden>

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

"something pretty close" + "should cover 95%" - I do not want to lose some my modified or added parts of my system, I have many custom additions+modifications of my system. This is what a backup is there for.

"you could hack" - I could if duplicity was written in some language acceptable for me. Nowadays I would prefer C++11 (or higher), for some hacks I still use Perl (although I agree it is also very horrible language although not as much horrible as Python is). Yes, that is my problem.

"need glob matching" - the matching could be done effectively even for globs (except any part after "**") as the matching should be decomposed into a memory tree structure corresponding to the disk tree structure. But that all requires coding in Python.

BTW, one backup finished after "just" a few hours; my time estimate was somehow wrong. So my solution looks to be good enough for me.

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

Forgot to state that this is really a regression.
In duplicity-0.6.25-3.fc21.x86_64 it was all perfectly fast; 0.7.x was already slow.
I was running 0.6.25 until now with some rebuilds, but Fedora 31 has dropped Python 2 support, so that is no longer possible.

Changed in duplicity:
importance: Undecided → Low