Hang on first collection-status

Bug #394629 reported by Cristian Klein on 2009-07-02
This bug affects 1 person
Affects: Duplicity
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 0.6.02

Bug Description

Duplicity: 0.6.01
Python: 2.6.2+
OS: Ubuntu 9.04

When running duplicity 0.6.01 with "collection-status" for the first time, it hangs after displaying the status of the collection. No CPU nor network activity occurs while it is hanging. If interrupted, "collection-status" works normally the second time.

Is there any way to reproduce this? Does it happen all the time, or just sometimes?

I'm having a hard time replicating the problem.

Kenneth Loafman wrote:
> Is there any way to reproduce this? Does it happen all the time, or
> just sometimes?

Try „rm -rf ~/.cache/duplicity”, then run duplicity without
„--archive-dir” or „--name”.

Cristian KLEIN wrote:
> Try „rm -rf ~/.cache/duplicity”, then run duplicity without
> „--archive-dir” or „--name”.
No problem with that. It just recreates the dirs, does the resync, then
the collection status, no delay anywhere.

...Ken

Cristian Klein (cristiklein) wrote:

Kenneth Loafman wrote:
> No problem with that. It just recreates the dirs, does the resync, then
> the collection status, no delay anywhere.

Strange. I can reproduce it every time. At the end, I have three GnuPG
processes which are:

gpg --status-fd 4 --passphrase-fd 8 --logger-fd 3 --batch --no-tty
--no-secmem-warning --use-agent --decrypt
gpg --status-fd 4 --passphrase-fd 9 --logger-fd 3 --batch --no-tty
--no-secmem-warning --use-agent --decrypt
gpg --status-fd 4 --passphrase-fd 10 --logger-fd 3 --batch --no-tty
--no-secmem-warning --use-agent --decrypt

If I kill them by hand, duplicity resumes.

I also tried removing everything duplicity-related and installing it
from the PPA, but the problem persists.
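The symptom above, gpg children that only go away when killed, is consistent with a child process blocked writing to a pipe that the parent never drains. This is a generic illustration of that failure mode, not duplicity's actual code; the payload size and timeout are arbitrary:

```python
import subprocess
import sys

# Hypothetical stand-in for a gpg child: a process that writes more data
# than a typical OS pipe buffer holds (64 KiB on most Linux systems).
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write('x' * (1 << 20)); sys.stdout.flush()"],
    stdout=subprocess.PIPE,
)

# While the pipe is full and nobody reads from it, the child cannot exit.
try:
    child.wait(timeout=2)
    stuck = False
except subprocess.TimeoutExpired:
    stuck = True

# Draining the pipe to EOF lets the child finish normally, which is
# roughly what killing the gpg processes by hand forced to happen here.
data = child.stdout.read()
child.wait()
print("child was stuck:", stuck)
```

Killing the stuck children unblocks the parent the same way draining the pipe does, which matches the behaviour reported in this thread.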

Cristian KLEIN wrote:
> Strange. I can reproduce it every time. At the end, I have three GnuPG
> processes [...] If I kill them by hand, duplicity resumes.
> Strange. I can reproduce it every time. At the end, I have three GnuPG
> processes which are:
>
> gpg --status-fd 4 --passphrase-fd 8 --logger-fd 3 --batch --no-tty
> --no-secmem-warning --use-agent --decrypt
> gpg --status-fd 4 --passphrase-fd 9 --logger-fd 3 --batch --no-tty
> --no-secmem-warning --use-agent --decrypt
> gpg --status-fd 4 --passphrase-fd 10 --logger-fd 3 --batch --no-tty
> --no-secmem-warning --use-agent --decrypt
>
> If I kill them by hand, duplicity resumes.
>
> I also tried removing everything duplicity-related and installing it
> from the PPA, but the problem persists.

If you start off with a working archive, does it do the same thing? In
other words, if the first run fails and you rerun, does the second run
fail as well? Or, is it just the first run with an empty archive?

Cristian Klein (cristiklein) wrote:

Kenneth Loafman wrote:
> If you start off with a working archive, does it do the same thing? In
> other words, if the first run fails and you rerun, does the second run
> fail as well? Or, is it just the first run with an empty archive?

It's just the first run with an empty archive. The second time it runs
flawlessly.

Cristian KLEIN wrote:
> It's just the first run with an empty archive. The second time it runs
> flawlessly.

Here's a patch, no guarantees, but it will change the order of the
initial collection status and sync. It's possible you've encountered
some sort of race condition, but who knows.

Apply this to /usr/bin/duplicity, or wherever you have duplicity installed.

Cristian Klein (cristiklein) wrote:

Kenneth Loafman wrote:
> Here's a patch, no guarantees, but it will change the order of the
> initial collection status and sync. [...]
>
> ** Attachment added: "duplicity-coll-status-hang.patch"
> http://launchpadlibrarian.net/28656433/duplicity-coll-status-hang.patch
Sorry, it doesn't solve the problem. Same effect: three GPG processes
which, once killed, let duplicity resume.

Cristian KLEIN wrote:
> Sorry, it doesn't solve the problem. Same effect: three GPG processes
> which, once killed, let duplicity resume.

OK, will chase it some more. The 3 GPG processes are normal and pretty
much stay around through the whole run. When there are many more than
that, it's a problem.

Do you have any idea what the '+' represents on the end of Python 2.6.2+
that you use? Is it normal Python?

Cristian Klein (cristiklein) wrote:

Kenneth Loafman wrote:
> OK, will chase it some more. The 3 GPG processes are normal and pretty
> much stay around through the whole run. When there are many more than
> that, it's a problem.
>
> Do you have any idea what the '+' represents on the end of Python 2.6.2+
> that you use? Is it normal Python?

I used a Python version from the Debian repository which corrected some
bugs in 2.6.2, so the version returned was "2.6.2+". Now I am using the
Python shipped with Ubuntu 9.04, whose version is "2.6.2", so it can't be
Python's fault.

Could it perhaps be due to some backup-host-specific properties? I run a
„restricted” SFTP server on it. However, all other duplicity operations
run flawlessly.

Peter Schuller (scode) wrote :

I saw this earlier today too, the first time I ran collection-status after upgrading from 0.5.18 to 0.6.01 on Python 2.5 and FreeBSD.

For some reason I am unable to reproduce it again, even after nuking ~/.cache. Possibly because I have tacked on a couple of 0.6.01 produced backup increments onto the backups.

Peter Schuller wrote:
> I saw this earlier today too the first time I ran collection status
> after upgrading from 0.5.18 to 0.6.01 on Python 2.5 and FreeBSD.
>
> For some reason I am unable to reproduce it again, even after nuking
> ~/.cache. Possibly because I have tacked on a couple of 0.6.01 produced
> backup increments onto the backups.

It looks like something is waiting, an IO operation, a thread,
something. If we could reproduce it, we could find it.

In a cron environment, a hang like this could be a real problem. It
could hang for days without being noticed.
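The cron concern can be mitigated generically by giving the backup job a hard deadline. A minimal sketch, where the sleeping command is a stand-in for a real duplicity invocation and the timeout values are illustrative:

```python
import subprocess
import sys

# Stand-in for a long-running/hung backup command; a real cron job would
# invoke duplicity here instead.
cmd = [sys.executable, "-c", "import time; time.sleep(60)"]

try:
    # A hard deadline: if the job hangs, kill it and report, instead of
    # silently blocking the cron slot for days.
    subprocess.run(cmd, timeout=2)
    timed_out = False
except subprocess.TimeoutExpired:
    timed_out = True  # in a real job: log and alert here

print("timed out:", timed_out)
```

subprocess.run() kills the child before raising TimeoutExpired, so a hung job cannot linger past the deadline.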

Peter Schuller (scode) wrote :

For the record, I was able to trigger it again with my backup to rsync.net (the other one I tried was my S3 backup). So whatever the problem is, it definitely crosses backends.

As per the previous discussion, I too saw a GPG child process. Hitting Ctrl-C yielded:

gpg: signal Interrupt caught ... exiting
Exception exceptions.KeyboardInterrupt in <module 'threading' from '/usr/local/lib/python2.5/threading.pyc'> ignored

I'll see if I can indeed reproduce it reliably and hopefully debug it.

Peter Schuller (scode) wrote :

After inserting a printout on launching gpg (the '!!!...' lines) and one on the wait() in GPGFile.close(), and comparing with pstree output, it looks like it is the GPG process from the *FIRST* downloaded file, in my case, that hangs around: the BEGIN WAIT/END WAIT pair for that one is never seen:

Synchronizing remote metadata to local cache...
Copying duplicity-full-signatures.20090608T005130Z.sigtar to local cache.
Using temporary directory /tmp/duplicity-egi7Lz-tempdir
Running 'sftp -oServerAliveInterval=15 -oServerAliveCountMax=1 USER@HOST' (attempt #1)
sftp command: 'get home_scode/duplicity-full-signatures.20090608T005130Z.sigtar.gpg /tmp/duplicity-egi7Lz-tempdir/mktemp-uv8C56-1'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Copying duplicity-full.20090407T000822Z.manifest to local cache.
Running 'sftp -oServerAliveInterval=15 -oServerAliveCountMax=1 USER@HOST' (attempt #1)
sftp command: 'get home_scode/duplicity-full.20090407T000822Z.manifest.gpg /tmp/duplicity-egi7Lz-tempdir/mktemp-yytp5I-2'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
BEGIN WAIT
END WAIT
Deleting /tmp/duplicity-egi7Lz-tempdir/mktemp-yytp5I-2
Copying duplicity-full.20090509T001249Z.manifest to local cache.
Running 'sftp -oServerAliveInterval=15 -oServerAliveCountMax=1 USER@HOST' (attempt #1)
sftp command: 'get home_scode/duplicity-full.20090509T001249Z.manifest.gpg /tmp/duplicity-egi7Lz-tempdir/mktemp-p36gUr-3'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
BEGIN WAIT
END WAIT
.....

The first file is also never deleted from temp. This may be related. The close of the gpg process is, in this case, supposed to happen as a result of SrcIter's next() method in copy_to_local() in duplicity-bin. It gets closed when you reach EOF on the file being decrypted, right before StopIteration is raised.

The file is probably not being read to EOF, which might also explain why the temp file is not deleted. The would-be "Deleted ..." message comes from path.py and, when actually called, is called via this path:

Traceback (most recent call last):
  File "./duplicity-bin", line 1146, in <module>
    with_tempdir(main)
  File "./duplicity-bin", line 1139, in with_tempdir
    fn()
  File "./duplicity-bin", line 1044, in main
    sync_archive(col_stats)
  File "./duplicity-bin", line 895, in sync_archive
    copy_to_local(fn)
  File "./duplicity-bin", line 865, in copy_to_local
    copy_raw(src_iter, globals.archive_dir.append(loc_name).name)
  File "./duplicity-bin", line 811, in copy_raw
    data = src_iter.next(block_size).data
  File "./duplicity-bin", line 843, in next
    self.fileobj.close()
  File "/usr/home/scode/bzr/duplicity/trunk/duplicity/dup_temp.py", line 214, in close
    hook()
  File "/usr/home/scode/bzr/duplicity/trunk/duplicity/dup_temp.py", line 107, in delete
    path.DupPath.delete(self)
  File "/usr/home/scode/bzr/duplicity/trunk/duplicity/path.py", line 537, in delete
    raise AssertionError("keffo")

(Moving to debugging another bug for a while, so I wanted to dump this in the meantime before I forget the details.)
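The close-on-EOF pattern described in this analysis can be sketched as follows. The class is a hypothetical stand-in for duplicity's SrcIter, with illustrative names; the point is that cleanup only runs if the consumer reads all the way to EOF:

```python
import io

class SrcIterSketch:
    """Hypothetical stand-in for duplicity's SrcIter (names illustrative)."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.closed = False

    def next(self, block_size):
        data = self.fileobj.read(block_size)
        if not data:
            # EOF reached: only now is the source closed (in duplicity this
            # is where the decrypting gpg child would finally be waited on).
            self.fileobj.close()
            self.closed = True
            raise StopIteration
        return data

src = SrcIterSketch(io.BytesIO(b"abcdef"))
chunks = []
try:
    while True:
        chunks.append(src.next(4))
except StopIteration:
    pass

# A consumer that stopped before EOF would leave src.closed == False,
# and the gpg child would never be reaped.
print(b"".join(chunks), src.closed)
```

This is consistent with the symptom: if the copy loop stops reading early, close() never fires, the temp file is never deleted, and the gpg process lingers.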

It also hangs on cleanup, remove-older-than, and essentially any first use of a new archive dir.

Changed in duplicity:
importance: Undecided → High
status: New → Confirmed
milestone: none → 0.6.02

I just pushed the fix to this bug to the trunk. Thanks Peter for the initial analysis. The bug was caused by the early out in GPG/GZIPWriteFile() that stops at a certain volume size. I had removed the size = sys.maxint args in to_remote() and to_final() after the change to block size. My mistake. GZipWriteFile() was taking an early out at 5MB and never finishing the iteration.
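The early-out can be illustrated with a toy copy loop (this is not duplicity's actual GzipWriteFile; the sizes are shrunk for illustration). If the `size` cap is hit before EOF, the source is never fully consumed, so its gpg child never finishes; an effectively unlimited size restores read-to-EOF behaviour:

```python
import io

def write_file(src, dst, size, block_size=4):
    """Toy size-capped copy loop (illustrative, not duplicity's code)."""
    written = 0
    while written < size:
        data = src.read(min(block_size, size - written))
        if not data:
            break  # true EOF: the source was fully consumed
        dst.write(data)
        written += len(data)
    return written

payload = b"x" * 10

# The bug: the cap stops the copy early, so the source never reaches EOF.
short = write_file(io.BytesIO(payload), io.BytesIO(), size=5)

# The fix: an effectively unlimited size (sys.maxint in the original
# Python 2 code) makes the loop run until EOF.
full = write_file(io.BytesIO(payload), io.BytesIO(), size=float("inf"))

print(short, full)
```

In the real code the cap was the 5MB volume size; reinstating the unlimited size in to_remote() and to_final() is what lets the iteration finish.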

Changed in duplicity:
status: Confirmed → Fix Committed
Cristian Klein (cristiklein) wrote:

Kenneth Loafman wrote:
> I just pushed the fix to this bug to the trunk. [...] GZipWriteFile()
> was taking an early out at 5MB and never finishing the iteration.
>
> ** Changed in: duplicity
> Status: Confirmed => Fix Committed

Hello,

I'm sorry to tell you, but I still experience this bug, even with the
latest duplicity from bzr #553. I made sure that all previous traces of
duplicity were erased and that the patch in dup_temp.py was correctly
applied.

Cristian KLEIN wrote:
> I'm sorry to tell you, but I still experience this bug, even with the
> latest duplicity from bzr #553. [...]

I just pushed another release with, I hope, the final fix. Turns out
that this one was not findable with a debugger; in fact, it would pass
completely when run under a debugger. Aargh!

If you could please test this out in your environment, I could get a
release out soon.

Cristian Klein (cristiklein) wrote:

Kenneth Loafman wrote:
> I just pushed another release with, I hope, the final fix. [...]
>
> If you could please test this out in your environment, I could get a
> release out soon.

Good work! I downloaded the latest bzr and the hang does not occur any
more. Thank you for your patience!

Changed in duplicity:
status: Fix Committed → Fix Released