Ubuntu

Interrupting commit can mark files as removed

Reported by Gareth White on 2009-12-10
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
High
John A Meinel
2.0
Medium
John A Meinel
bzr (Ubuntu)
Undecided
Unassigned

Bug Description

When you interrupt a commit to a smart server using ctrl-C, occasionally it will remove a directory of files and commit the change. (This may also happen without using a smart server but I haven't tested that.) As with bug 495011 this is difficult to reproduce as it depends on the timing.

See the attached log for more detail. Note that the log doesn't seem to mention the interrupt. In the process output you get something like:

Committing to: bzr://localhost/testrepo/branch1/
Exception exceptions.KeyboardInterrupt in 'bzrlib._dirstate_helpers_pyx.ProcessEntryC._update_current_block' ignored
missing dir1
modified dir1
missing dir1/file1.txt
modified dir1/file1.txt
missing dir1/file2.txt
modified dir1/file2.txt
missing dir1/file3.txt
modified dir1/file3.txt
Committed revision 3153.

I've reproduced this with Bazaar 2.0.1 and 2.1.0b3 using Python 2.5 on Windows XP SP3.

To reproduce:
Set up a branch with a few thousand files at .\testrepo\branch1
start bzr serve --allow-writes
bzr checkout --lightweight bzr://localhost/testrepo/branch1
cd branch1

Now repeat the following and eventually you'll see it remove a random directory of files:
bzr commit -m "test" --unchanged
Hit ctrl-C sometime after the "Committing to:" message appears.

Gareth White (gwhite) wrote :
Martin Pool (mbp) wrote :

'Remove' in the sense of 'deletes them from the working tree'? And this can really be reproduced by commits that don't change anything?

Changed in bzr:
status: New → Confirmed
importance: Undecided → Medium
importance: Medium → High
Gareth White (gwhite) wrote :

It removes them from version control but leaves them in the working tree. It's as if you did "bzr remove --keep <some random dir>". (Or, as if bzr thought a directory was missing on disk when it wasn't and therefore it decided to remove it during the commit.)

It can be reproduced by doing a commit that doesn't (shouldn't) change anything, as in the original description.

I've also managed to reproduce it on my OS X machine now as well. The two important factors are that:
1. The branch has a lot of files and directories in it (e.g. 10000 files, 1000 dirs). I reproduced it by putting 10 copies of the bzr source into the same branch.
2. You need to interrupt it at the correct time. This is sometime after it says "Committing to:", but *before* it displays the progress bar. If you try this a few times you'll see the message about "KeyboardInterrupt in 'bzrlib._dirstate_helpers_pyx". Repeat a few more times and eventually you'll see it remove a directory of files.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> It removes them from version control but leaves them in the working
> tree. It's as if you did "bzr remove --keep <some random dir>". (Or, as
> if bzr thought a directory was missing on disk when it wasn't and
> therefore it decided to remove it during the commit.)
>
> It can be reproduced by doing a commit that doesn't (shouldn't) change
> anything, as in the original description.
>
> I've also managed to reproduce it on my OS X machine now as well. The two important factors are that:
> 1. The branch has a lot of files and directories in it (e.g. 10000 files, 1000 dirs). I reproduced it by putting 10 copies of the bzr source into the same branch.
> 2. You need to interrupt it at the correct time. This is sometime after it says "Committing to:", but *before* it displays the progress bar.
>If you try this a few times you'll see the message about
"KeyboardInterrupt in 'bzrlib._dirstate_helpers_pyx".
>Repeat a few more times and eventually you'll see it remove a directory
of files.

^- This specifically sounds like we are suppressing the exception
somewhere, and we shouldn't be. And so we end up not processing a
directory because of the exception, but the calling code isn't
propagating the exception and so it keeps going even though we've
omitted a dir.

Do you have any more context for this?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksw7IEACgkQJdeBCYSNAANxMQCg1IxbEHCx4P0k5siWM8FmYQrk
Dx8An3Tep/zP7SeFHrGtP5/Ad4E4NmdU
=L4vH
-----END PGP SIGNATURE-----

Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
[...]
>
> ^- This specifically sounds like we are suppressing the exception
> somewhere, and we shouldn't be. And so we end up not processing a
> directory because of the exception, but the calling code isn't
> propagating the exception and so it keeps going even though we've
> omitted a dir.

I agree. I've just taken a quick look through the error handling in
_dirstate_helpers_pyx.pyx and _readdir_pyx.pyx (both native “except”
blocks and the checking of return values of various C calls), but I
haven't spotted any obvious problems.

But maybe the way _readdir_pyx handles EINTR from readdir isn't quite
right. I'm not sure, but I think we may need to call PyErr_CheckSignals
if we get EINTR, otherwise the signal that just arrived (potentially
SIGINT, i.e. KeyboardInterrupt) may not be noticed until some arbitrary
point later, causing KeyboardInterrupt at strange times? I've attached
a patch to _readdir_pyx.pyx that tries this.

-Andrew.

2009/12/23 Andrew Bennetts <email address hidden>:
> But maybe the way _readdir_pyx handles EINTR from readdir isn't quite
> right.  I'm not sure, but I think we may need to call PyErr_CheckSignals
> if we get EINTR, otherwise the signal that just arrived (potentially
> SIGINT, i.e. KeyboardInterrupt) may not be noticed until some arbitrary
> point later, causing KeyboardInterrupt at strange times?  I've attached
> a patch to _readdir_pyx.pyx that tries this.

Well spotted.

So if a sigint arrives during the read, the OS will run Python's
signal handler, which does Py_AddPendingCall, and it's possible
unhappiness will result if that stalls for too long.

http://docs.python.org/c-api/exceptions.html#PyErr_CheckSignals

also of interest

btw http://manpages.ubuntu.com/manpages/lucid/en/man3/siginterrupt.3posix.html

in your patch, I think we need to check the return code of
PyErr_CheckSignals and return - or will pyrex automatically take care
of that if an exception is raised?

Strictly speaking EAGAIN doesn't need this handling but it's probably harmless.

It looks like we actually need similar handling in every case where
this code is turning errno into an OSError or similar.
PyErr_CheckSignals is implicitly called by PyErr_SetFromErrno but
we're not doing that. Alternatively we could delete the manual
'raise' calls and just do PyErr_SetFromErrno but that would leave out
the messages. Perhaps we should add a common function to do this.

--
Martin <http://launchpad.net/~mbp/>

> in your patch, I think we need to check the return code of
> PyErr_CheckSignals and return - or will pyrex automatically take care
> of that if an exception is raised?

Pyrex will take care of that, that's what the 'except -1' in the import
declaration for PyErr_CheckSignals does.

> It looks like we actually need similar handling in every case where
> this code is turning errno into an OSError or similar.

Yeah. Or at least for every case where errno might be EINTR or
otherwise caused by the arrival of a signal... I'm not sure if there are
any?. A common function sounds like a good idea.

2009/12/23 Andrew Bennetts <email address hidden>:
> Yeah.  Or at least for every case where errno might be EINTR or
> otherwise caused by the arrival of a signal... I'm not sure if there are
> any?.  A common function sounds like a good idea.

There are some functions classified as 'fast' (if that's the right
word) that will always finish without being interrupted. time(2) is
an obvious example. However it's harmless to check and probably
simpler to just do it everywhere.

--
Martin <http://launchpad.net/~mbp/>

OK. The linked branch checks for EINTR consistently in _readdir_pyx, using a common function. The _dirstate_helpers_pyx extension doesn't seem to have any problematic raises like that (it doesn't do much (any?) IO directly, it calls into regular Python).

Unfortunately, I don't think this fixes the reporter's problem as he's on Windows XP. I see some questionable error handling in _walkdirs_win32.pyx (e.g. the TODO comment), but given how differently Ctrl-C works on Windows I don't really know if any of this is likely to be related.

Martin Pool (mbp) wrote :

That message

> Exception exceptions.KeyboardInterrupt in 'bzrlib._dirstate_helpers_pyx.ProcessEntryC._update_current_block' ignored

seems to indicate something is calling PyErr_WriteUnraisable, and there are not so many callers of that. If someone can reproduce this problem under a C debugger with an exception set there it may be obvious. Grepping for its callers I can't see one that seems likely to raise this message.

Gareth White (gwhite) wrote :
Download full text (21.6 KiB)

I've been able to reproduce this when running within gdb on OS X. This was using bzr 2.0.3/python 2.5.4. Let me know if you need any more info and I can try it again.

(gdb) r
Starting program: /usr/local/dbg/bin/python /Users/garethw/bzr/bin/bzr commit -m test --unchanged
Committing to: bzr://localhost/repo2/branch9/
^C
Program received signal SIGINT, Interrupt.
0x0004520c in int_new (type=0x17ef70, args=0x25fe250, kwds=0x0) at Objects/intobject.c:1018
1018 }
(gdb) bt
#0 0x0004520c in int_new (type=0x17ef70, args=0x25fe250, kwds=0x0) at Objects/intobject.c:1018
#1 0x0001004c in PyObject_Call (func=0x17ef70, arg=0x25fe250, kw=0x0) at Objects/abstract.c:1861
#2 0x000ef1fc in PyEval_CallObjectWithKeywords (func=0x17ef70, arg=0x25fe250, kw=0x0) at Python/ceval.c:3481
#3 0x0000ffd0 in PyObject_CallObject (o=0x17ef70, a=0x25fe250) at Objects/abstract.c:1852
#4 0x01225b3c in __pyx_f_6bzrlib_21_dirstate_helpers_pyx__pack_stat [inlined] () at bzrlib/_dirstate_helpers_pyx.c:2486
#5 0x01225b3c in __pyx_f_6bzrlib_21_dirstate_helpers_pyx__update_entry (__pyx_v_self=0x138f430, __pyx_v_entry=0x202d580, __pyx_v_abspath=0x13c2ac0, __pyx_v_stat_value=0x14a8770) at bzrlib/_dirstate_helpers_pyx.c:2649
#6 0x0122bd18 in __pyx_f_6bzrlib_21_dirstate_helpers_pyx_13ProcessEntryC__process_entry (__pyx_v_self=0x13c8e90, __pyx_v_entry=0x202d580, __pyx_v_path_info=0x14ad030) at bzrlib/_dirstate_helpers_pyx.c:3679
#7 0x0121abf0 in __pyx_f_6bzrlib_21_dirstate_helpers_pyx_13ProcessEntryC__loop_one_block (__pyx_v_self=0x13c8e90) at bzrlib/_dirstate_helpers_pyx.c:7094
#8 0x01218098 in __pyx_f_6bzrlib_21_dirstate_helpers_pyx_13ProcessEntryC__iter_next (__pyx_v_self=0x13c8e90) at bzrlib/_dirstate_helpers_pyx.c:6679
#9 0x01203dac in __pyx_f_6bzrlib_21_dirstate_helpers_pyx_13ProcessEntryC___next__ (__pyx_v_self=0x13c8e90) at bzrlib/_dirstate_helpers_pyx.c:5450
#10 0x000eaba0 in PyEval_EvalFrameEx (f=0x2120120, throwflag=0) at Python/ceval.c:2195
#11 0x0002f8d0 in gen_send_ex (gen=0x25eeb20, arg=0x0, exc=0) at Objects/genobject.c:82
#12 0x0003036c in gen_iternext (gen=0x25eeb20) at Objects/genobject.c:287
#13 0x000eaba0 in PyEval_EvalFrameEx (f=0x9dc210, throwflag=0) at Python/ceval.c:2195
#14 0x0002f8d0 in gen_send_ex (gen=0x25eeb48, arg=0x0, exc=0) at Objects/genobject.c:82
#15 0x0003036c in gen_iternext (gen=0x25eeb48) at Objects/genobject.c:287
#16 0x000eaba0 in PyEval_EvalFrameEx (f=0x11c3680, throwflag=0) at Python/ceval.c:2195
#17 0x000f0014 in fast_function (func=0x13ac6b0, pp_stack=0xbfffc6d0, n=1, na=1, nk=0) at Python/ceval.c:3698
#18 0x000efd1c in call_function (pp_stack=0xbfffc6d0, oparg=0) at Python/ceval.c:3633
#19 0x000eb050 in PyEval_EvalFrameEx (f=0x118a010, throwflag=0) at Python/ceval.c:2304
#20 0x000ed1ac in PyEval_EvalCodeEx (co=0x138eb60, globals=0x1390540, locals=0x0, args=0x13a90dc, argcount=2, kws=0x1173710, kwcount=11, defs=0x1331e74, defcount=18, closure=0x0) at Python/ceval.c:2875
#21 0x00040a0c in function_call (func=0x13ac3f0, arg=0x13a90d0, kw=0x...

Andrew Bennetts (spiv) wrote :

Thanks for reproducing under OS X. I suspect that OS X and other POSIX-like platforms will be fixed by my patch. I'm landing the probable fix for OS X etc on the 2.0 branch now.

Unfortunately I think there must be a separate cause on Windows, although presumably it is a similar sort of bug.

Gareth White (gwhite) wrote :

Hi Andrew,

I'm afraid I can still reproduce the issue on OS X even with your patch. So there must be something else contributing to the problem here.

I noticed that in _dirstate_helpers_pyx.c it appears that __pyx_f_6bzrlib_21_dirstate_helpers_pyx_13ProcessEntryC__update_current_block() is not propagating any exceptions that occur - instead it's calling __Pyx_WriteUnraisable() and returning.

According to http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/version/Doc/Manual/basics.html#mozTocId482761 this is the expected behaviour for "cdef void" functions unless you declare them with "except *".

Could this help explain the behaviour I'm seeing?

(Disclaimer: I'm a complete newbie when it comes to Pyrex so if I've just said something completely stupid or obvious please let me know. I'm not trying to fix this bug, I'm just curious to learn how this works.)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> Hi Andrew,
>
> I'm afraid I can still reproduce the issue on OS X even with your patch.
> So there must be something else contributing to the problem here.
>
> I noticed that in _dirstate_helpers_pyx.c it appears that
> __pyx_f_6bzrlib_21_dirstate_helpers_pyx_13ProcessEntryC__update_current_block()
> is not propagating any exceptions that occur - instead it's calling
> __Pyx_WriteUnraisable() and returning.
>
> According to
> http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/version/Doc/Manual/basics.html#mozTocId482761
> this is the expected behaviour for "cdef void" functions unless you
> declare them with "except *".
>
> Could this help explain the behaviour I'm seeing?

I would think so. Specifically, this code path is the 'walk to next
directory' function. If an exception were raised, then likely the
current block pointer would get out of sync.

My guess is that the filesystem pointer would be at the next directory,
but the internal walker would still be at the current dir.

Going further, now that they are out of sync, the internal walker will
think that all of these directories have been removed, since they are
not found. (we walk the filesystem and internal pointer in lock-step,
and can tell if something has gone missing, or was added, etc.)

My guess is that we are seeing this happen in this particular location,
because 'osutils.is_inside()' is a pure-python function. And thus the
existing exception is 'trapped' and not raised until we get to this
point, at which point we see KeyboardInterrupt get raised, and it
confuses us here.

It doesn't look like the function has to be 'void', we could just do:

=== modified file 'bzrlib/_dirstate_helpers_pyx.pyx'
- --- bzrlib/_dirstate_helpers_pyx.pyx 2009-10-02 05:43:41 +0000
+++ bzrlib/_dirstate_helpers_pyx.pyx 2010-01-04 18:37:21 +0000
@@ -1392,7 +1392,7 @@
             # provide.
             self.search_specific_file_parents.add('')

- - cdef void _update_current_block(self):
+ cdef int _update_current_block(self) except -1:
         if (self.block_index < len(self.state._dirblocks) and
             osutils.is_inside(self.current_root,
self.state._dirblocks[self.block_index][0])):
             self.current_block = self.state._dirblocks[self.block_index]
@@ -1401,6 +1401,7 @@
         else:
             self.current_block = None
             self.current_block_list = None
+ return 0

     def __next__(self):
         # Simple thunk to allow tail recursion without pyrex confusion

And then the exception should get propagated, the walkers won't get out
of sync, and instead it will just cancel the current operation.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktCNYgACgkQJdeBCYSNAANqIgCeJlMizlBoRT2HNsHUZ+nLSXfi
0IAAn2SK+Pjr26TorP76dg55vORnmfyu
=GQZd
-----END PGP SIGNATURE-----

This would explain why the error can occur on Windows too. Nice.

John A Meinel (jameinel) on 2010-01-04
Changed in bzr:
status: Confirmed → Fix Committed
summary: - Interrupting commit to smart server sometimes removes files
+ Interrupting commit can mark files as removed
Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.1.0rc1
Gareth White (gwhite) wrote :

I can no longer reproduce the problem with your fix, so it does appear to be resolved now. Thanks!

John A Meinel (jameinel) on 2010-01-06
Changed in bzr:
status: Fix Committed → Fix Released
James Westby (james-w) on 2010-05-13
Changed in bzr (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments