Comment 4 for bug 899961

Michael Tokarev (mjt+launchpad-tls) wrote :

I tried bisecting it again today. I think you meant 145e11e840500e04a4d0a624918bb17596be19e9 as the "M" point (the large merge), not b195043003d90ea4027ea01cc7a6c974ac915108 (which is the first commit in that merge) ;) Actually I had done the same already, I just didn't post it to the bug report. But anyway.

The first bad commit inside the merge which shows the bad behaviour is:

commit 68d100e905453ebbeea8e915f4f18a2bd4339fe8
Author: Kevin Wolf <email address hidden>
Date: Thu Jun 30 17:42:09 2011 +0200

    qcow2: Use coroutines

Note: so far the issue happens only when qcow2 is in use, either directly or indirectly via -snapshot.

Also note that it only happens within the kvm tree, i.e.:

 git checkout 68d100e905453ebbeea8e915f4f18a2bd4339fe8
 git merge --no-commit 145e11e840500e04a4d0a624918bb17596be19e9^ # the pre-merge point

I tried to debug it further but without much success.

I'm attaching an strace of a build from the above source (with a few debug fprintf(stderr)s added to the qcow2 source around the lock/unlock calls) running a winXP guest, from the point where I hit the "Reboot" button up to the point where it stalls. Search for "qcow2: mutex_unlock" from the _end_ of the file.

What I also observed: it looks like it merely loses an interrupt somewhere. It is enough to Ctrl+Z/fg it, or to attach/detach strace to the process, for it to get "unstuck" and continue executing. Attaching gdb also makes it unstuck, as demonstrated initially.

This qcow2 commit may actually be innocent: it just started using coroutines, and that's where the bug/problem might be. Say, a 64bit kernel gets confused by the process switching stacks, for example? I dunno.

Note again that switching to an alternative coroutine implementation makes the issue go away. I.e., it only happens with the makecontext & Co coroutine implementation, and the commit which git bisect found to be "guilty" is the one which starts using coroutines.
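
For anyone not familiar with what "switching stack" means here, below is a minimal sketch of a makecontext/swapcontext based coroutine. It is not the qemu code, just an illustration of the technique under my own assumptions: the names run_coroutine_demo, co_entry and CO_STACK_SIZE are made up for this example, and the 64 KiB stack size is arbitrary.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define CO_STACK_SIZE (64 * 1024)   /* arbitrary size chosen for this sketch */

static ucontext_t main_ctx;   /* saved context of the caller */
static ucontext_t co_ctx;     /* saved context of the coroutine */
static int co_progress;       /* number of steps the coroutine has run */

/* Coroutine body: runs on its own heap-allocated stack and yields
 * back to the caller after every step. */
static void co_entry(void)
{
    for (int i = 0; i < 3; i++) {
        co_progress++;
        swapcontext(&co_ctx, &main_ctx);   /* yield to the caller */
    }
    /* Returning from here would continue in uc_link (main_ctx). */
}

/* Hypothetical demo driver: returns how many times the coroutine was
 * entered or resumed, or -1 on setup failure. */
int run_coroutine_demo(void)
{
    char *stack = malloc(CO_STACK_SIZE);
    if (stack == NULL) {
        return -1;
    }
    co_progress = 0;

    if (getcontext(&co_ctx) != 0) {
        free(stack);
        return -1;
    }
    co_ctx.uc_stack.ss_sp = stack;          /* the coroutine's private stack */
    co_ctx.uc_stack.ss_size = CO_STACK_SIZE;
    co_ctx.uc_link = &main_ctx;             /* where to go if co_entry returns */
    makecontext(&co_ctx, co_entry, 0);

    int resumes = 0;
    while (co_progress < 3) {
        swapcontext(&main_ctx, &co_ctx);    /* enter or resume the coroutine */
        resumes++;
        assert(co_progress == resumes);     /* exactly one step per resume */
    }

    /* The coroutine is left suspended and never resumed again, so its
     * stack can be freed safely. */
    free(stack);
    return resumes;
}
```

Calling run_coroutine_demo() from a main() switches the process between the normal stack and the malloc'ed one on every swapcontext call, which is the kind of stack switching the qcow2 code started to rely on with the commit above.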