I tried bisecting it again today. I think you meant 145e11e840500e04a4d0a624918bb17596be19e9 as the "M" point (the large merge), not b195043003d90ea4027ea01cc7a6c974ac915108 (which is the first commit in that merge) ;) Actually I had done the same already, I just didn't post it to the bug report. But anyway.
The first bad commit inside the merge that shows the bad behaviour is:
commit 68d100e905453ebbeea8e915f4f18a2bd4339fe8
Author: Kevin Wolf <email address hidden>
Date: Thu Jun 30 17:42:09 2011 +0200
qcow2: Use coroutines
Note: so far the issue happens only when qcow2 is in use, either directly or indirectly via -snapshot.
Also note that it only happens within the kvm tree, i.e.:
git checkout 68d100e905453ebbeea8e915f4f18a2bd4339fe8
git merge --no-commit 145e11e840500e04a4d0a624918bb17596be19e9^ # the pre-merge point
I tried to debug it further but without much success.
I'm attaching an strace of the above source (with a few debug fprintf(stderr)s added to the qcow2 source around the lock/unlock calls) running a WinXP guest, from the point where I hit the "Reboot" button up to the point where it stalls. Search for "qcow2: mutex_unlock" from the _end_ of the file.
What I also observed: it looks like it merely loses an interrupt somewhere. It is enough to Ctrl+Z/fg it, or to attach/detach strace to the process, for it to get "unstuck" and continue executing. Attaching gdb also makes it unstuck, as demonstrated initially.
This qcow2 commit may actually be innocent: it just started using coroutines, and that's where the bug/problem might be. Say, a 64-bit kernel gets confused by the process switching stacks, for example? I dunno.
Note again that switching to an alternative coroutine implementation makes the issue go away. I.e., it only happens with the makecontext & Co. coroutine implementation, and the commit which git bisect found to be "guilty" is the one which starts using coroutines.
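For anyone not familiar with what "switching stack" means here, below is a minimal sketch of a makecontext/swapcontext-style coroutine. It is not the qemu code, just an illustration of the technique under my own assumptions: the names run_coroutine_steps, co_entry and STACK_SIZE are made up for this example. The coroutine runs on its own malloc'ed stack, and every enter/yield swaps the stack pointer and registers of the process in userspace.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size chosen for this sketch */

static ucontext_t main_ctx;      /* saved context of the caller */
static ucontext_t co_ctx;        /* saved context of the coroutine */
static int steps_done;           /* number of times the coroutine body ran */

/* Coroutine body: do one step of "work", then yield back to the caller. */
static void co_entry(void)
{
    for (;;) {
        steps_done++;
        swapcontext(&co_ctx, &main_ctx);   /* yield: back to the caller's stack */
    }
}

/* Enter the coroutine n times; return how many steps it executed. */
int run_coroutine_steps(int n)
{
    void *stack = malloc(STACK_SIZE);
    assert(stack != NULL);

    steps_done = 0;
    getcontext(&co_ctx);                   /* initialise before makecontext */
    co_ctx.uc_stack.ss_sp = stack;         /* coroutine gets its own stack */
    co_ctx.uc_stack.ss_size = STACK_SIZE;
    co_ctx.uc_link = &main_ctx;            /* where to go if co_entry returned */
    makecontext(&co_ctx, co_entry, 0);

    for (int i = 0; i < n; i++) {
        swapcontext(&main_ctx, &co_ctx);   /* enter or resume the coroutine */
    }

    free(stack);                           /* coroutine is suspended, never resumed */
    return steps_done;
}
```

Calling run_coroutine_steps(3) enters the coroutine three times and each entry runs exactly one step before yielding. Whether this kind of stack switching has anything to do with the stall is still only a guess on my side.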