I'm actually beginning to wonder if there is a flaw in the tempest tests rather than in QEMU. The "Unable to read from monitor: Connection reset by peer" error message can indicate that a second thread has killed QEMU while the first thread is still talking to it - so this is an alternative idea to explore vs my previous QEMU-SEGV bug theory.
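To illustrate the failure mode I have in mind, here is a minimal hand-written sketch against the libvirt Python bindings (not code from Nova or tempest; the guest name 'testvm' and disk target 'vda' are made up) of two threads racing on the same guest:

import threading
import libvirt

# Assumes a locally running libvirtd with a guest named 'testvm' whose
# first disk has target 'vda'; both names are purely illustrative.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('testvm')

def killer():
    # Simulates the racing 'Terminate instance' request: destroy() tears
    # down the QEMU process, and with it the monitor socket.
    dom.destroy()

t = threading.Thread(target=killer)
t.start()

try:
    # Meanwhile the snapshot thread is still issuing monitor commands; if
    # destroy() wins the race, this can fail with "Unable to read from
    # monitor: Connection reset by peer" rather than a clean error.
    dom.blockJobAbort('vda', 0)
except libvirt.libvirtError as e:
    print(e)

t.join()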
I've been examining the screen-n-cpu.log file to see what happens with instance 90c79adf-4df1-497c-a786-13bdc5cca98d, which is the one with the virDomainBlockJobAbort() error trace.
First I see the snapshot process starting
2014-06-24 22:51:24.314 INFO nova.virt.libvirt.driver [req-e4651efe-7c84-4a57-bbb1-88b107d4a282 ImagesOneServerTestJSON-967160715 ImagesOneServerTestJSON-32972017] [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Beginning live snapshot process
Then I see something killing this very same instance
2014-06-24 22:54:40.255 AUDIT nova.compute.manager [req-218dba14-516c-4805-9908-b55cd73a00e5 ImagesOneServerTestJSON-967160715 ImagesOneServerTestJSON-32972017] [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Terminating instance
And a lifecycle event to show that it was killed
2014-06-24 22:54:51.033 16186 INFO nova.compute.manager [-] Lifecycle event 1 on VM 90c79adf-4df1-497c-a786-13bdc5cca98d
Then we see the snapshot process crash & burn
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 646, in blockJobAbort
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] libvirtError: Unable to read from monitor: Connection reset by peer
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d]
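For context, the live snapshot path drives a QEMU block-copy job and then aborts it once the copy has caught up, roughly like the following sketch (a simplification using the libvirt Python bindings, not the actual Nova code; the guest name and file paths are placeholders):

import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('testvm')  # illustrative guest name

# Placeholder paths; in Nova these point at the instance's disk and the
# snapshot delta file.
disk_path = '/var/lib/libvirt/images/guest-disk.qcow2'
delta_path = '/var/lib/libvirt/images/snapshot-delta.qcow2'

# Start a shallow block-copy of the disk into the delta file.
flags = (libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
         libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
         libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)
dom.blockRebase(disk_path, delta_path, 0, flags)

# Poll until the copy catches up with the guest's writes.
while True:
    info = dom.blockJobInfo(disk_path, 0)
    if not info or info.get('cur') == info.get('end'):
        break
    time.sleep(0.5)

# This is the call that blew up in the trace above: if another request
# has destroyed the guest in the meantime, the monitor socket is gone
# and libvirt raises "Unable to read from monitor: ...".
dom.blockJobAbort(disk_path, 0)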
So this looks very much to me like something in the test is killing the instance while the snapshot is still in progress.
Now, as for why this doesn't affect non-live snapshots we were testing before...
For non-live snapshots, we issue a 'managedSave' call, which terminates the guest. Then we do the snapshot process. Then we start the guest up again from the managed save image. My guess is that this racing 'Terminate instance' call is happening while the guest is already shut down and hence does not cause a failure of the test suite when doing a non-live snapshot (or at least the window in which the race could hit is dramatically smaller).
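In other words, the non-live ordering looks roughly like this (my own sketch, again with an illustrative guest name; copy_disk_to_image() is a hypothetical placeholder for the snapshot copy/upload step):

import libvirt

def copy_disk_to_image(path):
    # Hypothetical placeholder for copying/uploading the disk image.
    pass

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('testvm')  # illustrative guest name

# The guest is down for the whole copy, so a racing destroy mostly hits
# a QEMU process that is already gone.
dom.managedSave(0)   # shuts the guest down, saving its state to disk
copy_disk_to_image('/var/lib/libvirt/images/guest-disk.qcow2')
dom.create()         # restarts the guest from the managed save image

The live snapshot path, by contrast, keeps the guest and its monitor socket up for the entire copy, which is exactly the window where a racing destroy can land.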
So based on the sequence in the screen-n-cpu.log file, my money is currently on a race in the test scripts: something explicitly kills the instance while the snapshot is being taken, and the non-live snapshot code is simply not exposed to that race.