Comment 18 for bug 1334398

Daniel Berrange (berrange) wrote:

I'm actually beginning to wonder if there is a flaw in the tempest tests rather than in QEMU. The "Unable to read from monitor: Connection reset by peer" error message can indicate that a second thread has killed QEMU while the first thread is still talking to it, so that gives us an alternative theory to explore alongside my previous QEMU-SEGV idea.
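
As a rough illustration of that failure mode, here is a minimal standalone sketch (not the tempest test and not nova code) of one thread destroying a guest while another thread is mid-way through a monitor-backed libvirt call. The domain name and disk target are placeholders, and the timing is contrived to make the race easy to hit:

import threading
import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000001')   # placeholder guest name

def killer():
    # Stand-in for the racing "Terminating instance" request
    time.sleep(0.1)
    dom.destroy()

threading.Thread(target=killer).start()

try:
    # Any monitor-backed call will do; blockJobAbort matches the trace below
    dom.blockJobAbort('vda', 0)                # 'vda' is an assumed disk target
except libvirt.libvirtError as e:
    # If QEMU goes away mid-call, this typically reports
    # "Unable to read from monitor: Connection reset by peer"
    # rather than a clean "domain is not running" error
    print(e)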

I've been examining the screen-n-cpu.log file to see what happens with instance 90c79adf-4df1-497c-a786-13bdc5cca98d, which is the one with the virDomainBlockJob error trace.

First I see the snapshot process starting

2014-06-24 22:51:24.314 INFO nova.virt.libvirt.driver [req-e4651efe-7c84-4a57-bbb1-88b107d4a282 ImagesOneServerTestJSON-967160715 ImagesOneServerTestJSON-32972017] [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Beginning live snapshot process

Then I see something killing this very same instance

2014-06-24 22:54:40.255 AUDIT nova.compute.manager [req-218dba14-516c-4805-9908-b55cd73a00e5 ImagesOneServerTestJSON-967160715 ImagesOneServerTestJSON-32972017] [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Terminating instance

And a lifecycle event to show that it was killed

2014-06-24 22:54:51.033 16186 INFO nova.compute.manager [-] Lifecycle event 1 on VM 90c79adf-4df1-497c-a786-13bdc5cca98d

Then we see the snapshot process crash & burn

2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 646, in blockJobAbort
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] libvirtError: Unable to read from monitor: Connection reset by peer
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d]
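
For context, the live snapshot path that hits this is roughly the sequence below. This is a paraphrased sketch of nova's _live_snapshot, not the exact code; the flag combination and polling details are from memory. The blockJobAbort at the end is the call in the traceback above, and it is the one that blows up if the guest has been destroyed while the block copy was still running:

import time
import libvirt

def live_snapshot_sketch(domain, disk_path, disk_delta):
    # Start a shallow copy of the disk contents into the delta file
    domain.blockRebase(disk_path, disk_delta, 0,
                       libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
                       libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                       libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)

    # Poll until the copy has caught up with the guest's writes
    while True:
        info = domain.blockJobInfo(disk_path, 0)
        if info and info['cur'] == info['end']:
            break
        time.sleep(0.5)

    # End the mirroring job; this is the blockJobAbort in the trace above,
    # and the whole copy loop is the window in which a racing destroy can land
    domain.blockJobAbort(disk_path, 0)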

So this looks very much to me like something in the test is killing the instance while the snapshot is still in progress.

Now, as for why this doesn't affect the non-live snapshots we were testing before...

For non-live snapshots, we issue a 'managedSave' call, which terminates the guest. Then we do the snapshot process, and then start the guest up again from the managed save image. My guess is that this racing 'Terminate instance' call is happening while the guest is already shut down, and hence does not cause a failure of the test suite when doing a non-live snapshot (or at least the window in which the race could hit is dramatically smaller).
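
By contrast, a simplified sketch of that non-live path would look something like this (copy_disk_image is a hypothetical placeholder for the offline image copy, not a real nova helper). The guest is shut down for the whole copy, so a racing terminate mostly lands on a domain that is already stopped:

import libvirt

def cold_snapshot_sketch(domain, disk_path, snapshot_target):
    # Saves guest state to disk and shuts the guest down
    domain.managedSave(0)

    # Offline copy of the disk image while nothing is writing to it
    copy_disk_image(disk_path, snapshot_target)   # hypothetical helper

    # Starting the domain again automatically restores it from the managed save image
    domain.create()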

So based on the sequence in the screen-n-cpu.log file, my money is currently on a race in the test scripts where something explicitly kills the instance while the snapshot is being taken, and the non-live snapshot code simply is not exposed to that race.