Attached volume partition sometimes doesn't show up in guest if previous volume detach was not complete

Bug #1492026 reported by Matt Riedemann
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

I've been trying to unskip the test_stamp_pattern scenario in tempest and it's been randomly failing (but easily recreated): when we attach the 2nd volume (created from the volume snapshot) to the 2nd instance (booted from the instance snapshot), the attach API operation succeeds but the device partition never shows up in the guest.

https://review.openstack.org/#/c/218355/6,publish

There is dmesg debug output when the test fails and it clearly shows there is no vdb attach:

http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/console.html#_2015-09-02_17_48_53_884

Checking the libvirtd logs, test_stamp_pattern is running against domain instance-00000062, and I don't see any call to virDomainAttachDeviceFlags for that domain.

I do see it for others, like:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/libvirt/libvirtd.txt.gz#_2015-09-02_17_25_13_297

So it appears that libvirt just completely fails to do the attach device call.

From the n-cpu log, here is where we've come back from connecting the iSCSI volume via os-brick and we get the device XML for the attach call to libvirt:

http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/screen-n-cpu.txt.gz#_2015-09-02_17_42_10_274

2015-09-02 17:42:10.274 DEBUG nova.virt.libvirt.config [req-6b7dde2f-4466-4cbc-a0c6-de05880c3275 tempest-TestStampPattern-1375625978 tempest-TestStampPattern-227460230] Generated XML ('<disk type="block" device="disk">\n <driver name="qemu" type="raw" cache="none"/>\n <source dev="/dev/disk/by-path/ip-10.0.0.126:3260-iscsi-iqn.2010-10.org.openstack:volume-c0604f38-eb3d-4840-8ee3-6b0444faddb5-lun-1"/>\n <target bus="virtio" dev="vdb"/>\n <serial>c0604f38-eb3d-4840-8ee3-6b0444faddb5</serial>\n</disk>\n',) to_xml /opt/stack/new/nova/nova/virt/libvirt/config.py:82

But there is no corresponding virDomainAttachDeviceFlags call for that domain in the libvirtd logs - and it doesn't fail either; it's as if the call is just silently ignored.
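
For context, this attach step boils down to handing the generated XML to virDomainAttachDeviceFlags. A minimal standalone sketch with the python libvirt bindings (the URI, domain name and trimmed-down disk XML are taken from the logs above; this is an illustration, not nova's actual code path):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000062')

    disk_xml = """
    <disk type="block" device="disk">
      <driver name="qemu" type="raw" cache="none"/>
      <source dev="/dev/disk/by-path/ip-10.0.0.126:3260-iscsi-iqn.2010-10.org.openstack:volume-c0604f38-eb3d-4840-8ee3-6b0444faddb5-lun-1"/>
      <target bus="virtio" dev="vdb"/>
    </disk>
    """

    # For a running domain, nova asks for the change on both the live
    # domain and its persistent config.
    flags = (libvirt.VIR_DOMAIN_AFFECT_LIVE |
             libvirt.VIR_DOMAIN_AFFECT_CONFIG)
    dom.attachDeviceFlags(disk_xml, flags)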

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/220357

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: domain.attachDeviceFlags randomly doesn't do anything in libvirt

My current theory now is that test_stamp_pattern is racing with the volume detach on the first instance/volume combo: the partition isn't gone from the guest when we create the volume snapshot (even though cinder says the volume is detached and 'available'). So when we go to attach the volume snapshot to the 2nd instance, maybe there is something on the volume that makes libvirt think the partition is already there, so it doesn't do the attach. It's still odd that libvirt's virDomainAttachDeviceFlags wouldn't log a debug statement at all in that case - even if it was just ignoring the request thinking the device is already there - or raise an error. To test this, we can add some code to test_stamp_pattern to wait for the partition to be gone from the instance after we've detached the volume but before we take the snapshot of the volume, as in the sketch below.
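
A rough sketch of what that wait could look like, in the spirit of tempest's existing scenario helpers (ssh_client is a stand-in for tempest's remote client; the device name, timeout and interval are illustrative assumptions):

    import time

    def wait_for_partition_gone(ssh_client, device='vdb', timeout=60, interval=2):
        # Poll the guest until the detached device disappears from
        # /proc/partitions, rather than trusting cinder's 'available' state.
        deadline = time.time() + timeout
        while time.time() < deadline:
            partitions = ssh_client.exec_command('cat /proc/partitions')
            if device not in partitions:
                return
            time.sleep(interval)
        raise RuntimeError('%s still present in guest after %ss' % (device, timeout))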

Revision history for this message
Tony Breeds (o-tony) wrote :

After more digging I don't think this is a libvirt problem. It *could* be a qemu problem or a guest kernel problem.

Looking at the logs starting at:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/console.html#_2015-09-02_17_48_53_884

You can trace the process of test_stamp_pattern() down to:
215: server_from_snapshot = self._boot_image(snapshot_image['id'])
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/console.html#_2015-09-02_17_48_54_005
where we see:
2015-09-02 17:41:56,742 8653 DEBUG [tempest.scenario.manager] Creating a server (name: tempest-TestStampPattern-811565472, image: eb0ab3d7-8cbe-479e-9e7f-1d106f02ddf4, flavor: 42)

Checking the n-cpu logs we see:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/screen-n-cpu.txt.gz#_2015-09-02_17_42_01_490
"
  <uuid>3ac46e96-ab71-4808-9fe0-4d97f16f9dca</uuid>
  <name>instance-00000062</name>
  <memory>65536</memory>
  <vcpu>1</vcpu>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="12.0.0"/>
      <nova:name>tempest-TestStampPattern-811565472</nova:name>
"
So we booted instance-00000062 to fulfil this request.

We then get down to:
226: device = self._attach_volume(server_from_snapshot,
227: volume_from_snapshot)
228: self._wait_for_volume_available_on_the_system(ip_for_snapshot, device)

We see the virDomainAttachDeviceFlags for instance-00000062 at:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/libvirt/libvirtd.txt.gz#_2015-09-02_17_42_10_274

and from:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/libvirt/libvirtd.txt.gz#_2015-09-02_17_42_10_829
to:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/libvirt/libvirtd.txt.gz#_2015-09-02_17_42_10_841

We can see what appears to be libvirt successfully attaching the device to the guest. The last command shows the new disk in the peripherals list.

I suspect qemu or the guest kernel, as we can see at:
http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/console.html#_2015-09-02_17_48_53_953

that the new PCI device was never visible to the guest; it should have been 00:05.0.
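
As an aside, the same kind of guest-side probe could confirm whether the disk's PCI function ever appeared at all (again, ssh_client is a hypothetical remote client, and the slot is the one expected above):

    def pci_device_visible(ssh_client, slot='00:05.0'):
        # A virtio-blk disk surfaces as a PCI function in the guest; if the
        # expected slot never shows up in lspci output, the guest never saw
        # the hotplugged device at all.
        return slot in ssh_client.exec_command('lspci')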

Revision history for this message
Tony Breeds (o-tony) wrote :

I'm totally fine with being told how I messed up that investigation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/220357
Reason: Turns out that this wasn't the problem, we found the attach call in the libvirtd logs, I just didn't see it the first time:

http://logs.openstack.org/55/218355/6/check/gate-tempest-dsvm-full/e0da291/logs/libvirt/libvirtd.txt.gz#_2015-09-02_17_42_10_274

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: domain.attachDeviceFlags randomly doesn't do anything in libvirt

In talking with danpb about this today, our current theory is that, due to the async nature of libvirt's virDomainDetachDeviceFlags call and the fact that tempest isn't waiting for the partition to be gone from the guest before it starts taking snapshots of the volume and instance, the instance may not be 'clean' when the instance snapshot is taken, so when we boot the 2nd instance from the snapshot image things go wonky.

danpb said there is a VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED event in libvirt >= 1.1.1 which could be used to monitor for when the device is actually detached.

(9:42:11 AM) danpb: mriedem: we'd have to wire up the libvirt event thread to handle vc.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED, myDomainEventDeviceRemovedCallback, None)
(9:42:24 AM) danpb: currently we only monitor for libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
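
A rough sketch of what that wiring might look like with the python libvirt bindings (illustrative only; per danpb above, nova's event thread today registers just the lifecycle event):

    import threading
    import libvirt

    # The default event loop implementation must be registered before
    # opening the connection.
    libvirt.virEventRegisterDefaultImpl()

    def device_removed(conn, dom, dev_alias, opaque):
        # Fires once the guest has actually released the device, i.e. the
        # point at which the detach can be considered complete.
        print('%s: device %s removed' % (dom.name(), dev_alias))

    conn = libvirt.open('qemu:///system')
    conn.domainEventRegisterAny(
        None,                                        # all domains
        libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED,  # libvirt >= 1.1.1
        device_removed,
        None)

    def run_event_loop():
        while True:
            libvirt.virEventRunDefaultImpl()

    t = threading.Thread(target=run_event_loop)
    t.daemon = True
    t.start()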

summary: - domain.attachDeviceFlags randomly doesn't do anything in libvirt
+ Attached volume partition sometimes doesn't show up in guest if previous
+ volume detach was not complete
Changed in nova:
importance: Medium → Low
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like bug 1374508 might be related to the issues I'm seeing in test_stamp_pattern.
