Mismatch happens between BDM and domain XML if instance does not respond to ACPI hotplug during detach/attach.

Bug #1374508 reported by git-harry
This bug affects 6 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Medium      Ryan McNair  -
  (Nominated for Liberty by Roman Podoliaka)
nova (Ubuntu)             Fix Released  Medium      Unassigned   -
  Trusty                  Triaged       Medium      Hua Zhang    -

Bug Description

tempest.api.compute.servers.test_server_rescue_negative:ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume

This test passes; however, it fails to clean up properly after itself - the detach completes without running the necessary iscsiadm commands.

In nova.virt.libvirt.volume.LibvirtISCSIVolumeDriver.disconnect_volume, the list returned by self.connection._get_all_block_devices includes the host_device, which means that self._disconnect_from_iscsi_portal is never run.
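
A paraphrased sketch of that guard (not the exact nova source; the standalone function signature here is simplified for illustration):

    def disconnect_volume(driver, connection_info, host_device):
        # List every block device currently referenced by any libvirt
        # domain on this host.
        devices = driver.connection._get_all_block_devices()
        if host_device not in devices:
            # Never reached in this bug: the stale domain XML still
            # lists the volume, so host_device is found in devices and
            # the iSCSI portal logout is skipped.
            driver._disconnect_from_iscsi_portal(connection_info['data'])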

You can see evidence of this in /etc/iscsi/nodes, as well as in errors logged in /var/log/syslog.

I'm guessing there is a race between the unrescue and the detach within libvirt. In nova.virt.libvirt.driver.LibvirtDriver.detach_volume, if I put a sleep before virt_dom.detachDeviceFlags(xml, flags), the detach appears to work properly; however, a sleep after that line does not appear to have any effect.
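
A hypothetical version of that experiment, using the libvirt Python bindings directly rather than nova's code (the domain name, disk XML, and timings below are placeholders):

    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-0000002e')  # placeholder domain
    disk_xml = """<disk type='block' device='disk'>
                    <target dev='vdb' bus='virtio'/>
                  </disk>"""

    time.sleep(10)  # a sleep *before* the detach appears to avoid the race
    dom.detachDeviceFlags(disk_xml,
                          libvirt.VIR_DOMAIN_AFFECT_LIVE |
                          libvirt.VIR_DOMAIN_AFFECT_CONFIG)
    # a sleep *after* the call has no observable effect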

Revision history for this message
Boden R (boden) wrote :
Revision history for this message
git-harry (git-harry) wrote :

I don't see how this is a duplicate of #1317298 - that bug looks unrelated. This is not an issue with tempest timing out.

Revision history for this message
git-harry (git-harry) wrote :

I've removed the duplicate link.

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Mitsuhiro Tanino (mitsuhiro-tanino) wrote :

I added https://bugs.launchpad.net/cinder/+bug/1440201 as a duplicate of this bug.

Revision history for this message
Mitsuhiro Tanino (mitsuhiro-tanino) wrote :

Further investigation shows some differences between a normal volume detach and this error case.

(1) libvirtd.log of Normal volume detach
http://paste.openstack.org/show/202077/

(2) libvirtd.log of this error case during volume detach
http://paste.openstack.org/show/202078/

In log (1), I can see the following messages, which mention removing the virtio-disk. However, I couldn't see the same messages in log (2).
--- snip ---
2015-04-10 14:46:33.098+0000: 787: debug : qemuProcessHandleDeviceDeleted:1349 : Device virtio-disk1 removed from domain 0x7f7da0026b60 instance-0000002e
2015-04-10 14:46:33.098+0000: 787: debug : qemuDomainRemoveDiskDevice:2444 : Removing disk virtio-disk1 from domain 0x7f7da0026b60 instance-0000002e

I also confirmed the following two points.
(3) Log in to the instance via ssh after a reboot to confirm the instance's device list.
(4) Check the virsh dumpxml output to confirm the device list in the instance's domain XML.

As for (3), I couldn't see the device file (/dev/vdb) for the Cinder volume. This means that the device was already detached from the viewpoint of qemu-kvm.
As for (4), the Cinder volume configuration still remained in the instance's XML. So libvirt kept the Cinder volume information even though there was no device in the VM.
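
For reference, check (4) can also be scripted with the libvirt Python bindings (the domain name below is taken from the log snippet above):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-0000002e')
    # The stale Cinder volume still appears as a <target dev='vdb' .../>
    # entry even though /dev/vdb no longer exists inside the guest.
    print("dev='vdb'" in dom.XMLDesc(0))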

Very odd status...

Revision history for this message
Jay Bryant (jsbryant) wrote :

Mitsuhiro,

I have now been able to recreate this in devstack. I think I understand why I was initially having trouble recreating it. On the system where we originally saw this, the VM was running RHEL instead of Cirros, so the reboot was taking much longer. The window to hit the timing issue is much smaller in devstack with Cirros. Anyway, now that I can recreate it, I plan to spend some time working on this next week.

Thanks for your continued assistance on this as well!
Jay

Revision history for this message
Tomoki Sekiyama (tsekiyama) wrote :

This issue (self._disconnect_from_iscsi_portal is not called) also happens when the guest kernel has crashed.

(1) In the guest, crash the kernel.
  echo 1 > /proc/sys/kernel/sysrq   # enable the magic SysRq interface
  echo c > /proc/sysrq-trigger      # trigger an immediate kernel crash

(2) Run "nova volume-detach <server> <volume-id>"

This might be because the guest doesn't respond promptly to the ACPI hotplug event.
While the guest OS is (re)booting, there is also a window during which the guest doesn't respond to the event.

Changed in nova:
assignee: nobody → Mitsuhiro Tanino (mitsuhiro-tanino)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/175067

Changed in nova:
status: Confirmed → In Progress
summary: - Possible race between unrescue and detach
+ Mismatch is happened between BDM and domain XML If instance does not
+ respond to ACPI hotplug during detach/attach.
summary: - Mismatch is happened between BDM and domain XML If instance does not
- respond to ACPI hotplug during detach/attach.
+ Mismatch happens between BDM and domain XML If instance does not respond
+ to ACPI hotplug during detach/attach.
Changed in nova:
assignee: Mitsuhiro Tanino (mitsuhiro-tanino) → Jay Bryant (jsbryant)
Changed in nova:
assignee: Jay Bryant (jsbryant) → Mitsuhiro Tanino (mitsuhiro-tanino)
Mike Perez (thingee)
tags: added: volumes
Revision history for this message
Ryan McNair (rdmcnair) wrote :

I was able to reproduce this issue by issuing a soft reboot in Horizon and immediately detaching the volume with "nova volume-detach". On the Cinder side, the volume gets detached from the instance and is no longer "in-use".

However, the VM still has the mapping to the device at /dev/vdb, and trying to write to the device results in a "write error: No space on disk". Also, libvirt still has this device mapping in "libvirt.driver._get_all_block_devices()".
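
As a rough approximation of what that _get_all_block_devices call gathers (the XML parsing below is illustrative, not the exact nova code):

    import xml.etree.ElementTree as ET
    import libvirt

    conn = libvirt.open('qemu:///system')
    devices = []
    for dom in conn.listAllDomains():
        tree = ET.fromstring(dom.XMLDesc(0))
        for source in tree.findall('devices/disk/source'):
            if source.get('dev'):
                devices.append(source.get('dev'))
    # After the failed detach, the volume's host device still shows up
    # here, which is why the iSCSI portal logout is skipped.
    print(devices)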

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/227851

Changed in nova:
assignee: Mitsuhiro Tanino (mitsuhiro-tanino) → Ryan McNair (rdmcnair)
Matt Riedemann (mriedem)
tags: added: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Mitsuhiro Tanino (<email address hidden>) on branch: master
Review: https://review.openstack.org/175067
Reason: https://review.openstack.org/227851 covers this problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/227851
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3a3fb3cfb2c41ad182545e47649ff12a4f3a743e
Submitter: Jenkins
Branch: master

commit 3a3fb3cfb2c41ad182545e47649ff12a4f3a743e
Author: Ryan McNair <email address hidden>
Date: Thu Sep 24 22:20:23 2015 +0000

    Add retry logic for detaching device using LibVirt

    Add retry logic for removing a disk device from the LibVirt
    guest domain XML. This is needed because immediately after a guest
    reboot, libvirtmod.virDomainDetachDeviceFlags() will silently fail
    to remove the mapping from the guest domain. The async retry
    behavior is done in Guest and is generic so it can be re-used by any other
    detaches which hit this same race condition.

    Change-Id: I983f80822a5c210929f33e1aa348a0fef91e890b
    Closes-Bug: #1374508
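
The merged change adds the retry in nova's Guest abstraction; a simplified sketch of the approach follows (names, error handling, and timings are illustrative, not the exact implementation):

    import time
    import libvirt

    def detach_device_with_retry(dom, device_xml, target_dev,
                                 retries=8, delay=2):
        """Re-issue the detach until the device leaves the domain XML."""
        flags = (libvirt.VIR_DOMAIN_AFFECT_LIVE |
                 libvirt.VIR_DOMAIN_AFFECT_CONFIG)
        for _ in range(retries):
            if target_dev not in dom.XMLDesc(0):
                return  # the guest finally processed the hot-unplug
            try:
                dom.detachDeviceFlags(device_xml, flags)
            except libvirt.libvirtError:
                pass  # e.g. the device vanished between check and call
            time.sleep(delay)
        raise RuntimeError('device %s still attached after %d attempts'
                           % (target_dev, retries))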

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/259685

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/260127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/kilo)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/260127
Reason: Not appropriate for stable/kilo.

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b2

This issue was fixed in the openstack/nova 13.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/kilo)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/260127
Reason: This doesn't help on https://review.openstack.org/#/c/218355/ so dropping it again.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/259685
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9701921f1e921885d7c65bfde9b3cc384389fe74
Submitter: Jenkins
Branch: stable/liberty

commit 9701921f1e921885d7c65bfde9b3cc384389fe74
Author: Ryan McNair <email address hidden>
Date: Thu Sep 24 22:20:23 2015 +0000

    Add retry logic for detaching device using LibVirt

    Add retry logic for removing a disk device from the LibVirt
    guest domain XML. This is needed because immediately after a guest
    reboot, libvirtmod.virDomainDetachDeviceFlags() will silently fail
    to remove the mapping from the guest domain. The async retry
    behavior is done in Guest and is generic so it can be re-used by any other
    detaches which hit this same race condition.

    Change-Id: I983f80822a5c210929f33e1aa348a0fef91e890b
    Closes-Bug: #1374508
    (cherry picked from commit 3a3fb3cfb2c41ad182545e47649ff12a4f3a743e)

tags: added: in-stable-liberty
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 12.0.2

This issue was fixed in the openstack/nova 12.0.2 release.

Eric Desrochers (slashd)
Changed in nova (Ubuntu Trusty):
assignee: nobody → Hua Zhang (zhhuabj)
importance: Undecided → Medium
James Page (james-page)
Changed in nova (Ubuntu):
status: New → Fix Released
importance: Undecided → Medium
Changed in nova (Ubuntu Trusty):
status: New → Triaged