Mismatch happens between BDM and domain XML If instance does not respond to ACPI hotplug during detach/attach.

Bug #1374508 reported by git-harry on 2014-09-26
This bug affects 6 people
Affects                    Status         Importance   Assigned to    Milestone
OpenStack Compute (nova)   Fix Released   Medium       Ryan McNair
  Nominated for Liberty by Roman Podoliaka
nova (Ubuntu)              Fix Released   Medium       Unassigned
  Trusty                   Triaged        Medium       Hua Zhang

Bug Description

tempest.api.compute.servers.test_server_rescue_negative:ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume

This test passes; however, it fails to clean up properly after itself: the detach completes, but without running the necessary iscsiadm commands.

In nova.virt.libvirt.volume.LibvirtISCSIVolumeDriver.disconnect_volume the list returned by self.connection._get_all_block_devices includes the host_device which means that self._disconnect_from_iscsi_portal is never run.
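The check described above can be illustrated with a simplified sketch. This is not the nova source: the function name `should_logout_of_portal` and the device paths used with it are hypothetical, and the prefix matching is an assumption about how devices on the same iSCSI target are grouped. It only shows why a stale entry in the list of in-use block devices suppresses the portal logout.

```python
def should_logout_of_portal(host_device, devices_in_use):
    """Return True when no in-use device belongs to the same iSCSI target.

    host_device: by-path name of the volume being disconnected.
    devices_in_use: device paths still referenced by domains on this host
    (the role played by _get_all_block_devices in the description above).
    """
    # Assumed naming: <target part>-lun-<n>. Everything up to and including
    # "-lun-" identifies the target the device belongs to.
    target_prefix = host_device.rsplit("-lun-", 1)[0] + "-lun-"
    still_used = [dev for dev in devices_in_use
                  if dev.startswith(target_prefix)]
    # If the detached device is still listed, still_used is non-empty and
    # the logout step is skipped.
    return not still_used
```

With an empty `devices_in_use` the function returns True and the logout would run; if the list still contains `host_device`, as reported in this bug, it returns False.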

You can see evidence of this in /etc/iscsi/nodes, as well as errors logged in /var/log/syslog.

I'm guessing there is a race between the unrescue and the detach within libvirt. In nova.virt.libvirt.driver.LibvirtDriver.detach_volume, if I put in a sleep before virt_dom.detachDeviceFlags(xml, flags), the detach appears to work properly. If I sleep after that line instead, it does not appear to have any effect.

Boden R (boden) wrote :
git-harry (git-harry) wrote :

I don't see how this is a duplicate of #1317298 - that bug looks unrelated. This is not an issue with tempest timing out.

git-harry (git-harry) wrote :

I've removed the duplicate link.

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed

I added https://bugs.launchpad.net/cinder/+bug/1440201 as a duplicate of this bug.

On further investigation, there are some differences between a normal volume detach and this error case.

(1) libvirtd.log of a normal volume detach
http://paste.openstack.org/show/202077/

(2) libvirtd.log of this error case during volume detach
http://paste.openstack.org/show/202078/

In log (1), I can see the following entries, which mention removing the virtio disk. However, I could not find the same entries in log (2).
--- snip ---
2015-04-10 14:46:33.098+0000: 787: debug : qemuProcessHandleDeviceDeleted:1349 : Device virtio-disk1 removed from domain 0x7f7da0026b60 instance-0000002e
2015-04-10 14:46:33.098+0000: 787: debug : qemuDomainRemoveDiskDevice:2444 : Removing disk virtio-disk1 from domain 0x7f7da0026b60 instance-0000002e

I also checked the following two points.
(3) Log in to the instance via ssh after the reboot and check the instance's device list.
(4) Check the virsh dumpxml output for the instance's device list.

As for (3), I couldn't see the device file (/dev/vdb) for the Cinder volume. This means that the device was already detached from the viewpoint of qemu-kvm.
As for (4), the Cinder volume configuration still remained in the XML of the instance. So libvirt kept the Cinder volume information even though there was no device in the VM.

Very odd status...
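A rough sketch of how checks (3) and (4) could be compared programmatically. The sample XML below is invented for illustration (in practice it would come from `virsh dumpxml`), and the helper names `disk_targets` and `stale_targets` are hypothetical. It lists the disk targets libvirt still has in the domain XML and reports those the guest no longer sees.

```python
import xml.etree.ElementTree as ET

# Invented, minimal domain XML; real output of "virsh dumpxml" is much larger.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <source file='/path/to/instance/disk'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='block' device='disk'>
      <source dev='/dev/disk/by-path/example-lun-1'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_targets(domain_xml):
    """Return the target device names (vda, vdb, ...) found in the XML."""
    root = ET.fromstring(domain_xml)
    return [target.get("dev")
            for target in root.findall("./devices/disk/target")]


def stale_targets(domain_xml, guest_devices):
    """Return targets present in the domain XML but missing in the guest."""
    return sorted(set(disk_targets(domain_xml)) - set(guest_devices))
```

For the state described in this comment, the guest only has vda, so `stale_targets(SAMPLE_DOMAIN_XML, ["vda"])` reports vdb as the entry libvirt kept.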

Jay Bryant (jsbryant) wrote :

Mitsuhiro,

I have now been able to recreate this in devstack. I think I understand why I was initially having trouble recreating it. On the system where we originally saw this, the VM was running RHEL instead of Cirros, so the reboot was taking much longer. The window to hit the timing issue is much smaller in devstack with Cirros. Anyway, now that I can recreate it, I plan to spend some time working on this next week.

Thanks for your continued assistance on this as well!
Jay

Tomoki Sekiyama (tsekiyama) wrote :

This issue (self._disconnect_from_iscsi_portal is not called) also happens when the guest kernel has crashed.

(1) In the guest, crash the kernel.
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger

(2) Run "nova volume-detach <server> <volume-id>"

This might be because the guest does not respond properly to the ACPI hotplug event.
While the guest OS is (re)booting, there is also a window in which the guest does not respond to the event.

Changed in nova:
assignee: nobody → Mitsuhiro Tanino (mitsuhiro-tanino)

Fix proposed to branch: master
Review: https://review.openstack.org/175067

Changed in nova:
status: Confirmed → In Progress
summary: - Possible race between unrescue and detach
+ Mismatch is happened between BDM and domain XML If instance does not
+ respond to ACPI hotplug during detach/attach.
summary: - Mismatch is happened between BDM and domain XML If instance does not
- respond to ACPI hotplug during detach/attach.
+ Mismatch happens between BDM and domain XML If instance does not respond
+ to ACPI hotplug during detach/attach.
Changed in nova:
assignee: Mitsuhiro Tanino (mitsuhiro-tanino) → Jay Bryant (jsbryant)
Changed in nova:
assignee: Jay Bryant (jsbryant) → Mitsuhiro Tanino (mitsuhiro-tanino)
Mike Perez (thingee) on 2015-07-08
tags: added: volumes
Ryan McNair (rdmcnair) wrote :

I was able to reproduce this issue by issuing a soft reboot in Horizon and immediately detaching a volume using "nova detach". On the Cinder side, the volume gets detached from the instance and is no longer "in-use".

However, the VM still has the mapping to the device at "/dev/vdb", and trying to write to the device results in "write error: No space on disk". Also, libvirt still reports this device mapping in "libvirt.driver._get_all_block_devices()".

Fix proposed to branch: master
Review: https://review.openstack.org/227851

Changed in nova:
assignee: Mitsuhiro Tanino (mitsuhiro-tanino) → Ryan McNair (rdmcnair)
Matt Riedemann (mriedem) on 2015-10-05
tags: added: libvirt

Change abandoned by Mitsuhiro Tanino (<email address hidden>) on branch: master
Review: https://review.openstack.org/175067
Reason: https://review.openstack.org/227851 covers this problem.

Reviewed: https://review.openstack.org/227851
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3a3fb3cfb2c41ad182545e47649ff12a4f3a743e
Submitter: Jenkins
Branch: master

commit 3a3fb3cfb2c41ad182545e47649ff12a4f3a743e
Author: Ryan McNair <email address hidden>
Date: Thu Sep 24 22:20:23 2015 +0000

    Add retry logic for detaching device using LibVirt

    Add retry logic for removing a disk device from the LibVirt
    guest domain XML. This is needed because immediately after a guest
    reboot, libvirtmod.virDomainDetachDeviceFlags() will silently fail
    to remove the mapping from the guest domain. The async retry
    behavior is done in Guest and is generic so it can be re-used by any other
    detaches which hit this same race condition.

    Change-Id: I983f80822a5c210929f33e1aa348a0fef91e890b
    Closes-Bug: #1374508
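
The retry idea in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: `detach_with_retry`, `DeviceDetachFailed`, and the `detach` / `device_present` callables are hypothetical names, and the sketch retries synchronously with a growing delay, whereas the commit describes an async retry.

```python
import time


class DeviceDetachFailed(Exception):
    """Raised when the device is still present after all attempts."""


def detach_with_retry(detach, device_present, max_attempts=5,
                      delay=1.0, sleep=time.sleep):
    """Request a detach until the device is gone; return attempts used.

    detach: callable that asks the hypervisor to detach the device.
    device_present: callable returning True while the device is still
    listed for the guest (for example, in the domain XML).
    """
    for attempt in range(1, max_attempts + 1):
        detach()
        # The detach request can be silently ignored (for example, while
        # the guest is rebooting), so verify instead of trusting the call.
        if not device_present():
            return attempt
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # back off before asking again
    raise DeviceDetachFailed(
        "device still present after %d attempts" % max_attempts)
```

The point of the design is that success is decided by re-checking the device list rather than by the return of the detach call, since the call itself reports no error in the failure case described in this bug.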

Changed in nova:
status: In Progress → Fix Released

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/260127
Reason: Not appropriate for stable/kilo.

This issue was fixed in the openstack/nova 13.0.0.0b2 development milestone.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/260127
Reason: This doesn't help on https://review.openstack.org/#/c/218355/ so dropping it again.

Reviewed: https://review.openstack.org/259685
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9701921f1e921885d7c65bfde9b3cc384389fe74
Submitter: Jenkins
Branch: stable/liberty

commit 9701921f1e921885d7c65bfde9b3cc384389fe74
Author: Ryan McNair <email address hidden>
Date: Thu Sep 24 22:20:23 2015 +0000

    Add retry logic for detaching device using LibVirt

    Add retry logic for removing a disk device from the LibVirt
    guest domain XML. This is needed because immediately after a guest
    reboot, libvirtmod.virDomainDetachDeviceFlags() will silently fail
    to remove the mapping from the guest domain. The async retry
    behavior is done in Guest and is generic so it can be re-used by any other
    detaches which hit this same race condition.

    Change-Id: I983f80822a5c210929f33e1aa348a0fef91e890b
    Closes-Bug: #1374508
    (cherry picked from commit 3a3fb3cfb2c41ad182545e47649ff12a4f3a743e)

tags: added: in-stable-liberty

This issue was fixed in the openstack/nova 12.0.2 release.

Eric Desrochers (slashd) on 2018-01-23
Changed in nova (Ubuntu Trusty):
assignee: nobody → Hua Zhang (zhhuabj)
importance: Undecided → Medium
James Page (james-page) on 2018-01-25
Changed in nova (Ubuntu):
status: New → Fix Released
importance: Undecided → Medium
Changed in nova (Ubuntu Trusty):
status: New → Triaged