Bug #1715569 “Live migration fails with an attached non-bootable...” : Bugs : OpenStack Compute (nova)

Revision history for this message

Christoph Fiehe (fiehe) wrote on 2017-09-07:

#1

This is the corresponding libvirt configuration file of the vm.

#############################
instance-00000030.xml
#############################

<domain type='kvm'>
  <name>instance-00000030</name>
  <uuid>58538546-09f7-4efb-abe1-4eaf008fe756</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="16.0.0"/>
      <nova:name>test-vm02</nova:name>
      <nova:creationTime>2017-09-07 06:59:17</nova:creationTime>
      <nova:flavor name="m1.small">
        <nova:memory>2048</nova:memory>
        <nova:disk>20</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>1</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="dddfba8e02f746799a6408a523e6cd25">admin</nova:user>
        <nova:project uuid="ed2d2efd86dd40e7a45491d8502318d3">demo</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="b2b7cfc7-ce74-421d-98a9-79768f36e5e1"/>
    </nova:instance>
  </metadata>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <shares>1024</shares>
  </cputune>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>OpenStack Foundation</entry>
      <entry name='product'>OpenStack Nova</entry>
      <entry name='version'>16.0.0</entry>
      <entry name='serial'>74bf283c-f6a8-c600-0293-894b59a50724</entry>
      <entry name='uuid'>58538546-09f7-4efb-abe1-4eaf008fe756</entry>
      <entry name='family'>Virtual Machine</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-xenial'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>SandyBridge</model>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='utc'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none' discard='unmap'/>
      <auth username='cinder'>
        <secret type='ceph' uuid='e5479084-e43e-4a1e-959a-b9989f02e632'/>
      </auth>
      <source protocol='rbd' name='vms/58538546-09f7-4efb-abe1-4eaf008fe756_disk'>
        <host name='10.30.200.141' port='6789'/>
        <host name='10.30.200.142' port='6789'/>
        <host name='10.30.200.143' port='6789'/>
      </source>
      <target dev='sda' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
      <auth username='cinder'>
        <secret type='ceph' uuid='e5479084-e43e-4a1e-959a-b9989f02e632'/>
      </auth>
      <source protocol='rbd' name='volumes/volume-208ea468-b937-42fd-a1f3-167212a84357'>
        <host name='10.30.200.141' port='6789'/>
        <host name='10.30.200.142' port='6789'/>
        <host name='10.30.200.143' port='6789'/>
      </source>
      <target dev='sdb' bus='scsi'/>
      <serial>208ea468-b937-42fd-a1f3-167212a84357</serial>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='piix3-uhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <interface type='bridge'>
      <mac address='02:05:69:7e:5d:2e'/>
      <source bridge='qbracefe15a-cb'/>
      <target dev='tapacefe15a-cb'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <log file='/var/lib/nova/instances/58538546-09f7-4efb-abe1-4eaf008fe756/console.log' append='off'/>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <log file='/var/lib/nova/instances/58538546-09f7-4efb-abe1-4eaf008fe756/console.log' append='off'/>
      <target type='serial' port='0'/>
    </console>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
</domain>
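
To make the drive addresses in this domain XML easier to compare, here is a minimal sketch that lists each disk target with its address. It is only an illustration using Python's standard library; the `disk_addresses` helper and the trimmed `DOMAIN_XML` sample are hypothetical and are not part of nova or libvirt.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two <disk> entries from the domain XML above
# (only the elements needed for the address comparison are kept).
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='network' device='disk'>
      <target dev='sda' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='network' device='disk'>
      <target dev='sdb' bus='scsi'/>
      <serial>208ea468-b937-42fd-a1f3-167212a84357</serial>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
  </devices>
</domain>
"""


def disk_addresses(domain_xml):
    """Map each disk target (e.g. 'sda') to (controller, bus, target, unit).

    Hypothetical helper for illustration; disks without a drive address
    map to None.
    """
    root = ET.fromstring(domain_xml)
    result = {}
    for disk in root.findall("./devices/disk"):
        dev = disk.find("target").get("dev")
        addr = disk.find("address")
        if addr is None or addr.get("type") != "drive":
            result[dev] = None
            continue
        result[dev] = tuple(
            int(addr.get(key)) for key in ("controller", "bus", "target", "unit")
        )
    return result


ADDRESSES = disk_addresses(DOMAIN_XML)
print(ADDRESSES)
```

In this configuration the root disk sda sits at unit 0 and the attached volume sdb at unit 1 on the same controller.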

Sylvain Bauza (sylvain-bauza) on 2017-09-12

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Low

Mike Lowe (jomlowe) wrote on 2017-10-05:

#2

I am also having this problem with Ocata.

Surya Seetharaman (tssurya) on 2017-10-16

Changed in nova:
assignee:	nobody → Surya Seetharaman (tssurya)

OpenStack Infra (hudson-openstack) wrote on 2017-11-06: Fix proposed to nova (master)

#3

Fix proposed to branch: master
Review: https://review.openstack.org/518022

Changed in nova:
assignee:	Surya Seetharaman (tssurya) → Mike Lowe (jomlowe)
status:	Confirmed → In Progress

melanie witt (melwitt) on 2017-11-07

tags:	added: libvirt

Jay Pipes (jaypipes) wrote on 2017-11-07:

#4

Doing some investigation into this bug, I've found that the code that creates the XML snippet for disk devices hasn't changed in 2+ years.

The code is now in nova/virt/libvirt/migration.py, but it used to be in libvirt/driver.py. Sahid's refactoring patches to break driver.py into separate modules moved a lot of this code around but did not functionally change it.

This particular section was moved in the following patch:

https://github.com/openstack/nova/commit/23191308ff1bb151e441e398d69e46fb845f4e65

I remember reviewing the patch in Gerrit. :)

https://review.openstack.org/#/c/299490/

What is strange to me is that this bug is only recently being reported. I can't imagine that this has been buggy for 2+ years and nobody has previously noticed.

Mike Lowe (jomlowe) wrote on 2017-11-07:

#5

I had attributed the change to RHEL 7 derived distros moving from libvirt 2.x to 3.x. My problems coincided with the upgrade from CentOS 7.3 to 7.4. This may not be the case as the original report was with Ubuntu 16.04.

Mike Lowe (jomlowe) wrote on 2017-11-07:

#6

Writing out the XML before and after the update shows that the update function changes the address.

It turns this:

<disk device="disk" type="network">
<driver cache="writeback" name="qemu" type="raw"/>
<auth username="cinder">
  <secret type="ceph" uuid="1a790a26-dd49-4825-8d16-3dd627cf05a9"/>
</auth>
<source name="cinder-volumes/volume-de400476-b68a-45a2-b04f-739313f42bef" protocol="rbd">
  <host name="172.16.128.101" port="6789"/>
  <host name="172.16.128.121" port="6789"/>
  <host name="172.16.128.130" port="6789"/>
</source>
<target bus="scsi" dev="sdb"/>
<serial>de400476-b68a-45a2-b04f-739313f42bef</serial>
<address bus="0" controller="0" target="0" type="drive" unit="1"/>
</disk>

Into this:

melanie witt (melwitt) wrote on 2017-11-21:

#7

Thanks Mike for providing detail of the change in the XML after the update function.

I dug around in the code and, based on the XML excerpt you show, this appears to be a regression caused by this patch [1] (which was also backported to ocata), which changed the libvirt driver's _get_volume_config function. It added code to set the 'address' XML element when 'bus' == 'scsi'; previously the 'address' element was left unmodified. The patch sets the address.controller element to 0 and sets the address.unit element only if it has been set in disk_info. It is only set in disk_info in the driver's attach_volume function, which is not called during a live migration, so the unit is not set to 1 as expected. Instead we see it set to 0 in the original error message in comment 1:

"Live Migration failure: unsupported configuration: Target device drive address 0:0:0 does not match source 0:0:1: libvirtError: unsupported configuration: Target device drive address 0:0:0 does not match source 0:0:1"

So it seems to be the right thing to do to avoid modifying the address element for live migration only (as the proposed patch is doing).

[1] https://review.openstack.org/#/c/459741
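
The mismatch described above can be sketched in a few lines of Python. This is not nova's or libvirt's code: `build_drive_address` and `check_abi` are hypothetical stand-ins, the sketch assumes that an unset unit ends up as 0 (as the error message suggests), and it assumes the three numbers in the message are controller, bus, and unit.

```python
def build_drive_address(disk_info):
    """Hypothetical stand-in for the address logic described above.

    The controller is always 0, and the unit is taken from disk_info only
    when it is present. Assumption: an unset unit ends up as 0.
    """
    address = {"controller": 0, "bus": 0, "unit": 0}
    if disk_info.get("bus") == "scsi" and "unit" in disk_info:
        address["unit"] = disk_info["unit"]
    return address


def format_address(address):
    """Render an address as controller:bus:unit (assumed field order)."""
    return "{controller}:{bus}:{unit}".format(**address)


def check_abi(source, target):
    """Return a message shaped like the one in the report, or None on a match."""
    if source != target:
        return "Target device drive address %s does not match source %s" % (
            format_address(target),
            format_address(source),
        )
    return None


# Address of the attached volume on the source host (unit 1).
SOURCE = {"controller": 0, "bus": 0, "unit": 1}

# attach_volume path: disk_info carries the unit, so the address matches.
ATTACH_ERROR = check_abi(SOURCE, build_drive_address({"bus": "scsi", "unit": 1}))

# Live-migration path: disk_info has no unit, so the target falls back to 0.
MIGRATE_ERROR = check_abi(SOURCE, build_drive_address({"bus": "scsi"}))

print(ATTACH_ERROR)
print(MIGRATE_ERROR)
```

Under these assumptions only the migration path produces a mismatch, which is consistent with the volume attaching fine but the live migration failing.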

Jay Pipes (jaypipes) wrote on 2017-12-01:

#8

Good sleuthing, Melanie, I concur with your conclusion.

OpenStack Infra (hudson-openstack) on 2017-12-14

Changed in nova:
assignee:	Mike Lowe (jomlowe) → melanie witt (melwitt)

melanie witt (melwitt) on 2017-12-14

Changed in nova:
assignee:	melanie witt (melwitt) → Mike Lowe (jomlowe)

OpenStack Infra (hudson-openstack) wrote on 2017-12-15: Fix merged to nova (master)

#9

Reviewed: https://review.openstack.org/518022
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b196857f04e41dde294eaacc2c1a991807ecc829
Submitter: Zuul
Branch: master

commit b196857f04e41dde294eaacc2c1a991807ecc829
Author: Mike Lowe <email address hidden>
Date: Mon Nov 6 11:06:46 2017 -0500

live-mig: keep disk device address same

    During live migration disk devices are updated with the latest
    block device mapping information for volumes. Previously this
    relied on libvirt to assign addresses in order after the already
    assigned devices like the root disk had been accounted for. In
    the latest libvirt the unassigned devices are allocated first which
    makes the root disk address double allocated causing the migration to
    fail. A running instance should never have the hardware addresses
    of its disks changed mid flight. While disk address changes during
    live migration produce fatal errors for the operator it would likely
    cause errors inside the instance and unexpected behavior if the device
    addresses change during cold migration. With this disk addresses are no
    longer updated with block device mapping information while every
    other element of the disk definition for a volume is updated.

Closes-Bug: 1715569

Change-Id: I17af9848f4c0edcbcb101b30e45ca4afa93dcdbb

Changed in nova:
status:	In Progress → Fix Released
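
The idea in the commit message above (update every element of a volume's disk definition except its address) can be illustrated roughly as follows. This is a sketch only, not the merged patch: `update_disk_keep_address` is a hypothetical helper, and the `volumes/volume-old` and `volumes/volume-new` source names are placeholders.

```python
import copy
import xml.etree.ElementTree as ET


def update_disk_keep_address(old_disk, new_disk):
    """Return a copy of new_disk that carries old_disk's <address>, if any.

    Hypothetical helper: everything comes from the regenerated definition
    except the address, which is preserved from the running instance.
    """
    merged = copy.deepcopy(new_disk)
    for stale in merged.findall("address"):
        merged.remove(stale)
    old_address = old_disk.find("address")
    if old_address is not None:
        merged.append(copy.deepcopy(old_address))
    return merged


# Disk as defined on the running instance (unit 1).
OLD = ET.fromstring(
    "<disk type='network' device='disk'>"
    "<source protocol='rbd' name='volumes/volume-old'/>"
    "<target dev='sdb' bus='scsi'/>"
    "<address type='drive' controller='0' bus='0' target='0' unit='1'/>"
    "</disk>"
)

# Regenerated disk definition whose address no longer matches (unit 0).
NEW = ET.fromstring(
    "<disk type='network' device='disk'>"
    "<source protocol='rbd' name='volumes/volume-new'/>"
    "<target dev='sdb' bus='scsi'/>"
    "<address type='drive' controller='0' bus='0' target='0' unit='0'/>"
    "</disk>"
)

MERGED = update_disk_keep_address(OLD, NEW)
print(ET.tostring(MERGED, encoding="unicode"))
```

The merged element takes its source from the regenerated definition while keeping the drive address the guest already sees.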

OpenStack Infra (hudson-openstack) wrote on 2018-01-26: Fix included in openstack/nova 17.0.0.0b3

#10

This issue was fixed in the openstack/nova 17.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote on 2018-02-07: Fix proposed to nova (stable/pike)

#11

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/541642

OpenStack Infra (hudson-openstack) wrote on 2018-02-08: Fix merged to nova (stable/pike)

#12

Reviewed: https://review.openstack.org/541642
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d91e881890933363550ffb9e114c6df4c32a4090
Submitter: Zuul
Branch: stable/pike

commit d91e881890933363550ffb9e114c6df4c32a4090
Author: Mike Lowe <email address hidden>
Date: Mon Nov 6 11:06:46 2017 -0500

live-mig: keep disk device address same

    During live migration disk devices are updated with the latest
    block device mapping information for volumes. Previously this
    relied on libvirt to assign addresses in order after the already
    assigned devices like the root disk had been accounted for. In
    the latest libvirt the unassigned devices are allocated first which
    makes the root disk address double allocated causing the migration to
    fail. A running instance should never have the hardware addresses
    of its disks changed mid flight. While disk address changes during
    live migration produce fatal errors for the operator it would likely
    cause errors inside the instance and unexpected behavior if the device
    addresses change during cold migration. With this disk addresses are no
    longer updated with block device mapping information while every
    other element of the disk definition for a volume is updated.

Closes-Bug: 1715569

Change-Id: I17af9848f4c0edcbcb101b30e45ca4afa93dcdbb
(cherry picked from commit b196857f04e41dde294eaacc2c1a991807ecc829)

OpenStack Infra (hudson-openstack) wrote on 2018-02-15: Fix included in openstack/nova 16.1.0

#13

This issue was fixed in the openstack/nova 16.1.0 release.

OpenStack Infra (hudson-openstack) wrote on 2018-02-27: Fix merged to nova (stable/ocata)

#14

Reviewed: https://review.openstack.org/541904
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b753f0b5ea734432d2f45fb11884ea1b110ed77
Submitter: Zuul
Branch: stable/ocata

commit 6b753f0b5ea734432d2f45fb11884ea1b110ed77
Author: Mike Lowe <email address hidden>
Date: Mon Nov 6 11:06:46 2017 -0500

live-mig: keep disk device address same

    During live migration disk devices are updated with the latest
    block device mapping information for volumes. Previously this
    relied on libvirt to assign addresses in order after the already
    assigned devices like the root disk had been accounted for. In
    the latest libvirt the unassigned devices are allocated first which
    makes the root disk address double allocated causing the migration to
    fail. A running instance should never have the hardware addresses
    of its disks changed mid flight. While disk address changes during
    live migration produce fatal errors for the operator it would likely
    cause errors inside the instance and unexpected behavior if the device
    addresses change during cold migration. With this disk addresses are no
    longer updated with block device mapping information while every
    other element of the disk definition for a volume is updated.

Closes-Bug: 1715569

Change-Id: I17af9848f4c0edcbcb101b30e45ca4afa93dcdbb
(cherry picked from commit b196857f04e41dde294eaacc2c1a991807ecc829)

OpenStack Infra (hudson-openstack) wrote on 2018-05-02: Fix included in openstack/nova 15.1.1

#15

This issue was fixed in the openstack/nova 15.1.1 release.

Affects	Status	Importance	Assigned to
OpenStack Compute (nova)	Fix Released	Low	Mike Lowe
Ocata	Fix Committed	Undecided	Christian Berendt
Pike	Fix Committed	Undecided	Sahid Orentino

Live migration fails with an attached non-bootable Cinder volume (Pike)
