After cold-migration of a volume-backed instance, disk.info file leftover on source host

Bug #1769131 reported by James McCarthy
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Medium      Matt Riedemann
Ocata                     Fix Committed  Medium      Matt Riedemann
Pike                      Fix Committed  Medium      Matt Riedemann
Queens                    Fix Committed  Medium      Matt Riedemann

Bug Description

Tested using kolla-ansible, with kolla images stable/queens.

In this setup there are only two compute nodes, with cinder/lvm for storage.

A cirros instance is created, on compute02, and cold-migrated to compute01.

At the step where it's awaiting confirmation, the following files can be found:

compute01
/var/lib/docker/volumes/nova_compute/_data/instances
\-- 371e669b-0f15-49f2-9a84-bd1e89f34294
    \-- console.log

1 directory, 1 file

compute02
/var/lib/docker/volumes/nova_compute/_data/instances
\-- 371e669b-0f15-49f2-9a84-bd1e89f34294_resize
    \-- console.log

1 directory, 1 file

After confirming the migrate/resize, this becomes:

compute01
/var/lib/docker/volumes/nova_compute/_data/instances
\-- 371e669b-0f15-49f2-9a84-bd1e89f34294
    \-- console.log

1 directory, 1 file

compute02
/var/lib/docker/volumes/nova_compute/_data/instances
\-- 371e669b-0f15-49f2-9a84-bd1e89f34294
    \-- disk.info

1 directory, 1 file

The following log shows that, right after the _resize directory is cleaned up, a disk.info file is written on the source host, where it is then left behind:

http://paste.openstack.org/show/720358/

2018-05-04 12:55:10.818 7 DEBUG nova.compute.manager [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294] Going to confirm migration 4 do_confirm_resize /usr/lib/python2.7/site-packages/nova/compute/manager.py:3684
2018-05-04 12:55:11.032 7 DEBUG oslo_concurrency.lockutils [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] Acquired semaphore "refresh_cache-371e669b-0f15-49f2-9a84-bd1e89f34294" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-05-04 12:55:11.033 7 DEBUG nova.network.neutronv2.api [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294] _get_instance_nw_info() _get_instance_nw_info /usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py:1383
2018-05-04 12:55:11.034 7 DEBUG nova.objects.instance [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] Lazy-loading 'info_cache' on Instance uuid 371e669b-0f15-49f2-9a84-bd1e89f34294 obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:1052
2018-05-04 12:55:11.406 7 DEBUG nova.network.base_api [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294] Updating instance_info_cache with network_info: [{"profile": {}, "ovs_interfaceid": "ba8646b4-fa66-46b9-9f7e-a83163668bb8", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.8"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "8.8.8.8"}], "routes": [], "cidr": "10.0.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "7ea70c4f74c24199b14df0a570b6f93e", "mtu": 1450}, "id": "f1d14432-5a26-4b0a-89e7-6683bd7d2477", "label": "demo-net"}, "devname": "tapba8646b4-fa", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:d9:91:37", "active": true, "type": "ovs", "id": "ba8646b4-fa66-46b9-9f7e-a83163668bb8", "qbg_params": null}] update_instance_cache_with_nw_info /usr/lib/python2.7/site-packages/nova/network/base_api.py:48
2018-05-04 12:55:11.426 7 DEBUG oslo_concurrency.lockutils [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] Releasing semaphore "refresh_cache-371e669b-0f15-49f2-9a84-bd1e89f34294" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-05-04 12:55:11.426 7 DEBUG oslo_concurrency.processutils [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] Running cmd (subprocess): rm -rf /var/lib/nova/instances/371e669b-0f15-49f2-9a84-bd1e89f34294_resize execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
2018-05-04 12:55:11.459 7 DEBUG oslo_concurrency.processutils [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] CMD "rm -rf /var/lib/nova/instances/371e669b-0f15-49f2-9a84-bd1e89f34294_resize" returned: 0 in 0.033s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
2018-05-04 12:55:11.462 7 DEBUG oslo_concurrency.lockutils [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] Lock "/var/lib/nova/instances/371e669b-0f15-49f2-9a84-bd1e89f34294/disk.info" acquired by "nova.virt.libvirt.imagebackend.write_to_disk_info_file" :: waited 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-05-04 12:55:11.462 7 DEBUG oslo_concurrency.lockutils [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] Lock "/var/lib/nova/instances/371e669b-0f15-49f2-9a84-bd1e89f34294/disk.info" released by "nova.virt.libvirt.imagebackend.write_to_disk_info_file" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-05-04 12:55:11.482 7 DEBUG nova.virt.libvirt.vif [req-510561e2-eabb-4c37-8fc3-d56e9f50bf6e 64ca3042227c48ea84d77461b14b8acb 7ea70c4f74c24199b14df0a570b6f93e - default default] vif_type=ovs instance=Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=True,availability_zone='nova',cell_name=None,cleaned=False,config_drive='',created_at=2018-05-04T11:53:34Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,device_metadata=<?>,disable_terminate=False,display_description=None,display_name='cirros',ec2_ids=<?>,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,flavor=Flavor(2),host='compute01',hostname='cirros',id=2,image_ref='',info_cache=InstanceInfoCache,instance_type_id=2,kernel_id='',key_data='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDGUK82VwkyVJoNMlF5EhqfVaI+yOfhaMnMWbLg6ZDeKQjJ5gTZ7DvAfF2NOsyY9kYVo2ik3tQiVJmTyQbc4zQZN327PgnHm4HkmQUTx/pz57VfXzpGg1lQviGW8wr7+Pd7euMcazt2eZB3l4dL1xL96dSIoBzK0wG7B4KTEk8uWMhFkhVFrH6LQBtJSkrTkPWIafc3fv3XNhs4bo9mXQNOpWW6pJogx6FiPYqkFtynHdJTX0a/JcdJxmu/HPSwT3QmZ3yyasHQ1+It6Htte0P1ThdsMKavRD9Gki/r5cB2sUxUxbfSFMfiHdry7opefrbvRVU3G1xwKqrd9JdCCDe9 kolla@operator

This file should not be left on the source host.
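The mechanism behind the leftover file can be sketched in miniature: some libvirt image backends (Qcow2, Flat, Ploop) re-create the instance directory and disk.info as a side effect of merely constructing the backend object, so cleanup code that instantiates one resurrects the file it just deleted. A simplified, hypothetical model (toy class, not the actual nova imagebackend code):

```python
import json
import os
import tempfile


class FlatBackend:
    """Toy stand-in for a nova image backend (e.g. Flat/Qcow2/Ploop)
    whose constructor re-creates the instance directory and disk.info
    as a side effect. Hypothetical name; not the real nova class."""

    def __init__(self, instance_dir):
        # Side effect: merely building the object rebuilds on-disk state.
        os.makedirs(instance_dir, exist_ok=True)
        self.disk_info_path = os.path.join(instance_dir, 'disk.info')
        with open(self.disk_info_path, 'w') as f:
            json.dump({}, f)


# Simulate the confirm-resize cleanup: the _resize dir is already gone,
# but the cleanup path instantiates a backend object just to query it...
instance_dir = os.path.join(tempfile.mkdtemp(), 'instance-uuid')
FlatBackend(instance_dir)
print(os.path.exists(os.path.join(instance_dir, 'disk.info')))  # prints True
```

This matches the log above: the `rm -rf ..._resize` succeeds, and immediately afterwards `write_to_disk_info_file` takes a lock on a fresh disk.info in the non-suffixed instance directory.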

For example, attempting to live-migrate back to this host results in a failure:

2018-05-04 13:45:40.546 7 ERROR nova.compute.manager [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7407, in pre_live_migration
2018-05-04 13:45:40.546 7 ERROR nova.compute.manager [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294] raise exception.DestinationDiskExists(path=instance_dir)
2018-05-04 13:45:40.546 7 ERROR nova.compute.manager [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294]
2018-05-04 13:45:40.546 7 ERROR nova.compute.manager [instance: 371e669b-0f15-49f2-9a84-bd1e89f34294] DestinationDiskExists: The supplied disk path (/var/lib/nova/instances/371e669b-0f15-49f2-9a84-bd1e89f34294) already exists, it is expected not to exist.
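The failure path is easy to model: on the destination host, pre_live_migration refuses to proceed if the instance directory already exists, so even a directory holding nothing but a stale disk.info blocks the migration. A hedged sketch of that check (simplified; the real logic lives in nova/virt/libvirt/driver.py):

```python
import os
import tempfile


class DestinationDiskExists(Exception):
    """Mirrors the nova exception raised by pre_live_migration."""


def pre_live_migration_check(instance_dir):
    # Simplified destination-side check: a leftover instance directory
    # (even one containing only a stale disk.info) aborts the live
    # migration before it starts.
    if os.path.exists(instance_dir):
        raise DestinationDiskExists(
            'The supplied disk path (%s) already exists, '
            'it is expected not to exist.' % instance_dir)
    os.makedirs(instance_dir)


base = tempfile.mkdtemp()
stale = os.path.join(base, 'instance-uuid')
os.makedirs(stale)                      # leftover dir from the cold migration
open(os.path.join(stale, 'disk.info'), 'w').close()

try:
    pre_live_migration_check(stale)
except DestinationDiskExists as e:
    print(e)                            # migrating back to this host fails
```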

Tags: libvirt resize
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is likely a side effect of the fix for bug 1728603 https://review.openstack.org/#/c/516395/.

tags: added: libvirt resize
Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
summary: - After cold-migration, disk.info file leftover on source host
+ After cold-migration of a volume-backed instance, disk.info file
+ leftover on source host
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/566367

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
James McCarthy (jmccarthy) wrote :

I tested this review in my environment, and with this code in place I could no longer reproduce the issue raised in bug 1769131.

Note: in this bug, images_type is the default (ceph is not being used).

Similar to above, when cold migrating an instance from compute02 to compute01, at the step where it's awaiting confirmation for resize/migrate, the following files can be found:

compute01
/var/lib/kolla/var/lib/nova/instances/
\-- 0d548887-54bd-4e39-b57d-fdbaa05a31db
    \-- console.log

1 directory, 1 file

compute02
/var/lib/kolla/var/lib/nova/instances/
\-- 0d548887-54bd-4e39-b57d-fdbaa05a31db_resize
    \-- console.log

1 directory, 1 file

After confirming the migrate/resize, this becomes:

compute01
/var/lib/kolla/var/lib/nova/instances/
\-- 0d548887-54bd-4e39-b57d-fdbaa05a31db
    \-- console.log

1 directory, 1 file

compute02
/var/lib/kolla/var/lib/nova/instances/

0 directories, 0 files

Excellent!

No disk.info file is present on the source host (and the instance dir is also gone, as it should be).

I was then able to live-migrate the instance (from compute01 back to compute02), since the pre_live_migration checks now pass.

Revision history for this message
James McCarthy (jmccarthy) wrote :

Follow-on comment:

Tested with patch set 2 of https://review.openstack.org/#/c/566367

As above, I am no longer able to reproduce the issue raised in bug 1769131.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/567623

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/567625

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/567630

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/566367
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8e3385707cb1ced55cd12b1314d8c0b68d354c38
Submitter: Zuul
Branch: master

commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
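The fix described above can be read as a small guard: only the Rbd backend keeps image snapshot clones, so _cleanup_resize can consult a class-level flag and skip constructing the backend object entirely for the other image types, whose constructors would resurrect disk.info. A hedged sketch of the idea (toy classes and names, not the real imagebackend module):

```python
import os
import tempfile


class Image:
    # Mirrors the class-level flag idea in nova's imagebackend: only
    # backends that keep image snapshot clones override this.
    SUPPORTS_CLONE = False


class Flat(Image):
    """Toy flat backend: constructing it re-creates disk.info."""

    def __init__(self, instance_dir):
        os.makedirs(instance_dir, exist_ok=True)
        open(os.path.join(instance_dir, 'disk.info'), 'w').close()


class Rbd(Image):
    SUPPORTS_CLONE = True

    def __init__(self, instance_dir):
        pass  # rbd keeps no local files, so nothing is recreated

    def remove_snap(self, name):
        pass  # the real backend would drop the resize snapshot clone


BACKENDS = {'flat': Flat, 'rbd': Rbd}


def cleanup_resize(images_type, instance_dir):
    # The fix in miniature: check the class attribute first, and only
    # construct the backend object when it can actually own snapshot
    # clones. For flat/qcow2/ploop the constructor (which would
    # recreate disk.info on the source host) is never run.
    backend_cls = BACKENDS[images_type]
    if backend_cls.SUPPORTS_CLONE:
        backend_cls(instance_dir).remove_snap('nova-resize-snapshot')


src_dir = os.path.join(tempfile.mkdtemp(), 'instance-uuid')
cleanup_resize('flat', src_dir)
print(os.path.exists(os.path.join(src_dir, 'disk.info')))  # prints False
```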

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/567623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=174764340d3c965d31143b39af4ab2e8ecefe594
Submitter: Zuul
Branch: stable/queens

commit 174764340d3c965d31143b39af4ab2e8ecefe594
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Conflicts:
          nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(mriedem): The conflict is due to not having change
    Icdd039bb4374269d9da38e7f8d2e15e05ca8aadb in Queens.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
    (cherry picked from commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/567625
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c72a0a7665e96219f0301525edc513dda07b320b
Submitter: Zuul
Branch: stable/pike

commit c72a0a7665e96219f0301525edc513dda07b320b
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
    (cherry picked from commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38)
    (cherry picked from commit 174764340d3c965d31143b39af4ab2e8ecefe594)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/567630
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a16fa14ce47bd2d3a5189047e9bd330a607ef3cc
Submitter: Zuul
Branch: stable/ocata

commit a16fa14ce47bd2d3a5189047e9bd330a607ef3cc
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
    (cherry picked from commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38)
    (cherry picked from commit 174764340d3c965d31143b39af4ab2e8ecefe594)
    (cherry picked from commit c72a0a7665e96219f0301525edc513dda07b320b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.4

This issue was fixed in the openstack/nova 17.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.3

This issue was fixed in the openstack/nova 16.1.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.2

This issue was fixed in the openstack/nova 15.1.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b2

This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.
