nova instance remnant left behind after cold migration completes

Bug #1824858 reported by Wendy Mitchell on 2019-04-15
This bug affects 2 people
Affects                   Importance  Assigned to
OpenStack Compute (nova)  Undecided   Lee Yarwood
StarlingX                 Low         hutianhao27

Bug Description

Brief Description
-----------------
After cold migration to a new worker node, instance remnants are left behind on the source node.

Severity
--------
standard

Steps to Reproduce
------------------
worker nodes compute-1 and compute-2 have label remote-storage enabled
1. Launch instance on compute-1
2. Cold migrate the instance to compute-2
3. Confirm the cold migration so that it completes

Expected Behavior
------------------
Migration to compute-2 and cleanup of the instance files on compute-1

Actual Behavior
----------------
At 16:35:24 cold migration for instance a416ead6-a17f-4bb9-9a96-3134b426b069 completed to compute-2 but the following path is left behind on compute-1
compute-1:/var/lib/nova/instances/a416ead6-a17f-4bb9-9a96-3134b426b069

compute-1:/var/lib/nova/instances$ ls
a416ead6-a17f-4bb9-9a96-3134b426b069 _base locks
a416ead6-a17f-4bb9-9a96-3134b426b069_resize compute_nodes lost+found

compute-1:/var/lib/nova/instances$ ls
a416ead6-a17f-4bb9-9a96-3134b426b069 _base compute_nodes locks lost+found

compute-1:/var/lib/nova/instances$ ls
a416ead6-a17f-4bb9-9a96-3134b426b069 _base compute_nodes locks lost+found

2019-04-15T16:35:24.646749 clear 700.010 Instance tenant2-migration_test-1 owned by tenant2 has been cold-migrated to host compute-2 waiting for confirmation tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:35:24.482575 log 700.168 Cold-Migrate-Confirm complete for instance tenant2-migration_test-1 enabled on host compute-2 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:35:16.815223 log 700.163 Cold-Migrate-Confirm issued by tenant2 against instance tenant2-migration_test-1 owned by tenant2 on host compute-2 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:35:10.030068 clear 700.009 Instance tenant2-migration_test-1 owned by tenant2 is cold migrating from host compute-1 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:35:09.971414 set 700.010 Instance tenant2-migration_test-1 owned by tenant2 has been cold-migrated to host compute-2 waiting for confirmation tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:35:09.970212 log 700.162 Cold-Migrate complete for instance tenant2-migration_test-1 now enabled on host compute-2 waiting for confirmation tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:34:51.637687 set 700.009 Instance tenant2-migration_test-1 owned by tenant2 is cold migrating from host compute-1 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:34:51.637636 log 700.158 Cold-Migrate inprogress for instance tenant2-migration_test-1 from host compute-1 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:34:51.478442 log 700.157 Cold-Migrate issued by tenant2 against instance tenant2-migration_test-1 owned by tenant2 from host compute-1 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical
2019-04-15T16:34:20.181155 log 700.101 Instance tenant2-migration_test-1 is enabled on host compute-1 tenant=7f1d4223-3341-428a-9188-55614770e676.instance=a416ead6-a17f-4bb9-9a96-3134b426b069 critical

see nova-compute.log (compute-1)
compute-1 nova-compute log

[instance: a416ead6-a17f-4bb9-9a96-3134b426b069 claimed and spawned here on compute-1]

{"log":"2019-04-15 16:34:04,617.617 60908 INFO nova.compute.claims [req-f1195bbb-d5b0-4a75-a598-ff287d247643 3fd3229d3e6248cf9b5411b2ecec86e9 7f1d42233341428a918855614770e676 - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Claim successful on node compute-1\n","stream":"stdout","time":"2019-04-15T16:34:04.617671485Z"}
{"log":"2019-04-15 16:34:07,836.836 60908 INFO nova.virt.libvirt.driver [req-f1195bbb-d5b0-4a75-a598-ff287d247643 3fd3229d3e6248cf9b5411b2ecec86e9 7f1d42233341428a918855614770e676 - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Ignoring supplied device name: /dev/vda. Libvirt can't honour user-supplied dev names\n","stream":"stdout","time":"2019-04-15T16:34:07.836900621Z"}
{"log":"2019-04-15 16:34:08,000.000 60908 INFO nova.virt.block_device [req-f1195bbb-d5b0-4a75-a598-ff287d247643 3fd3229d3e6248cf9b5411b2ecec86e9 7f1d42233341428a918855614770e676 - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Booting with volume 78db19a5-b699-407e-bbfb-7addeec8abdc at /dev/vda\n","stream":"stdout","time":"2019-04-15T16:34:08.00120626Z"}

{"log":"2019-04-15 16:34:15,416.416 60908 INFO nova.virt.libvirt.driver [req-f1195bbb-d5b0-4a75-a598-ff287d247643 3fd3229d3e6248cf9b5411b2ecec86e9 7f1d42233341428a918855614770e676 - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Creating image\n","stream":"stdout","time":"2019-04-15T16:34:15.421820953Z"}
{"log":"2019-04-15 16:34:18,225.225 60908 INFO nova.compute.manager [-] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] VM Started (Lifecycle Event)\n","stream":"stdout","time":"2019-04-15T16:34:18.226158194Z"}

{"log":"2019-04-15 16:34:18,256.256 60908 INFO nova.compute.manager [req-a1def21d-e38b-4cbe-a527-2182e896aa3e - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] VM Paused (Lifecycle Event)\n","stream":"stdout","time":"2019-04-15T16:34:18.256867098Z"}

{"log":"2019-04-15 16:34:18,310.310 60908 INFO nova.compute.manager [req-a1def21d-e38b-4cbe-a527-2182e896aa3e - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] During sync_power_state the instance has a pending task (spawning). Skip.\n","stream":"stdout","time":"2019-04-15T16:34:18.31100974Z"}

{"log":"2019-04-15 16:34:20,092.092 60908 INFO nova.compute.manager [req-a1def21d-e38b-4cbe-a527-2182e896aa3e - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] VM Resumed (Lifecycle Event)\n","stream":"stdout","time":"2019-04-15T16:34:20.093107332Z"}

{"log":"2019-04-15 16:34:20,095.095 60908 INFO nova.virt.libvirt.driver [-] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Instance spawned successfully.\n","stream":"stdout","time":"2019-04-15T16:34:20.095798258Z"}

{"log":"2019-04-15 16:34:20,095.095 60908 INFO nova.compute.manager [req-f1195bbb-d5b0-4a75-a598-ff287d247643 3fd3229d3e6248cf9b5411b2ecec86e9 7f1d42233341428a918855614770e676 - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Took 4.68 seconds to spawn the instance on the hypervisor.\n","stream":"stdout","time":"2019-04-15T16:34:20.09609328Z"}

{"log":"2019-04-15 16:34:20,149.149 60908 INFO nova.compute.manager [req-a1def21d-e38b-4cbe-a527-2182e896aa3e - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] During sync_power_state the instance has a pending task (spawning). Skip.\n","stream":"stdout","time":"2019-04-15T16:34:20.149969266Z"}
{"log":"2019-04-15 16:34:20,149.149 60908 INFO nova.compute.manager [req-a1def21d-e38b-4cbe-a527-2182e896aa3e - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] VM Resumed (Lifecycle Event)\n","stream":"stdout","time":"2019-04-15T16:34:20.150156576Z"}
{"log":"2019-04-15 16:34:20,207.207 60908 INFO nova.compute.manager [req-f1195bbb-d5b0-4a75-a598-ff287d247643 3fd3229d3e6248cf9b5411b2ecec86e9 7f1d42233341428a918855614770e676 - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Took 15.62 seconds to build instance.\n","stream":"stdout","time":"2019-04-15T16:34:20.207528448Z"}

{"log":"2019-04-15 16:34:59,463.463 60908 INFO nova.virt.libvirt.driver [req-b1f1706c-82a8-4367-9459-d67a7eb32f34 834c06b5424947ae8c6c0882b12909b2 0d8b1b1835d444fd81d66c013cdbef4c - default default] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Instance shutdown successfully after 2 seconds.\n","stream":"stdout","time":"2019-04-15T16:34:59.463555102Z"}

{"log":"2019-04-15 16:34:59,466.466 60908 INFO nova.virt.libvirt.driver [-] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Instance destroyed successfully.\n","stream":"stdout","time":"2019-04-15T16:34:59.466515441Z"}
{"log":"2019-04-15 16:35:00,660.660 60908 WARNING nova.compute.manager [req-36b50051-d7d9-4182-84ad-c0f42252b627 a3fa585069f54c3899d56994fe5bc701 90af2d0de6d74a59b484ab34698569e7 - 6b9be1b871584ab0b1f767f0dc33e402 6b9be1b871584ab0b1f767f0dc33e402] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Received unexpected event network-vif-unplugged-740df7da-cbf8-4f04-9ab1-35b297e7fc95 for instance with vm_state active and task_state resize_migrated.\n","stream":"stdout","time":"2019-04-15T16:35:00.662300475Z"}

{"log":"2019-04-15 16:35:05,938.938 60908 WARNING nova.compute.manager [req-4d3715fe-43b3-4e1e-b99f-3db3b5e60583 a3fa585069f54c3899d56994fe5bc701 90af2d0de6d74a59b484ab34698569e7 - 6b9be1b871584ab0b1f767f0dc33e402 6b9be1b871584ab0b1f767f0dc33e402] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Received unexpected event network-vif-unplugged-99f054a9-7469-449a-9623-17c7870dfd00 for instance with vm_state active and task_state resize_finish.\n","stream":"stdout","time":"2019-04-15T16:35:05.938732459Z"}
{"log":"2019-04-15 16:35:05,941.941 60908 INFO nova.compute.resource_tracker [req-bafc620b-ee9c-4f91-a544-39921c43e529 - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Updating resource usage from migration ccda6a14-82e3-4965-a61f-a04e2e6cebf5\n","stream":"stdout","time":"2019-04-15T16:35:05.941818948Z"}

[cold migrated to compute-2 here]
{"log":"2019-04-15 16:35:10,212.212 60908 WARNING nova.compute.manager [req-ef46a902-0961-4446-bde8-2f235f5da2c8 a3fa585069f54c3899d56994fe5bc701 90af2d0de6d74a59b484ab34698569e7 - 6b9be1b871584ab0b1f767f0dc33e402 6b9be1b871584ab0b1f767f0dc33e402] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Received unexpected event network-vif-plugged-740df7da-cbf8-4f04-9ab1-35b297e7fc95 for instance with vm_state resized and task_state None.\n","stream":"stdout","time":"2019-04-15T16:35:10.212577086Z"}
{"log":"2019-04-15 16:35:12,263.263 60908 WARNING nova.compute.manager [req-420edc8f-cb8b-48bb-b036-46fec11bbad2 a3fa585069f54c3899d56994fe5bc701 90af2d0de6d74a59b484ab34698569e7 - 6b9be1b871584ab0b1f767f0dc33e402 6b9be1b871584ab0b1f767f0dc33e402] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] Received unexpected event network-vif-plugged-99f054a9-7469-449a-9623-17c7870dfd00 for instance with vm_state resized and task_state None.\n","stream":"stdout","time":"2019-04-15T16:35:12.263926304Z"}
{"log":"2019-04-15 16:35:14,228.228 60908 INFO nova.compute.manager [-] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] VM Stopped (Lifecycle Event)\n","stream":"stdout","time":"2019-04-15T16:35:14.228512171Z"}

{"log":"2019-04-15 16:35:14,289.289 60908 INFO nova.compute.manager [req-84c2832a-2f74-41f0-be83-c54b0afcd1fb - - - - -] [instance: a416ead6-a17f-4bb9-9a96-3134b426b069] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-04-15T16:35:14.289417626Z"}
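
The nova-compute lines above are in a JSON-per-line container log format, with `log`, `stream`, and `time` keys on each line. As a minimal sketch (not part of the original report), the messages for one instance could be pulled out like this. The function name `instance_messages` is hypothetical, and the sketch assumes each log line is a complete JSON object.

```python
import json


def instance_messages(lines, instance_uuid):
    """Return the 'log' text of JSON log lines that mention instance_uuid."""
    marker = "[instance: %s]" % instance_uuid
    messages = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and hand-written annotations
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip truncated or malformed lines
        text = record.get("log", "")
        if marker in text:
            messages.append(text.rstrip("\n"))
    return messages
```

Feeding it the lines of the compute-1 log with the UUID a416ead6-a17f-4bb9-9a96-3134b426b069 would return only the nova messages for that instance, in file order.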

Reproducibility
---------------
yes

System Configuration
--------------------
Multi-node system (remote)

Branch/Pull Time/Commit
-----------------------
BUILD_TYPE="Formal"
BUILD_ID="20190410T013000Z"

Timestamp/Logs
--------------
see inline logs above

Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack PL. Need team to determine if this is expected openstack behavior or if this is a bug that requires nova follow-up.

Gating decision TBD based on assessment

tags: added: stx.distro.openstack
Changed in starlingx:
assignee: nobody → Bruce Jones (brucej)
status: New → Incomplete
Bruce Jones (brucej) wrote :

StarlingX is now running against upstream OpenStack Nova master, so this issue is possibly a Nova issue.

Bruce Jones (brucej) wrote :

Wendy, can you please re-test and check to see if migration back to the original host is impacted by this issue? That would help us determine the Importance of this.

Changed in starlingx:
assignee: Bruce Jones (brucej) → Wendy Mitchell (wmitchellwr)
Artom Lifshitz (notartom) wrote :

I wasn't able to reproduce this on Nova master. I have vague memories of something like this being reported in older versions, but having been fixed since then, though I can't find hard evidence in the shape of bug reports and/or reviews. I'm going to leave this as Incomplete until we get more information, in the shape of:

- what version of Nova can this be reproduced for?
- more detailed reproducer steps - perhaps a specific storage configuration is needed.

Changed in nova:
status: New → Incomplete
Wendy Mitchell (wmitchellwr) wrote :

This failed on the hardware lab remote-storage system IP_20_27 in the following test cases, on the following loads.

FAIL test_cold_migrate_vm[remote-0-0-None-2-volume-confirm]
FAIL test_cold_migrate_vm[remote-1-0-None-1-volume-confirm]
FAIL test_cold_migrate_vm[remote-1-512-None-1-image-confirm]
FAIL test_cold_migrate_vm[remote-0-0-None-2-image_with_vol-confirm]

Lab: IP_20_27
Load: 20190410T013000Z
Job: STX_build_master_master
Build Server: starlingx_mirror
Node Config: 2+4+2
Software Version: 19.01

Lab: IP_20_27
Load: 20190427T013000Z
Job: STX_build_master_master
Build Server: starlingx_mirror
Node Config: 2+4+2
Software Version: 19.01

Wendy Mitchell (wmitchellwr) wrote :

Fails in the check to see that the source host no longer has instance files after the cold migration
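
The test harness itself is not shown in this report. A minimal sketch of that kind of check, run on the source host, might look like the following. The function name `find_instance_remnants` is hypothetical; the `_resize` suffix is taken from the directory listings earlier in this report.

```python
import os


def find_instance_remnants(instances_dir, instance_uuid):
    """Return paths under instances_dir that still belong to instance_uuid."""
    # The instance directory itself, plus the "_resize" copy that exists
    # while a cold migration is waiting for confirmation.
    candidates = [instance_uuid, instance_uuid + "_resize"]
    return [
        os.path.join(instances_dir, name)
        for name in candidates
        if os.path.exists(os.path.join(instances_dir, name))
    ]
```

An empty list from `find_instance_remnants("/var/lib/nova/instances", uuid)` on the source host after confirmation would correspond to the expected behavior described above.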

Hosts have the following labels assigned, including the remote-storage label:
 openstack-compute-node=enabled
 openvswitch=enabled
 sriov=enabled
 remote-storage=enabled

A recent load on a 2+3 lab was also retested. These tests continue to fail for the same reason.

BUILD_TYPE="Formal"
BUILD_ID="20190613T013000Z"

test_cold_migrate_vm[remote-0-0-None-2-volume-confirm]
test_cold_migrate_vm[remote-1-0-None-1-volume-confirm]
test_cold_migrate_vm[remote-1-512-None-1-image-confirm]
test_cold_migrate_vm[remote-0-0-None-2-image_with_vol-confirm]

Changed in nova:
status: Incomplete → Invalid
status: Invalid → Confirmed
Wendy Mitchell (wmitchellwr) wrote :

For example
# stat /var/lib/nova/instances/4a6f01a6-d9be-4fab-8524-da901343f39f
  File: ‘/var/lib/nova/instances/4a6f01a6-d9be-4fab-8524-da901343f39f’
  Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: fd00h/64768d Inode: 13762561 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-06-17 19:08:52.056799204 +0000
Modify: 2019-06-17 18:25:43.886821025 +0000
Change: 2019-06-17 18:25:43.886821025 +0000
 Birth: -

Chris Friesen (cbf123) wrote :

It's not explicitly specified above, but the fault scenario involves compute nodes using RBD storage as opposed to local storage.

Chris Friesen (cbf123) wrote :

I reproduced it with StarlingX. After the initial cold migration we're left with this on the original compute node:

compute-1:~$ ls -l /var/lib/nova/instances/
total 36
drwxr-xr-x 2 root root 4096 Jun 19 23:15 _base
-rw-r--r-- 1 root root 32 Jun 19 22:51 compute_nodes
drwxr-xr-x 2 root root 4096 Jun 19 23:19 fba5ba3d-c515-4b5f-aecd-8b9137dcddae
drwxr-xr-x 2 root root 4096 Jun 19 23:15 fba5ba3d-c515-4b5f-aecd-8b9137dcddae_resize
drwxr-xr-x 2 root root 4096 Jun 19 23:15 locks
drwx------ 2 root root 16384 Jun 19 15:08 lost+found

compute-1:/var/lib/nova/instances$ ls -l fba5ba3d-c515-4b5f-aecd-8b9137dcddae
total 0
compute-1:/var/lib/nova/instances$ ls -l fba5ba3d-c515-4b5f-aecd-8b9137dcddae_resize
total 96

After confirming the migration, the "_resize" directory was deleted.

Looking at the timestamps, it appears that something is re-creating the instance directory after it's been renamed.
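
The timestamp comparison can be sketched as below. This is only a heuristic under stated assumptions: a directory's mtime changes whenever entries are added or removed, so a newer mtime on the instance directory than on its `_resize` counterpart is consistent with re-creation but does not prove it. The function name `recreated_after_move` is hypothetical.

```python
import os


def recreated_after_move(instances_dir, instance_uuid):
    """Heuristic: True if <uuid> was modified later than <uuid>_resize.

    Both directories must exist; otherwise there is nothing to compare.
    """
    current = os.path.join(instances_dir, instance_uuid)
    moved = current + "_resize"
    if not (os.path.isdir(current) and os.path.isdir(moved)):
        return False
    return os.stat(current).st_mtime > os.stat(moved).st_mtime
```

In the listing above, the instance directory is stamped 23:19 and the `_resize` directory 23:15, which is the case where this sketch returns True.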

Chris Friesen (cbf123) wrote :

On each compute node, nova.conf has the following set under the libvirt section:

images_rbd_ceph_conf=/etc/ceph/ceph.conf
images_rbd_pool=ephemeral
images_type=rbd
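
Since the fault appears tied to this setting, a quick check of whether a given nova.conf selects RBD-backed images can be sketched as follows. Nova itself reads its configuration through oslo.config; this sketch uses Python's standard `configparser` as an approximation, and the function name `uses_rbd_images` is hypothetical.

```python
import configparser


def uses_rbd_images(nova_conf_text):
    """Return True if the [libvirt] section sets images_type to rbd."""
    # strict=False tolerates repeated options, which nova.conf may contain;
    # interpolation=None keeps '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    return parser.get("libvirt", "images_type", fallback="") == "rbd"
```

Passing in the text of nova.conf from each nova-compute container would show which hosts match the failing configuration.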

Ghada Khalil (gkhalil) on 2019-06-21
Changed in starlingx:
status: Incomplete → Confirmed
assignee: Wendy Mitchell (wmitchellwr) → Bruce Jones (brucej)
Ghada Khalil (gkhalil) wrote :

Setting status to "Confirmed" as per notes from Chris Friesen who is able to reproduce the issue in starlingx. I believe the next step is for the nova team to attempt to reproduce with the specific config above.

Changed in starlingx:
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

Assigning to Yong Hu to follow up since Bruce is out.

Changed in starlingx:
assignee: Bruce Jones (brucej) → yong hu (yhu6)
tags: added: stx.regression
Changed in starlingx:
assignee: yong hu (yhu6) → chen haochuan (martin1982)
yong hu (yhu6) wrote :

@Martin, which upstream LP was this issue duplicated to?
Or what is the analysis showing that this is a general Nova issue rather than an STX issue?

chen haochuan (martin1982) wrote :

Hi Chris

Have you migrated successfully?

My migration failed, with the following errors in the libvirtd log:
2019-07-16 03:07:55.933+0000: 186884: error : qemuMonitorIO:718 : internal error: End of file from qemu monitor
2019-07-16 03:12:15.290+0000: 186902: warning : qemuDomainObjTaint:7640 : Domain id=6 name='instance-00000009' uuid=c1585fd0-22f9-4b91-ba5e-451cd3a5a6ca is tainted: high-privileges
2019-07-16 03:12:15.667+0000: 186904: error : qemuDomainAgentAvailable:9272 : argument unsupported: QEMU guest agent is not configured

controller-1:/var/lib/nova/instances$ ls
_base c1585fd0-22f9-4b91-ba5e-451cd3a5a6ca compute_nodes locks lost+found
controller-1:/var/lib/nova/instances$

controller-0:~$ ls /var/lib/nova/instances/
_base c1585fd0-22f9-4b91-ba5e-451cd3a5a6ca_resize compute_nodes locks lost+found
controller-0:~$

Wendy Mitchell (wmitchellwr) wrote :

Still failures in regression
Lab: WCP_113_121
Load: 20190713T013000Z
AssertionError: Instance files found on previous host
FAIL 20190713 16:29:06 test_cold_migrate_vm[remote-0-0-None-2-volume-confirm]
FAIL 20190713 16:31:38 test_cold_migrate_vm[remote-1-0-None-1-volume-confirm]
FAIL 20190713 16:34:23 test_cold_migrate_vm[remote-1-512-None-1-image-confirm]

chen haochuan (martin1982) wrote :

My reproduction steps

1, on active controller
$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server create --image 9da41b71-075a-4f28-990c-6fcd570747cd --flavor 39e233dd-acd4-4dfa-bb85-e9f520792ba3 --nic net-id=external-net0 vm0

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server stop 7dd1d8ef-a473-42fd-be61-e2f859f0656a

2, copy the folder /var/lib/nova/instances/<instance uuid>/ from compute-0 to the same location on compute-1

3, update database
$ kubectl exec -n openstack -it mariadb-server-0 -- grep password /etc/mysql/admin_user.cnf
to get password
$ kubectl exec -n openstack -it mariadb-server-0 bash
in mariadb container
$ mysql -u root -p <last got mariadb password>
    MariaDB [(none)]> use nova;
    MariaDB [nova]> update instances set host='compute-1', node='compute-1' where uuid='7dd1d8ef-a473-42fd-be61-e2f859f0656a';
    MariaDB [(none)]> quit;

4, launch vm
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server start 7dd1d8ef-a473-42fd-be61-e2f859f0656a

After the above steps, check the nova-compute log
$ kubectl -n openstack logs nova-compute-compute-0-75ea0372-c4l2n -c nova-compute

The nova-compute log shows:
2019-07-19 08:02:50.083 2239894 WARNING nova.compute.manager [req-3df620d7-e138-4259-a98f-687e6c71291d - - - - -] While synchronizing instance power states, found 0 instances in the database and 1 instances on the hypervisor.
2019-07-19 08:03:11.962 2239894 WARNING nova.compute.resource_tracker [req-3df620d7-e138-4259-a98f-687e6c71291d - - - - -] Instance 7dd1d8ef-a473-42fd-be61-e2f859f0656a has been moved to another host compute-1(compute-1). There are allocations remaining against the source host that might need to be removed: {u'resources': {u'MEMORY_MB': 512, u'VCPU': 1, u'DISK_GB': 1}}.

First, please check whether my reproduction steps are correct. If they are, I think leaving instance remnants behind is expected nova behavior.

Nova-compute can already detect the instance files in /var/lib/nova/<instance UUID>, so it reports "1 instances on the hypervisor", and it finds no record in the database, so it reports "0 instances in the database". Looking at nova/compute/manager.py, in this case nova-compute only emits a WARNING log and does nothing else, so I think it is expected behavior.

yong hu (yhu6) on 2019-07-19
Changed in starlingx:
assignee: chen haochuan (martin1982) → Shuquan Huang (shuquan)
Chris Friesen (cbf123) wrote :

Chen, can you explain why you manually moved the instance rather than using the nova cold migration? This is causing nova to take very different code paths than would normally be taken on a successful migration.

It is *not* correct behaviour for nova to leave behind files on the source node after a successful cold migration.

My working assumption is that it's something down in the guts of the nova code that behaves differently when configured with "images_type=rbd" that results in these remnants being left behind.

ya.wang (ya.wang) on 2019-07-22
Changed in starlingx:
assignee: Shuquan Huang (shuquan) → ya.wang (ya.wang)
hutianhao27 (hutianhao) on 2019-07-22
Changed in starlingx:
assignee: ya.wang (ya.wang) → hutianhao27 (hutianhao)
yong hu (yhu6) wrote :

change the priority to low.
If upstream fixes this in future, stx will get it as a part of rebase.

Changed in starlingx:
importance: Medium → Low
hutianhao27 (hutianhao) wrote :

I tried to reproduce this problem, but no files were left after cold migration on any attempt, using either Horizon or the CLI.

Yang Liu (yliu12) wrote :

Still seeing this issue in cold migration tests with remote storage using master load 20190727T013000Z, and this has been 100% reproducible.

@hutianhao27
Could you please confirm nova.conf in nova-compute containers has images_type=rbd?
Also, Horizon and the OpenStack CLI are not the right tools for observing this issue. As mentioned in the test steps, you need to log in to the original compute node and look at the /var/lib/nova/instances/ directory to see the remnants.

hutianhao27 (hutianhao) wrote :

I have reproduced this issue, and there are files left after cold migration. I am still trying to find out why this happens.

hutianhao27 (hutianhao) wrote :

There are files left after cold migration every time. But the directory left behind is empty and has no effect on cold migration. So I think maybe we can delete this empty directory, and I'd like to know if this is appropriate.
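
For illustration only, removing the leftover directory strictly when it is empty can be sketched as below. This is not the content of any proposed patch; the function name `remove_if_empty` is hypothetical. It relies on `os.rmdir`, which refuses to remove a directory that still contains entries.

```python
import errno
import os


def remove_if_empty(path):
    """Remove path if it is an empty directory; return True if removed."""
    try:
        os.rmdir(path)  # fails rather than deleting any contents
        return True
    except OSError as exc:
        # Non-empty (ENOTEMPTY, or EEXIST on some platforms) or already gone.
        if exc.errno in (errno.ENOTEMPTY, errno.EEXIST, errno.ENOENT):
            return False
        raise
```

Because nothing is deleted recursively, a directory that unexpectedly still holds instance files is left untouched.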

hutianhao27 (hutianhao) wrote :

If the directory left behind is empty and has no effect on cold migration, could we consider this not to be a bug?

yong hu (yhu6) wrote :

@yang, could you help check if the directory is also empty?

yong hu (yhu6) wrote :

@tianhao, can you double-check that the directory is empty, without any files inside?
If only the empty directory is left, maybe the nova code simply doesn't delete the directory under /var/lib/nova/instances/, either because of a permission issue or because it forgets to delete it after cold migration.

hutianhao27 (hutianhao) wrote :

Ok, I will confirm the directory again.

Fix proposed to branch: master
Review: https://review.opendev.org/681652

Changed in nova:
assignee: nobody → hutianhao27 (hutianhao)
status: Confirmed → In Progress

Fix proposed to branch: master
Review: https://review.opendev.org/682523

Change abandoned by hutianhao27 (hu.tianhao@99cloud.net) on branch: master
Review: https://review.opendev.org/681652
Reason: revert previous patch

hutianhao27 (hutianhao) wrote :

I have a patch to solve this problem, but it needs someone to review it. (https://review.opendev.org/#/c/682523/)

Changed in nova:
assignee: hutianhao27 (hutianhao) → Lee Yarwood (lyarwood)