nova-compute should stop handling virt lifecycle events when it's shutting down
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | Medium | Matt Riedemann | |
| Juno | Fix Released | Medium | Vladik Romanovsky | |
| Kilo | Fix Released | Medium | Matt Riedemann | |
Bug Description
This is a follow on to bug 1293480 and related to bug 1408176 and bug 1443186.
There can be a race when rebooting a compute host: libvirt shuts down the guest VMs and sends STOPPED lifecycle events up to nova-compute, which then tries to stop them via the stop API. This sometimes works and sometimes doesn't; the compute service can go down leaving an instance with a vm_state of ACTIVE and a task_state of powering-off, which isn't resolved on host reboot.
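The stuck state can be sketched roughly as follows. This is a minimal illustration, not nova's actual code: the names `Instance`, `handle_lifecycle_event`, and `service_is_up` are hypothetical simplifications of what the compute manager and stop API do.

```python
# Sketch of the race: a STOPPED lifecycle event arrives while the compute
# service is going down, leaving the instance stuck. All names here are
# illustrative, not nova's real API.

EVENT_LIFECYCLE_STOPPED = "stopped"


class Instance:
    def __init__(self):
        self.vm_state = "active"
        self.task_state = None


def handle_lifecycle_event(instance, event, service_is_up):
    if event == EVENT_LIFECYCLE_STOPPED:
        # The stop API sets task_state first, then does the work.
        instance.task_state = "powering-off"
        if not service_is_up:
            # The service dies mid-operation, so the instance is left
            # ACTIVE with task_state='powering-off' in the database.
            return
        instance.vm_state = "stopped"
        instance.task_state = None


inst = Instance()
handle_lifecycle_event(inst, EVENT_LIFECYCLE_STOPPED, service_is_up=False)
print(inst.vm_state, inst.task_state)  # active powering-off
```

On host reboot nothing resets this half-finished state, which is the stuck combination described above.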
Sometimes the stop API completes and the instance is stopped with power_state=4 (shutdown) in the nova database. When the host comes back up and libvirt restarts, it starts up the guest VMs, which sends the STARTED lifecycle event, and nova handles that. But because the vm_state in the nova database is STOPPED while the power_state from the hypervisor is 1 (running), nova thinks the guest started up unexpectedly and stops it:
http://
So nova shuts the running guest down.
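A rough sketch of the sync decision that produces this behavior. This is a hypothetical simplification, not nova's `_sync_instance_power_state` itself; the point is that the DB vm_state wins over the hypervisor's report.

```python
# Sketch: nova trusts its DB vm_state over the hypervisor here. If the DB
# says STOPPED but the hypervisor reports the guest RUNNING, nova decides
# the guest "started unexpectedly" and powers it off again.

RUNNING = 1    # power_state.RUNNING
SHUTDOWN = 4   # power_state.SHUTDOWN


def sync_action(db_vm_state, hypervisor_power_state):
    if db_vm_state == "stopped" and hypervisor_power_state == RUNNING:
        return "call stop API"   # the autostarted guest gets shut down
    if db_vm_state == "active" and hypervisor_power_state == SHUTDOWN:
        return "mark shutdown"
    return "no-op"


# After host reboot: DB says stopped, libvirt autostarted the guest.
print(sync_action("stopped", RUNNING))  # call stop API
```

This is exactly the opposite of treating the hypervisor as the authority, which is the conflict with power_state.py quoted below.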
Actually the block in:
http://
conflicts with the statement in power_state.py:
http://
"The hypervisor is always considered the authority on the status
of a particular VM, and the power_state in the DB should be viewed as a
snapshot of the VMs's state in the (recent) past."
Anyway, that's a different issue. The point is that when nova-compute is shutting down it should stop accepting lifecycle events from the hypervisor (virt driver code), since it can't reliably act on them anyway; any sync-up that needs to happen can be left to init_host() in the compute manager.
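One way to sketch that fix, under the assumption that the compute manager has a shutdown hook (the class and method names here are illustrative, not the actual nova change):

```python
# Sketch of the proposed fix: once service shutdown begins, drop incoming
# virt lifecycle events and leave state reconciliation to init_host().
import threading


class ComputeManager:
    """Illustrative stand-in for nova's compute manager."""

    def __init__(self):
        self._events_enabled = True
        self._lock = threading.Lock()

    def cleanup_host(self):
        # Called when the service is stopping: stop accepting events.
        with self._lock:
            self._events_enabled = False

    def handle_lifecycle_event(self, event):
        with self._lock:
            if not self._events_enabled:
                return "dropped"
        return "handled:%s" % event


mgr = ComputeManager()
print(mgr.handle_lifecycle_event("stopped"))  # handled:stopped
mgr.cleanup_host()
print(mgr.handle_lifecycle_event("stopped"))  # dropped
```

The lock only guards the flag; the key design point is that the flag is flipped before the virt driver teardown starts, so no event received mid-shutdown can race with a half-stopped service.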
Changed in nova: | |
status: | New → Triaged |
importance: | Undecided → Medium |
assignee: | nobody → Matt Riedemann (mriedem) |
tags: | added: kilo-backport-potential |
tags: | added: kilo-rc-potential |
tags: | removed: kilo-backport-potential kilo-rc-potential |
Changed in nova: | |
milestone: | none → liberty-1 |
status: | Fix Committed → Fix Released |
Changed in nova: | |
milestone: | liberty-1 → 12.0.0 |
Changed in nova: | |
assignee: | Marian Horban (mhorban) → nobody |
status: | In Progress → Confirmed |
Changed in nova: | |
status: | Confirmed → Fix Released |
assignee: | nobody → Matt Riedemann (mriedem) |
Attaching some logs sent from someone at IBM who recreated this on Juno with a debug patch (https://review.openstack.org/#/c/169782/) for logging:
Hi, I finished another round of testing. This time all the VMs were in SHUTOFF state after hypervisor reboot (kvm_reboot.2.log.zip). Here are the key time points in the log file:
13:41:47 Triggered hypervisor reboot, "Emitting event" arrived
13:45:33 Nova compute server started after hypervisor started up
13:46:25 Finished VM state sync up
For more details please check the attached log file: compute_
Thanks!
================= on kvm001 node Before KVM Reboot ===================
[root@hkg02kvm001ccz023 ~]# date
Wed Apr 15 13:39:32 UTC 2015
[root@hkg02kvm001ccz023 ~]# virsh list
 Id    Name                 State
----------------------------------------------------
 3     instance-000000a2    running
 4     instance-00000058    running
================= on controller node Before KVM Reboot ===================
[root@hkg02ops001ccz023 ~]# date
Wed Apr 15 13:39:52 UTC 2015
[root@hkg02ops001ccz023 ~]# nova list
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| ID                                   | Name  | Status  | Task State | Power State | Networks                              |
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| e53dcdcd-1e19-4a89-8648-1373b4e29e6a | zy001 | ACTIVE  | -          | Running     | Shared-Custom-Network1=192.168.100.18 |
| 3bcdec02-bb42-4eb7-bfca-eca1686f735b | zy002 | SHUTOFF | -          | Shutdown    | Shared-Custom-Network1=192.168.100.19 |
| e0638150-6ef0-4e98-884d-fb4cfda140a3 | zy004 | SHUTOFF | -          | Shutdown    | Shared-Custom-Network1=192.168.100.21 |
| 793cd8ba-fcb2-4e42-8b83-fcb8bdf519e6 | zy005 | ACTIVE  | -          | Running     | Shared-Custom-Network1=192.168.100.25 |
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
================= on kvm001 After KVM Reboot ===================
[root@hkg02kvm001ccz023 ~]# date
Wed Apr 15 13:47:46 UTC 2015
[root@hkg02kvm001ccz023 ~]# virsh list
 Id    Name                 State
----------------------------------------------------
[root@hkg02kvm001ccz023 ~]# nova list
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| ID                                   | Name  | Status  | Task State | Power State | Networks                              |
+--------------------------------------+-------+---------+------------+-------------+---------------------------------------+
| e53dcdcd-1e19-4a89-8648-1373b4e29e6a | zy001 | SHUTOFF | -          | Shutdown    | Shared-Custom-Network1=192.168.100.18 |
| 3bcdec02-bb42-4eb7-bfca-eca1686f735b | zy002 | SHUTOFF | -          | Shutdown    | Shared-Custom-Network1=192.168.100.19 |
...