VMware VCDriver: A node crash, vSphere HA and badly timed _sync_power_states() will shut instances down

Bug #1403856 reported by Toni Ylenius
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Release: Icehouse; however, the code in Juno seems to be the same.

When a VMware node crashes, vSphere HA restarts its instances on another node.

If _sync_power_states() runs just after the node crash, while the VMs are still down, Nova updates the database and logs
"Instance shutdown by itself. Calling the stop API."

On the next _sync_power_states() run, Nova notices that the power state has changed, shuts the instances down, and logs
"Instance is not stopped. Calling the stop API." This happens because vSphere HA has restarted the instances in the meantime.
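
The sequence above can be illustrated with a toy model. This is only a sketch of the race as described in this report, not Nova's actual code: the function name, the state constants, and the return shape are all assumptions made for illustration. Only the two log messages are taken from the report.

```python
# Toy model of the race; none of these names are Nova's real API.
RUNNING, SHUTDOWN = "running", "shutdown"   # hypervisor power states (illustrative)
ACTIVE, STOPPED = "active", "stopped"       # database vm_state values (illustrative)


def sync_power_state(db_vm_state, hv_power_state):
    """Return (new_db_vm_state, action, log_message) for one sync pass."""
    if db_vm_state == ACTIVE and hv_power_state == SHUTDOWN:
        # Pass 1: the VM looks down, so the database is switched to stopped.
        return STOPPED, "stop", "Instance shutdown by itself. Calling the stop API."
    if db_vm_state == STOPPED and hv_power_state == RUNNING:
        # Pass 2: the database says stopped but the VM runs, so it is forced off.
        return STOPPED, "stop", "Instance is not stopped. Calling the stop API."
    # States agree (or no rule applies): nothing to do.
    return db_vm_state, None, None


# Pass 1 runs right after the node crash, before HA has restarted the VM.
state, action1, msg1 = sync_power_state(ACTIVE, SHUTDOWN)
# Pass 2 runs after vSphere HA has powered the VM on elsewhere.
state, action2, msg2 = sync_power_state(state, RUNNING)
```

In this model the second pass issues a stop against a VM that HA has just brought back, which is the behaviour reported here.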

As I understand it, to fix this we need to either
1. change the logic (unfortunately I don't have any ideas), or
2. add a config option that controls whether Nova force-stops an instance that is stopped from the database's point of view.
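
Option 2 could look roughly like the following. This is a hedged sketch only: `force_stop` stands in for a hypothetical config option that does not exist in Nova, and `handle_stopped_but_running` and `stop_api` are illustrative names, not Nova functions.

```python
import logging

LOG = logging.getLogger(__name__)


def handle_stopped_but_running(instance_uuid, force_stop, stop_api):
    """Decide what to do when the DB says stopped but the hypervisor says running.

    force_stop stands in for the hypothetical config option; stop_api is any
    callable that powers the instance off.
    """
    if force_stop:
        # Current behaviour described in the report: force the instance off.
        LOG.warning("Instance %s is not stopped. Calling the stop API.",
                    instance_uuid)
        stop_api(instance_uuid)
        return "stopped"
    # Alternative behaviour: record the mismatch and leave the VM alone.
    LOG.warning("Instance %s is running but recorded as stopped; "
                "not forcing a stop.", instance_uuid)
    return "logged"
```

With `force_stop=False`, an instance restarted by vSphere HA would keep running and the discrepancy would only appear in the log.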

Tags: vmware
description: updated
Joe Gordon (jogo)
tags: added: vmware
Changed in nova:
status: New → Confirmed
Tracy Jones (tjones-i) wrote :

This issue can be avoided by setting sync_power_state_interval=-1 in nova.conf.

But that stops the periodic task completely and you miss some of the goodness of this task - which is syncing the database state with the hypervisor state.

In vSphere, if HA is enabled on a cluster, we want to just log the discrepancy between power state and vm state rather than power down the instances, because this is a transient state while the instances are being migrated to the new host. Unfortunately, there is no way to know this at the compute/manager level. Another solution is yet ANOTHER config variable that controls whether Nova only logs the warning or also does the power-off.

Changed in nova:
assignee: nobody → Gary Kotton (garyk)
importance: Undecided → High
Changed in nova:
assignee: Gary Kotton (garyk) → nobody
Giridhar Jayavelu (gjayavelu) wrote :

When a node crashes, the VM's connection state will be "disconnected" and the VM state will still be poweredOn.
When HA restarts the VM on a different host, it will be poweredOff for a transient time.
One possible option is to check whether HA is enabled on the cluster and, if the VM state is poweredOff, retry to see whether the state changes to poweredOn.
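
A minimal sketch of that retry idea follows. It assumes the caller can supply a function that reads the current power state and a flag saying whether HA is enabled on the cluster. The names `settled_power_state`, `get_power_state`, `ha_enabled`, `retries`, and `delay` are hypothetical, and the retry count and delay are arbitrary.

```python
import time


def settled_power_state(get_power_state, ha_enabled, retries=3, delay=5.0,
                        sleep=time.sleep):
    """Re-read the power state a few times before trusting poweredOff.

    get_power_state is any zero-argument callable returning the current
    state string; sleep is injectable so the wait can be skipped in tests.
    """
    state = get_power_state()
    if not ha_enabled:
        # Without HA nothing will restart the VM, so report the state as-is.
        return state
    for _ in range(retries):
        if state != "poweredOff":
            break
        # Possibly a transient poweredOff during an HA restart: wait, re-check.
        sleep(delay)
        state = get_power_state()
    return state
```

If the VM is still poweredOff after all retries, the function returns poweredOff, so a genuinely stopped instance is still reported as such.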

Giridhar Jayavelu (gjayavelu) wrote :

We can detect the sub-state of the VM using the 'recentTask' property on the VM.

https://www.vmware.com/support/developer/vc-sdk/visdk2xpubs/ReferenceGuide/vim.VirtualMachine.PowerState.html

"""clients interested in monitoring the status of a virtual machine should typically track the activeTask data object in addition to the powerState object."""

If the recent task id is "vim.VirtualMachine.reset" and status="running", then we can probably wait before returning the state in the driver, or call _poll_rebooting_instances() in manager.py.
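
The check described above might be sketched as follows. This assumes the driver has already fetched the VM's recent tasks and flattened each one into a plain dict with "id" and "status" keys, mirroring the wording of this comment. The dict layout and the name `should_defer_power_state` are assumptions, not the vSphere SDK's actual object model.

```python
def should_defer_power_state(recent_tasks,
                             transitional_ids=("vim.VirtualMachine.reset",)):
    """Return True if any recent task suggests the VM is mid-transition.

    recent_tasks is a list of dicts such as
    {"id": "vim.VirtualMachine.reset", "status": "running"}.
    """
    return any(
        task.get("id") in transitional_ids and task.get("status") == "running"
        for task in recent_tasks
    )
```

A caller could use a True result as the signal to wait and re-read the power state instead of reporting a possibly transient poweredOff.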

Changed in nova:
assignee: nobody → Giridhar Jayavelu (gjayavelu)
Changed in nova:
assignee: Giridhar Jayavelu (gjayavelu) → nobody
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: High → Undecided
status: Confirmed → Expired