OpenStack Compute (nova)

VMware VCDriver: A node crash, vSphere HA and badly timed _sync_power_states() will shut instances down

Bug #1403856 reported by Toni Ylenius on 2014-12-18

This bug affects 3 people

Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Expired  Undecided   Unassigned

Bug Description

Release: Icehouse; however, the code in Juno appears to be the same.

When a VMware node crashes, vSphere HA restarts its instances on another node.

If _sync_power_states() is triggered just after the node crash, the VMs are still down, so Nova updates the database and logs
"Instance shutdown by itself. Calling the stop API."

On the next _sync_power_states() run, Nova notices that the power state has changed, shuts the instances down, and logs
"Instance is not stopped. Calling the stop API." This happens because vSphere HA has restarted the instances in the meantime.

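The sequence described above can be reproduced with a small simulation. This is a simplified model under stated assumptions, not Nova's actual _sync_power_states() code: the Instance class, the sync_once() function, and the state strings are hypothetical stand-ins, and only the two log messages are taken from the report.

```python
from dataclasses import dataclass, field

RUNNING, SHUTDOWN = "running", "shutdown"
ACTIVE, STOPPED = "active", "stopped"


@dataclass
class Instance:
    vm_state: str = ACTIVE      # state recorded in the Nova database
    hv_power: str = RUNNING     # power state reported by the hypervisor
    log: list = field(default_factory=list)

    def stop(self):
        # Stand-in for the stop API: power off and record the stopped state.
        self.hv_power = SHUTDOWN
        self.vm_state = STOPPED


def sync_once(inst):
    """One simplified pass of the periodic power-state sync (hypothetical)."""
    if inst.vm_state == ACTIVE and inst.hv_power == SHUTDOWN:
        inst.log.append("Instance shutdown by itself. Calling the stop API.")
        inst.stop()
    elif inst.vm_state == STOPPED and inst.hv_power == RUNNING:
        inst.log.append("Instance is not stopped. Calling the stop API.")
        inst.stop()


inst = Instance()
inst.hv_power = SHUTDOWN   # node crashes; the VM is briefly down
sync_once(inst)            # first pass: the database now says "stopped"
inst.hv_power = RUNNING    # vSphere HA restarts the VM on another node
sync_once(inst)            # second pass: the restarted VM is powered off again
print(inst.log, inst.hv_power)
```

In this model the instance ends up powered off even though HA recovered it, which matches the behaviour reported here.
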
As I understand it, fixing this requires either
1. changing the logic (unfortunately I have no ideas here), or
2. adding a config option that controls whether Nova force-stops an instance that is stopped from the database's point of view.
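
Option 2 could look roughly like the sketch below. The option name force_stop_on_power_mismatch, the SyncConfig class, and handle_mismatch() are all hypothetical and do not exist in Nova; the sketch only illustrates gating the force stop behind a setting.

```python
from dataclasses import dataclass


@dataclass
class SyncConfig:
    # Hypothetical option; it is not a real Nova setting.
    force_stop_on_power_mismatch: bool = True


def handle_mismatch(db_vm_state, hv_power_state, conf):
    """Pick an action when the database and the hypervisor disagree."""
    if db_vm_state == "stopped" and hv_power_state == "running":
        if conf.force_stop_on_power_mismatch:
            return "stop"       # current behaviour: call the stop API
        return "log_only"       # proposed behaviour: only log the discrepancy
    return "none"


print(handle_mismatch("stopped", "running", SyncConfig(False)))
```

With the option disabled, an instance restarted by HA would be reported but left running.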

Tags:

Toni Ylenius (toni-ylenius) on 2014-12-18

description:	updated

Joe Gordon (jogo) on 2015-01-22

tags:	added: vmware
Changed in nova:
status:	New → Confirmed

Tracy Jones (tjones-i) wrote on 2015-06-09:

This issue can be avoided by setting sync_power_state_interval=-1 in nova.conf.

But that stops the periodic task completely, and you lose its main benefit: syncing the database state with the hypervisor state.

In vSphere, if HA is enabled on a cluster, we want to just log the discrepancy between power state and VM state, but not actually power down the instances, because it is a transient state while the instances are being migrated to the new host. Unfortunately, there is no way to know this at the compute/manager level. Another solution is yet ANOTHER config variable to control whether the task logs a warning or performs the power off.

Changed in nova:
assignee:	nobody → Gary Kotton (garyk)
importance:	Undecided → High

Davanum Srinivas (DIMS) (dims-v) on 2016-03-04

Changed in nova:
assignee:	Gary Kotton (garyk) → nobody

Giridhar Jayavelu (gjayavelu) wrote on 2016-03-09:

When a node crashes, the VM's connection state will be "disconnected" and the VM state will still be poweredOn.
When HA restarts the VM on a different host, it will be poweredOff for a transient period.
One possible option is to check whether HA is enabled on the cluster and, if the VM state is poweredOff, retry to see whether the state has changed to poweredOn.
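
A minimal sketch of that retry idea follows. The function settled_power_state(), the get_power_state callable, and the retry and delay values are hypothetical; how the driver would actually learn that HA is enabled is not shown.

```python
import time


def settled_power_state(get_power_state, ha_enabled, retries=3, delay=1.0,
                        sleep=time.sleep):
    """Re-check a poweredOff reading when HA may be restarting the VM.

    get_power_state is a hypothetical callable returning "poweredOn" or
    "poweredOff"; ha_enabled says whether HA is enabled on the cluster.
    """
    state = get_power_state()
    if not ha_enabled:
        return state            # without HA, trust the first reading
    attempts = 0
    while state == "poweredOff" and attempts < retries:
        sleep(delay)            # give HA time to finish the restart
        state = get_power_state()
        attempts += 1
    return state


# Usage with canned readings instead of a real hypervisor query.
readings = iter(["poweredOff", "poweredOff", "poweredOn"])
print(settled_power_state(lambda: next(readings), ha_enabled=True,
                          sleep=lambda seconds: None))
```

A VM that stays poweredOff through all retries is still reported as poweredOff, so a genuine shutdown is not hidden, only delayed.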

Giridhar Jayavelu (gjayavelu) wrote on 2016-03-10:

We can detect the sub-state of the VM using the 'recentTask' property on the VM.

https://www.vmware.com/support/developer/vc-sdk/visdk2xpubs/ReferenceGuide/vim.VirtualMachine.PowerState.html

"""clients interested in monitoring the status of a virtual machine should typically track the activeTask data object in addition to the powerState object."""

If the recent task ID is "vim.VirtualMachine.reset" and status="running", then we can probably wait before returning the state in the driver, or call _poll_rebooting_instances() in manager.py.
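
The check suggested here might be sketched as follows. TaskInfo is a hypothetical, simplified stand-in for whatever the driver would read from the VM's recent tasks; it is not the vSphere SDK type, and should_defer_sync() is an illustrative name only.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    # Hypothetical, simplified record of one recent task on a VM.
    task_id: str
    status: str


def should_defer_sync(recent_tasks):
    """True if a reset task is still running, so the power state is transient."""
    return any(
        task.task_id == "vim.VirtualMachine.reset" and task.status == "running"
        for task in recent_tasks
    )


tasks = [TaskInfo("vim.VirtualMachine.reset", "running")]
print(should_defer_sync(tasks))
```

When this returns True, the driver could wait before reporting the power state rather than returning the transient value.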

Giridhar Jayavelu (gjayavelu) on 2016-03-15

Changed in nova:
assignee:	nobody → Giridhar Jayavelu (gjayavelu)

Giridhar Jayavelu (gjayavelu) on 2016-04-18

Changed in nova:
assignee:	Giridhar Jayavelu (gjayavelu) → nobody

Markus Zoeller (markus_z) (mzoeller) wrote on 2016-07-05: Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which lead to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance:	High → Undecided
status:	Confirmed → Expired
