automatic operations can lead to nodes entering maintenance mode

Bug #1326279 reported by Robert Collins
This bug affects 1 person
Affects: Ironic
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

e.g. I did nova boot and my node went into maintenance mode. However, maintenance mode is an operator facility that users of Nova can't access, so when this happens to someone using the Nova API their node gets stuck and they can't fix it. E.g. imagine 'nova reboot INSTANCE' -> instance wedged.

Perhaps this is by design, but it seems most unfortunate to me.

ironic node-show 3f9df53e-5a76-47b3-a1c4-f596822bb43d
+------------------------+-----------------------------------------------------------------------+
| Property | Value |
+------------------------+-----------------------------------------------------------------------+
| chassis_uuid | None |
| console_enabled | False |
| created_at | 2014-06-04T04:18:43+00:00 |
| driver | pxe_ipmitool |
| driver_info | {u'pxe_deploy_ramdisk': u'3e7b3397-77fa-4d06-99bc-f02507b79e75', |
| | u'pxe_ramdisk': u'481eb2db-caf3-41df-912c-81f905c68689', |
| | u'pxe_image_source': u'd8089273-291b-406b-bad7-e17555b64765', |
| | u'pxe_root_gb': u'10', u'pxe_ephemeral_format': u'ext4', |
| | u'ipmi_username': u'Administrator', u'ipmi_address': u'x.x.x.x', |
| | u'pxe_kernel': u'eabb3e17-a9b1-420a-9bad-090b1f50df71', |
| | u'ipmi_password': u'xxxx', u'pxe_deploy_key': |
| | u'S15XQHTGQHKMDB5XA4RZZS4UHPA2MDJW', u'pxe_deploy_kernel': |
| | u'1d19c419-73b4-4e6b-90f5-b084e1c17f5a', u'pxe_swap_mb': u'0', |
| | u'pxe_ephemeral_gb': u'1590'} |
| extra | {} |
| instance_uuid | 9df33a91-4be3-429c-a712-35cd60a1e101 |
| last_error | During sync_power_state, max retries exceeded for node 3f9df53e- |
| | 5a76-47b3-a1c4-f596822bb43d, node state None does not match expected |
| | state 'power on'. Updating DB state to 'None' Switching node to |
| | maintenance mode. |
| maintenance | True |
| power_state | None |
| properties | {u'memory_mb': u'98304', u'cpu_arch': u'amd64', u'local_gb': u'1600', |
| | u'cpus': u'24'} |
| provision_state | deploy failed |
| reservation | None |
| target_power_state | None |
| target_provision_state | None |
| updated_at | 2014-06-04T07:11:40+00:00 |
| uuid | 3f9df53e-5a76-47b3-a1c4-f596822bb43d |
+------------------------+-----------------------------------------------------------------------+

Revision history for this message
Dmitry Tantsur (divius) wrote :

Well, what caused maintenance mode in your case? Automatic maintenance mode happens when the conductor is unable to access the node: network down, wrong SSH driver configuration, node disappeared, etc. Of these three, only the first is transient and might be recovered without operator intervention.

What exactly do you suggest we improve?

Changed in ironic:
status: New → Incomplete
Revision history for this message
Robert Collins (lifeless) wrote :

Oh, adding insult to injury, this failure mode leaves the instance stuck in deleting:
| 9df33a91-4be3-429c-a712-35cd60a1e101 | hw-test-9df33a91-4be3-429c-a712-35cd60a1e101 | ERROR | deleting | NOSTATE | ctlplane=10.10.16.147 |

Revision history for this message
aeva black (tenbrae) wrote :

Based on the provided status output, this error is not the result of a Nova operation --

| last_error | During sync_power_state, max retries exceeded for node

The Nova operation happened to coincide with this node becoming inaccessible by the IPMITool Power Driver, and during a periodic sync_power_state poll, after the conductor failed to determine the node's power state $max_retries times consecutively, Ironic removed that node from service. Any instance state is preserved by this action, so that Ironic can attempt to resume the prior operation once the operator restores connectivity to the node.
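The flow described above can be sketched as a simplified retry counter around the periodic power-state poll. This is a minimal illustration, not Ironic's actual code: the names `Node`, `sync_power_state`, and `read_power_state` are hypothetical, and `MAX_RETRIES` stands in for the conductor's max-retries setting.

```python
# Hypothetical sketch of the periodic sync_power_state behavior described
# above. After $max_retries consecutive failures to confirm the expected
# power state, the node is switched to maintenance mode.

MAX_RETRIES = 3  # illustrative stand-in for the conductor's retry limit


class Node:
    def __init__(self, uuid):
        self.uuid = uuid
        self.maintenance = False
        self.power_state = 'power on'   # expected state recorded in the DB
        self.last_error = None
        self.sync_failures = 0


def sync_power_state(node, read_power_state):
    """One pass of the periodic poll; read_power_state queries the BMC."""
    actual = read_power_state(node)     # e.g. None if the BMC is unreachable
    if actual == node.power_state:
        node.sync_failures = 0          # healthy: reset the failure counter
        return
    node.sync_failures += 1
    if node.sync_failures > MAX_RETRIES:
        # Repeated mismatches: record the observed state, take the node
        # out of service, and leave an error for the operator.
        node.power_state = actual
        node.maintenance = True
        node.last_error = (
            "During sync_power_state, max retries exceeded for node %s"
            % node.uuid)
```

Note that any instance state on the node is untouched by this sketch, mirroring the behavior described above: only the maintenance flag, power state, and last_error change.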

Alternatively, the operator may delete the node, which should remove the "error:deleting" instance from Nova as well.

The underlying failure is not presented here -- perhaps the networking was inaccessible, or the BMC crashed, I can't tell.

If this happens during a "nova boot", the user of nova should re-issue their request, and the nova scheduler will find another (non-maintenance) node to deploy it to. The purpose of automatically moving a node to maintenance mode under certain failure conditions is to prevent further Nova failures which would otherwise occur when attempting to deploy instances to a node that is physically not manageable by Ironic.

Revision history for this message
Robert Collins (lifeless) wrote :

So, there's definitely an issue though with the nova instance being undeletable - that prevents the user freeing up quota space. I accept that the node power driver failure wasn't user initiated, but I worry about e.g. that happening to a deployed instance. Seems like transient failures shouldn't make nodes inoperable // trigger admin intervention.

e.g. I think I'm saying there is a design bug IMO - maintenance mode is an admin tool; 'Ironic cannot talk to a BMC' is a transient state internal to Ironic and not at all the same as maintenance mode.

Revision history for this message
Dmitry Tantsur (divius) wrote :

Ok, I see two potential issues here: 1. transient faults move nodes to maintenance mode (this can be solved in many ways, including retrying after some time); 2. there are conditions under which a nova instance can't be deleted (I've encountered this myself, though I'm not sure whether it's a Nova or an Ironic bug).
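The "retrying after some time" idea for issue 1 could look something like the helper below. This is only a sketch of the suggestion, not anything in Ironic: `check_with_backoff` and `probe` are hypothetical names, and the backoff schedule is arbitrary.

```python
# Hypothetical retry-with-backoff wrapper: give a transient BMC failure a
# chance to recover before declaring the node unreachable.
import time


def check_with_backoff(probe, attempts=3, base_delay=0.0):
    """Return True if probe() succeeds within `attempts` tries.

    probe is a zero-argument callable returning a truthy value on success
    (e.g. a BMC power-state query). Delay doubles between tries.
    """
    for i in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * (2 ** i))  # exponential backoff between tries
    return False
```

Only after this returns False would the conductor treat the failure as non-transient and involve an operator.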

If you agree, could you please separate this bug into two more specific ones?

Revision history for this message
Robert Collins (lifeless) wrote :

@Dmitry I'll keep this one for the automated entry into maintenance mode, and file a new one about sync issues between nova and Ironic.

@Devananda - I totally agree with Ironic not scheduling stuff onto hardware that it cannot currently talk to, but maintenance mode as something that requires admin intervention to fix, should also require admin choice to enter. E.g. we should separate out 'I am doing maintenance' from 'Ironic cannot talk to the BMC // the deploy disk did not check in in time // ...'
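The separation proposed above could be modeled as two distinct fields on the node: an operator-set maintenance flag and an automatically managed fault indicator. This is a hypothetical illustration of the proposal, not Ironic's data model; `NodeState`, `fault`, and `schedulable` are invented names.

```python
# Hypothetical model separating "operator is doing maintenance" from
# "Ironic cannot currently talk to the BMC / deploy ramdisk / etc."
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    maintenance: bool = False    # set and cleared only by an operator
    fault: Optional[str] = None  # set/cleared automatically by the conductor

    @property
    def schedulable(self):
        # The scheduler skips a node in either condition, but only
        # `maintenance` requires operator action to clear; a fault such
        # as 'bmc-unreachable' can self-clear when connectivity returns.
        return not self.maintenance and self.fault is None
```

Under this model, both conditions keep new instances off the node, while preserving the distinction that matters to the reporter: a transient fault never masquerades as deliberate maintenance.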

Changed in ironic:
status: Incomplete → New
Dmitry Tantsur (divius)
Changed in ironic:
status: New → Triaged
importance: Undecided → High
aeva black (tenbrae)
summary: - nova operations can lead to nodes entering maintenance mode
+ automatic operations can lead to nodes entering maintenance mode
Revision history for this message
aeva black (tenbrae) wrote :

Updated title to reflect the discussion: this is really caused by automatic operations, not Nova operations.

Changed in ironic:
importance: High → Medium
importance: Medium → Low