automatic operations can lead to nodes entering maintenance mode

Bug #1326279 reported by Robert Collins
This bug affects 1 person
Affects: Ironic
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

e.g. I did nova boot and my node went into maintenance mode. However, maintenance mode is an operator facility that users of Nova can't access, so when this happens to someone using the Nova API their node gets stuck and they can't fix it. E.g. imagine 'nova reboot INSTANCE' -> instance wedged.

Perhaps this is by design, but it seems most unfortunate to me.

ironic node-show 3f9df53e-5a76-47b3-a1c4-f596822bb43d
+------------------------+-----------------------------------------------------------------------+
| Property | Value |
+------------------------+-----------------------------------------------------------------------+
| chassis_uuid | None |
| console_enabled | False |
| created_at | 2014-06-04T04:18:43+00:00 |
| driver | pxe_ipmitool |
| driver_info | {u'pxe_deploy_ramdisk': u'3e7b3397-77fa-4d06-99bc-f02507b79e75', |
| | u'pxe_ramdisk': u'481eb2db-caf3-41df-912c-81f905c68689', |
| | u'pxe_image_source': u'd8089273-291b-406b-bad7-e17555b64765', |
| | u'pxe_root_gb': u'10', u'pxe_ephemeral_format': u'ext4', |
| | u'ipmi_username': u'Administrator', u'ipmi_address': u'x.x.x.x', |
| | u'pxe_kernel': u'eabb3e17-a9b1-420a-9bad-090b1f50df71', |
| | u'ipmi_password': u'xxxx', u'pxe_deploy_key': |
| | u'S15XQHTGQHKMDB5XA4RZZS4UHPA2MDJW', u'pxe_deploy_kernel': |
| | u'1d19c419-73b4-4e6b-90f5-b084e1c17f5a', u'pxe_swap_mb': u'0', |
| | u'pxe_ephemeral_gb': u'1590'} |
| extra | {} |
| instance_uuid | 9df33a91-4be3-429c-a712-35cd60a1e101 |
| last_error | During sync_power_state, max retries exceeded for node 3f9df53e- |
| | 5a76-47b3-a1c4-f596822bb43d, node state None does not match expected |
| | state 'power on'. Updating DB state to 'None' Switching node to |
| | maintenance mode. |
| maintenance | True |
| power_state | None |
| properties | {u'memory_mb': u'98304', u'cpu_arch': u'amd64', u'local_gb': u'1600', |
| | u'cpus': u'24'} |
| provision_state | deploy failed |
| reservation | None |
| target_power_state | None |
| target_provision_state | None |
| updated_at | 2014-06-04T07:11:40+00:00 |
| uuid | 3f9df53e-5a76-47b3-a1c4-f596822bb43d |
+------------------------+-----------------------------------------------------------------------+

Revision history for this message
Dmitry Tantsur (divius) wrote :

Well, what caused maintenance mode in your case? Automatic maintenance mode happens when the conductor is unable to access the node: network down, wrong SSH driver configuration, node disappeared, etc. Of these three, only the first is transient and might be recovered without operator intervention.

What exactly do you suggest we improve?

Changed in ironic:
status: New → Incomplete
Revision history for this message
Robert Collins (lifeless) wrote :

Oh, adding insult to injury, this failure mode leaves the instance stuck in deleting:
| 9df33a91-4be3-429c-a712-35cd60a1e101 | hw-test-9df33a91-4be3-429c-a712-35cd60a1e101 | ERROR | deleting | NOSTATE | ctlplane=10.10.16.147 |

Revision history for this message
aeva black (tenbrae) wrote :

Based on the provided status output, this error is not the result of a Nova operation --

| last_error | During sync_power_state, max retries exceeded for node

The Nova operation happened to coincide with this node becoming inaccessible by the IPMITool Power Driver, and during a periodic sync_power_state poll, after the conductor failed to determine the node's power state $max_retries times consecutively, Ironic removed that node from service. Any instance state is preserved by this action, so that Ironic can attempt to resume the prior operation once the operator restores connectivity to the node.
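The flow described above can be sketched as a simplified retry counter around the periodic power-state poll. This is a minimal illustration, not Ironic's actual code: the names `Node`, `sync_power_state`, and `read_power_state` are hypothetical, and `MAX_RETRIES` stands in for the conductor's max-retries setting.

```python
# Hypothetical sketch of the periodic sync_power_state behavior described
# above. After $max_retries consecutive failures to confirm the expected
# power state, the node is switched to maintenance mode.

MAX_RETRIES = 3  # illustrative stand-in for the conductor's retry limit


class Node:
    def __init__(self, uuid):
        self.uuid = uuid
        self.maintenance = False
        self.power_state = 'power on'   # expected state recorded in the DB
        self.last_error = None
        self.sync_failures = 0


def sync_power_state(node, read_power_state):
    """One pass of the periodic poll; read_power_state queries the BMC."""
    actual = read_power_state(node)     # e.g. None if the BMC is unreachable
    if actual == node.power_state:
        node.sync_failures = 0          # healthy: reset the failure counter
        return
    node.sync_failures += 1
    if node.sync_failures > MAX_RETRIES:
        # Repeated mismatches: record the observed state, take the node
        # out of service, and leave an error for the operator.
        node.power_state = actual
        node.maintenance = True
        node.last_error = (
            "During sync_power_state, max retries exceeded for node %s"
            % node.uuid)
```

Note that any instance state on the node is untouched by this sketch, mirroring the behavior described above: only the maintenance flag, power state, and last_error change.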

Alternatively, the operator may delete the node, which should remove the "error:deleting" instance from Nova as well.

The underlying failure is not presented here -- perhaps the networking was inaccessible, or the BMC crashed, I can't tell.

If this happens during a "nova boot", the user of nova should re-issue their request, and the nova scheduler will find another (non-maintenance) node to deploy it to. The purpose of automatically moving a node to maintenance mode under certain failure conditions is to prevent further Nova failures which would otherwise occur when attempting to deploy instances to a node that is physically not manageable by Ironic.

Revision history for this message
Robert Collins (lifeless) wrote :

So, there's definitely an issue though with the nova instance being undeletable - that prevents the user freeing up quota space. I accept that the node power driver failure wasn't user initiated, but I worry about e.g. that happening to a deployed instance. Seems like transient failures shouldn't make nodes inoperable // trigger admin intervention.

e.g. I think I'm saying there is a design bug IMO - maintenance mode is an admin tool; 'Ironic cannot talk to a BMC' is a transient state internal to Ironic and not at all the same as maintenance mode.

Revision history for this message
Dmitry Tantsur (divius) wrote :

Ok, I see two potential issues here: 1. transient faults move nodes to maintenance mode (this can be solved in many ways, including retrying after some time); 2. there are conditions under which a nova instance can't be deleted (I've encountered this myself, though I'm not sure whether it's a Nova or an Ironic bug).
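The "retrying after some time" idea for issue 1 could look something like the helper below. This is only a sketch of the suggestion, not anything in Ironic: `check_with_backoff` and `probe` are hypothetical names, and the backoff schedule is arbitrary.

```python
# Hypothetical retry-with-backoff wrapper: give a transient BMC failure a
# chance to recover before declaring the node unreachable.
import time


def check_with_backoff(probe, attempts=3, base_delay=0.0):
    """Return True if probe() succeeds within `attempts` tries.

    probe is a zero-argument callable returning a truthy value on success
    (e.g. a BMC power-state query). Delay doubles between tries.
    """
    for i in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * (2 ** i))  # exponential backoff between tries
    return False
```

Only after this returns False would the conductor treat the failure as non-transient and involve an operator.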

If you agree, could you please separate this bug into two more specific ones?

Revision history for this message
Robert Collins (lifeless) wrote :

@Dmitry I'll keep this one for the automated entry into maintenance mode, and file a new one about sync issues between nova and Ironic.

@Devananda - I totally agree with Ironic not scheduling stuff onto hardware that it cannot currently talk to, but maintenance mode as something that requires admin intervention to fix, should also require admin choice to enter. E.g. we should separate out 'I am doing maintenance' from 'Ironic cannot talk to the BMC // the deploy disk did not check in in time // ...'
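The separation proposed above could be modeled as two distinct fields on the node: an operator-set maintenance flag and an automatically managed fault indicator. This is a hypothetical illustration of the proposal, not Ironic's data model; `NodeState`, `fault`, and `schedulable` are invented names.

```python
# Hypothetical model separating "operator is doing maintenance" from
# "Ironic cannot currently talk to the BMC / deploy ramdisk / etc."
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    maintenance: bool = False    # set and cleared only by an operator
    fault: Optional[str] = None  # set/cleared automatically by the conductor

    @property
    def schedulable(self):
        # The scheduler skips a node in either condition, but only
        # `maintenance` requires operator action to clear; a fault such
        # as 'bmc-unreachable' can self-clear when connectivity returns.
        return not self.maintenance and self.fault is None
```

Under this model, both conditions keep new instances off the node, while preserving the distinction that matters to the reporter: a transient fault never masquerades as deliberate maintenance.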

Changed in ironic:
status: Incomplete → New
Dmitry Tantsur (divius)
Changed in ironic:
status: New → Triaged
importance: Undecided → High
aeva black (tenbrae)
summary: - nova operations can lead to nodes entering maintenance mode
+ automatic operations can lead to nodes entering maintenance mode
Revision history for this message
aeva black (tenbrae) wrote :

Updated title to reflect the discussion: this is really caused by automatic operations, not Nova operations.

Changed in ironic:
importance: High → Medium
importance: Medium → Low