automatic operations can lead to nodes entering maintenance mode
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ironic | Triaged | Low | Unassigned | | |
Bug Description
For example, I ran nova boot and my node went into maintenance mode. However, maintenance mode is an operator facility that users of Nova can't access, so if this happens when someone uses the Nova API, their node gets stuck and they can't fix it. E.g. imagine 'nova reboot INSTANCE' -> instance wedged.
Perhaps this is by design, but it seems most unfortunate to me.
ironic node-show 3f9df53e-
+------
| Property | Value |
+------
| chassis_uuid | None |
| console_enabled | False |
| created_at | 2014-06-
| driver | pxe_ipmitool |
| driver_info | {u'pxe_
| | u'pxe_ramdisk': u'481eb2db-
| | u'pxe_image_
| | u'pxe_root_gb': u'10', u'pxe_ephemeral
| | u'ipmi_username': u'Administrator', u'ipmi_address': u'x.x.x.x', |
| | u'pxe_kernel': u'eabb3e17-
| | u'ipmi_password': u'xxxx', u'pxe_deploy_key': |
| | u'S15XQHTGQHKMD
| | u'1d19c419-
| | u'pxe_ephemeral
| extra | {} |
| instance_uuid | 9df33a91-
| last_error | During sync_power_state, max retries exceeded for node 3f9df53e- |
| | 5a76-47b3-
| | state 'power on'. Updating DB state to 'None' Switching node to |
| | maintenance mode. |
| maintenance | True |
| power_state | None |
| properties | {u'memory_mb': u'98304', u'cpu_arch': u'amd64', u'local_gb': u'1600', |
| | u'cpus': u'24'} |
| provision_state | deploy failed |
| reservation | None |
| target_power_state | None |
| target_
| updated_at | 2014-06-
| uuid | 3f9df53e-
+------
Changed in ironic: | |
status: | New → Triaged |
importance: | Undecided → High |
summary: |
- nova operations can lead to nodes entering maintenance mode
+ automatic operations can lead to nodes entering maintenance mode |
Well, what caused maintenance mode in your case? Automatic maintenance mode is triggered when the conductor is unable to access the node: network down, wrong SSH driver configuration, node disappeared, etc. Of these three, only the first is transient and might recover without operator intervention.
What exactly do you suggest to improve?
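For reference, the behaviour reported in last_error above can be sketched roughly as follows. This is only an illustration of the "retry, then switch to maintenance" pattern, not Ironic's actual code: the `Node` class, `sync_power_state` signature, `read_power` callback and `max_retries` default are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node record (not Ironic's model)."""
    uuid: str
    power_state: Optional[str] = "power on"
    maintenance: bool = False
    last_error: Optional[str] = None
    failed_syncs: int = 0


def sync_power_state(node: Node,
                     read_power: Callable[[], str],
                     max_retries: int = 3) -> None:
    """Run one periodic sync pass for a node.

    read_power is an assumed callback that asks the hardware for its
    power state and raises ConnectionError when the node is unreachable.
    """
    try:
        actual = read_power()
    except ConnectionError as exc:
        node.failed_syncs += 1
        if node.failed_syncs >= max_retries:
            # Give up: clear the recorded state and flag the node so that
            # no further automatic operations are attempted on it.
            node.power_state = None
            node.maintenance = True
            node.last_error = (
                "During sync_power_state, max retries exceeded for node "
                f"{node.uuid}: {exc}"
            )
        return

    # The node answered, so reset the failure counter and record the state.
    node.failed_syncs = 0
    node.power_state = actual
```

In this sketch a transient failure (network down for fewer than `max_retries` passes) clears itself on the next successful pass, while a persistent one leaves the node in maintenance until an operator resets the flag, which is the situation a Nova user cannot resolve.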