Activity log for bug #1611137

Date Who What changed Old value New value Message
2016-08-08 22:33:27 Mario Villaplana bug added bug
2016-08-08 22:34:37 Mario Villaplana ironic: assignee Mario Villaplana (mario-villaplana-j)
2016-08-22 17:03:25 Mario Villaplana description

Old value:

Currently, there's no way to automatically abort a clean step that's taking longer than expected. This has a number of problems:

* Nodes can get stuck on a particular clean step for long periods of time if they never make it to the "CLEANWAIT" state (see this bug: https://bugs.launchpad.net/ironic/+bug/1611135)
* There's no way to automatically detect potential issues with nodes that may be non-fatal. For example, if a disk takes N hours to erase but we only expect it to take N-3, the clean step may succeed, but the disk might have degraded performance that is unacceptable for a provisioned node to have.
* The clean_callback_timeout, while useful for detecting problems with cleaning after a transition to CLEANWAIT, is too generic. An operator may want cleaning to fail on a fast clean step after 1 minute and on a slow clean step after 24 hours.

One potential way to do this is to let hardware manager creators specify a "timeout" field when defining a clean step, similar to how the "abortable" field was added. If a clean step exceeds that value, the node will be placed in CLEANFAIL after the specified amount of time has elapsed without that clean step successfully finishing.

New value: the same text, with the following appended:

As described above, an optional "timeout" field will be added to each clean step specification, indicating the number of seconds after which the clean step times out. For example, this would be the clean step definition for erase_devices with a timeout of 2400 seconds:

```
{
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400
}
```

The conductor will keep track of this timeout and place the node in the CLEANFAIL provision state if the timeout is reached. Another option is having the agent itself keep track of the timeout. This would allow easier differentiation between cases where the ramdisk doesn't boot and an actual clean step timeout.
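The per-step timeout check proposed in this description can be sketched as follows. This is a minimal illustration, not actual Ironic conductor code: the function name `check_clean_step_timeout` and the `started_at` parameter are hypothetical, and only the clean step dict shape comes from the description above.

```python
import time

def check_clean_step_timeout(step, started_at, now=None):
    """Return True if the step's optional 'timeout' (seconds) is exceeded.

    A step with no 'timeout' field never times out, matching the
    proposal that the field is optional.
    """
    timeout = step.get('timeout')
    if timeout is None:
        return False  # no timeout specified: nothing to enforce
    if now is None:
        now = time.monotonic()
    return (now - started_at) > timeout

# Clean step definition from the proposal, with a 2400-second timeout.
step = {
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400,
}
```

For example, `check_clean_step_timeout(step, started_at=0, now=2401)` would report a timeout, while a step dict without a `'timeout'` key would never time out.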
2016-08-23 15:35:16 Dmitry Tantsur ironic: status New Confirmed
2016-08-23 15:35:21 Dmitry Tantsur ironic: importance Undecided Wishlist
2016-08-23 15:52:13 Mario Villaplana description

Old value: identical to the new value of the 2016-08-22 description change above.

New value: the same text, with the example clean step definition reformatted as an indented block and the following sentence added after the sentence about the conductor placing the node in CLEANFAIL: "If no timeout is specified, this means that no timeout will be enforced."
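The conductor-side option described in the final revision (periodically checking running clean steps and failing nodes whose timeout has elapsed) can be sketched as below. All names here are illustrative assumptions, not Ironic's actual data model: `clean_step_started_at` is a hypothetical field, and the real conductor would transition the node to CLEANFAIL rather than return a list.

```python
def sync_clean_step_timeouts(nodes, now):
    """Return UUIDs of nodes whose running clean step exceeded its timeout.

    Nodes whose step has no 'timeout' field are skipped, matching the
    rule that an unspecified timeout means no timeout is enforced.
    """
    timed_out = []
    for node in nodes:
        step = node.get('clean_step') or {}
        timeout = step.get('timeout')
        if timeout is None:
            continue  # no timeout specified for this step
        if now - node['clean_step_started_at'] > timeout:
            # In the real conductor this node would move to CLEANFAIL.
            timed_out.append(node['uuid'])
    return timed_out

# Two example nodes: one with a 2400-second timeout, one without.
nodes = [
    {'uuid': 'node-a',
     'clean_step': {'step': 'erase_devices', 'timeout': 2400},
     'clean_step_started_at': 0},
    {'uuid': 'node-b',
     'clean_step': {'step': 'update_firmware'},
     'clean_step_started_at': 0},
]
```

With these examples, checking at `now=3000` flags only `node-a`; the agent-side alternative mentioned in the description would run an equivalent check inside the ramdisk instead.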
2016-08-23 15:57:50 Jay Faulkner tags rfe-approved
2016-10-28 20:57:20 OpenStack Infra ironic: status Confirmed In Progress
2023-11-19 15:09:00 Dmitry Tantsur ironic: status In Progress Triaged
2023-11-19 15:09:03 Dmitry Tantsur ironic: assignee Mario Villaplana (mario-villaplana-j)