Activity log for bug #1611137

Date Who What changed Old value New value Message
2016-08-08 22:33:27 Mario Villaplana bug added bug
2016-08-08 22:34:37 Mario Villaplana ironic: assignee Mario Villaplana (mario-villaplana-j)
2016-08-22 17:03:25 Mario Villaplana description

Old value:

Currently, there's no way to automatically abort a clean step that's taking longer than expected. This has a number of problems:

* Nodes can get stuck on a particular clean step for long periods of time if they never make it to the "CLEANWAIT" state (see this bug: https://bugs.launchpad.net/ironic/+bug/1611135)
* There's no way to automatically detect potential issues with nodes that may be non-fatal. For example, if a disk takes N hours to erase but we only expect it to take N-3, the clean step may succeed, but the disk might have degraded performance that is unacceptable for a provisioned node to have.
* The clean_callback_timeout, while useful for detecting problems with cleaning after a transition to CLEANWAIT, is too generic. An operator may want cleaning to fail on a fast clean step after 1 minute and on a slow clean step after 24 hours.

One potential way to do this is to let hardware manager creators specify a "timeout" field when defining a clean step, similar to how the "abortable" field was added. If a clean step exceeds that value, the node will be placed in CLEANFAIL after the specified amount of time has elapsed without that clean step successfully finishing.

New value: the same text, with the following appended:

As described above, an optional "timeout" field will be added to each clean step specification, indicating the number of seconds after which the clean step times out. For example, this would be the clean step definition for erase_devices with a timeout of 2400 seconds:

```
{
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400
}
```

The conductor will keep track of this timeout and place the node in the CLEANFAIL provision state if the timeout is reached. Another option is having the agent itself keep track of the timeout. This would allow easier differentiation between cases where the ramdisk doesn't boot and an actual clean step timeout.
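The per-step timeout check proposed in this description can be sketched as follows. This is a minimal illustration, not actual Ironic conductor code: the function name `check_clean_step_timeout` and the `started_at` parameter are hypothetical, and only the clean step dict shape comes from the description above.

```python
import time

def check_clean_step_timeout(step, started_at, now=None):
    """Return True if the step's optional 'timeout' (seconds) is exceeded.

    A step with no 'timeout' field never times out, matching the
    proposal that the field is optional.
    """
    timeout = step.get('timeout')
    if timeout is None:
        return False  # no timeout specified: nothing to enforce
    if now is None:
        now = time.monotonic()
    return (now - started_at) > timeout

# Clean step definition from the proposal, with a 2400-second timeout.
step = {
    'step': 'erase_devices',
    'priority': 10,
    'interface': 'deploy',
    'reboot_requested': False,
    'abortable': True,
    'timeout': 2400,
}
```

For example, `check_clean_step_timeout(step, started_at=0, now=2401)` would report a timeout, while a step dict without a `'timeout'` key would never time out.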
2016-08-23 15:35:16 Dmitry Tantsur ironic: status New Confirmed
2016-08-23 15:35:21 Dmitry Tantsur ironic: importance Undecided Wishlist
2016-08-23 15:52:13 Mario Villaplana description

Old value: identical to the new value of the 2016-08-22 description change above.

New value: the same text, with the example clean step definition reformatted as an indented block and the following sentence added after the sentence about the conductor placing the node in CLEANFAIL: "If no timeout is specified, this means that no timeout will be enforced."
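The conductor-side option described in the final revision (periodically checking running clean steps and failing nodes whose timeout has elapsed) can be sketched as below. All names here are illustrative assumptions, not Ironic's actual data model: `clean_step_started_at` is a hypothetical field, and the real conductor would transition the node to CLEANFAIL rather than return a list.

```python
def sync_clean_step_timeouts(nodes, now):
    """Return UUIDs of nodes whose running clean step exceeded its timeout.

    Nodes whose step has no 'timeout' field are skipped, matching the
    rule that an unspecified timeout means no timeout is enforced.
    """
    timed_out = []
    for node in nodes:
        step = node.get('clean_step') or {}
        timeout = step.get('timeout')
        if timeout is None:
            continue  # no timeout specified for this step
        if now - node['clean_step_started_at'] > timeout:
            # In the real conductor this node would move to CLEANFAIL.
            timed_out.append(node['uuid'])
    return timed_out

# Two example nodes: one with a 2400-second timeout, one without.
nodes = [
    {'uuid': 'node-a',
     'clean_step': {'step': 'erase_devices', 'timeout': 2400},
     'clean_step_started_at': 0},
    {'uuid': 'node-b',
     'clean_step': {'step': 'update_firmware'},
     'clean_step_started_at': 0},
]
```

With these examples, checking at `now=3000` flags only `node-a`; the agent-side alternative mentioned in the description would run an equivalent check inside the ramdisk instead.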
2016-08-23 15:57:50 Jay Faulkner tags rfe-approved
2016-10-28 20:57:20 OpenStack Infra ironic: status Confirmed In Progress
2023-11-19 15:09:00 Dmitry Tantsur ironic: status In Progress Triaged
2023-11-19 15:09:03 Dmitry Tantsur ironic: assignee Mario Villaplana (mario-villaplana-j)