IPA device rescanning can break test runs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ironic-python-agent | New | Undecided | Unassigned |
Bug Description
We see occasional failures, mostly on older branches, of "ironic_
Upon looking in the node console logs, we can see the agent has been heart-beating.
We can also see where Ironic asks the agent for its deploy steps:
Example:
2024-05-21 17:00:20.169 587 DEBUG eventlet.
2024-05-21 17:00:20.346 587 DEBUG root [-] Executing command: deploy.
2024-05-21 17:00:20.369 587 DEBUG root [-] Getting deploy steps, called with node: {'id': 1, [trim]
2024-05-21 17:00:20.392 587 DEBUG root [-] Deploy steps before deduplication: {'GenericHardwa
Which looks good, but then "iscsistart -f" suddenly gets triggered again:
2024-05-21 17:00:20.421 587 DEBUG oslo_concurrenc
2024-05-21 17:00:20.629 587 DEBUG oslo_concurrenc
2024-05-21 17:00:20.641 587 DEBUG ironic_lib.utils [-] Command not found: "iscsistart -f" execute /usr/local/
2024-05-21 17:00:20.652 587 DEBUG root [-] No iscsi connection detected. Skipping iscsi. Error: [Errno 2] No such file or directory: 'iscsistart' _check_for_iscsi /usr/local/
This is because the ramdisk has triggered a rescan of devices to provide the most up-to-date information, which sort of makes sense, actually.
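One way to make the repeated "Command not found" noise cheaper and quieter is to look the tool up on PATH before shelling out at all. This is a hypothetical sketch, not the actual IPA `_check_for_iscsi` code; the function name and return convention here are invented for illustration:

```python
import shutil


def check_for_iscsi(binary="iscsistart"):
    """Return True if the iSCSI boot-volume check can proceed.

    Hypothetical sketch: resolving the binary on PATH first turns a
    missing tool into a cheap no-op instead of a failed subprocess
    call that gets logged on every device rescan.
    """
    if shutil.which(binary) is None:
        # Mirrors the log line above: no iscsistart, so skip iSCSI.
        return False
    # ... here the real agent would run "iscsistart -f" and act on
    # any firmware-provided iSCSI targets ...
    return True
```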
Where things go sideways is if we are looking for a root device hint:
2024-05-21 17:00:21.991 587 DEBUG root [-] Looking for a device matching root hints {'name': '/dev/md0'} get_os_
2024-05-21 17:00:22.000 587 DEBUG ironic_
And then here is where things start going sideways:
2024-05-21 17:00:24.948 587 WARNING ironic_lib.utils [-] No device found that matches the root device hints {'name': '/dev/md0'}: StopIteration[00m
2024-05-21 17:00:24.956 587 DEBUG root [-] Still waiting for the root device to appear, attempt 1 of 10 wait_for_disks /usr/local/
Note the "attempt 1 of 10": that is startup code intended for ramdisk boot...
2024-05-21 17:01:20.191 587 WARNING ironic_lib.utils [-] No device found that matches the root device hints {'name': '/dev/md0'}: StopIteration[00m
2024-05-21 17:01:20.204 587 DEBUG root [-] Still waiting for the root device to appear, attempt 10 of 10 wait_for_disks /usr/local/
2024-05-21 17:01:20.209 587 WARNING root [-] The root device was not detected in 27 seconds[00m
2024-05-21 17:01:20.222 587 DEBUG root [-] Returning deploy steps: defaultdict(<class 'list'>, [trim]
2024-05-21 17:01:20.251 587 INFO root [-] Synchronous command get_deploy_steps completed: {'deploy_steps': [trim]
2024-05-21 17:01:20.311 587 INFO eventlet.
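The retry pattern visible in the logs (ten attempts, a sleep between each, a warning with the elapsed time at the end) can be sketched as follows. This is an illustrative reconstruction, not the actual `wait_for_disks` implementation; the `find_root_device` callable and the default interval are assumptions:

```python
import time


def wait_for_disks(find_root_device, attempts=10, interval=3):
    """Hypothetical sketch of the startup wait loop seen in the logs:
    retry the root-device lookup a fixed number of times, sleeping
    between attempts, and give up with a warning if it never appears.
    """
    start = time.time()
    for attempt in range(1, attempts + 1):
        device = find_root_device()
        if device is not None:
            return device
        print("Still waiting for the root device to appear, "
              "attempt %d of %d" % (attempt, attempts))
        if attempt < attempts:
            time.sleep(interval)
    print("The root device was not detected in %d seconds"
          % int(time.time() - start))
    return None
```

With the intervals and lookup timing the logs show, ten failed attempts add up to roughly half a minute of blocking inside a synchronous command handler.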
And so the agent finally gets its act together after more than a full minute. Some of this is variable based upon performance, but essentially we are doing work we really don't need to do, on a code path intended for boot, just to assure everything is okay. Because it took longer than sixty seconds and the conductor held the lock the entire time, nothing moved forward: heartbeats are not acted upon while locks are held.
So what is the actual issue:
While the command is generated on the extension:
We call cache_node:
However, we don't proceed past line 3433. The node is not new nor changed as far as we're aware so we don't trigger down that path.
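The short-circuit described here can be sketched roughly as below. This is a simplified illustration, not the real IPA `cache_node`; the module-level cache and the equality comparison are assumptions made for the sketch:

```python
# Hypothetical sketch of the cache_node short-circuit: hardware state is
# only refreshed when there is no cached node yet, or the node the
# conductor sent differs from the cached copy. An unchanged node
# returns early, so the refresh path below is never reached.
_CACHED_NODE = None


def cache_node(node):
    global _CACHED_NODE
    if _CACHED_NODE == node:
        return False  # not new, not changed: no refresh triggered
    _CACHED_NODE = node
    # ... the real agent would kick off the hardware refresh path here ...
    return True
```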
So eventually we get steps, which is a static call across all default hardware managers (their responses are static by default), and then we trigger de-duplication at https:/
And inside de-duplication, the very first action we take calls 'evaluate_
And looking at 'evaluate_
_check_for_iscsi, ipmi module loading, even md device scanning and assembly all take place before triggering self.wait_
So it seems there are a couple different things which need to happen here.
1) We can likely just skip loading the iscsi, ibft, and ipmi stuff after the first initialization, and stop spamming the logs. We already do something similar with multipath. The md device scan and assemble is likely okay to keep in this logic because it is attempting to handle devices which may have changed.
2) We likely also need to short-circuit the wait_for_disks logic so we don't try to scan and re-determine devices multiple times. Letting it run once would at least keep the behavior pattern from breaking and still reach the desired state eventually. One aspect which is important is the logic in get_os_
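Both proposed fixes amount to adding run-once guards around the boot-time work while leaving the per-rescan device handling alone. A minimal sketch, with invented names (`_EARLY_INIT_DONE`, `_DISK_WAIT_DONE`, and the injected callables are not from the actual patch):

```python
# Hypothetical sketch of the two fixes above: one-time environment
# setup (iSCSI/iBFT/IPMI module loading) and the blocking
# wait_for_disks loop each run only on the first evaluation. Later
# rescans skip them, but still rescan and assemble md devices, since
# those may legitimately change between calls.
_EARLY_INIT_DONE = False
_DISK_WAIT_DONE = False


def evaluate_hardware_support(wait_for_disks, scan_md_devices):
    global _EARLY_INIT_DONE, _DISK_WAIT_DONE
    if not _EARLY_INIT_DONE:
        # One-time setup: check for iSCSI, load IPMI modules, etc.
        _EARLY_INIT_DONE = True
    scan_md_devices()  # still runs on every rescan
    if not _DISK_WAIT_DONE:
        wait_for_disks()  # only block waiting for disks on first call
        _DISK_WAIT_DONE = True
```

With guards like these, a rescan triggered from get_deploy_steps would no longer spend up to half a minute re-waiting for a root device that was already found at boot.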
This issue was fixed in the openstack/ ironic- python- agent 9.12.0 release.