IPA device rescanning can break test runs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ironic-python-agent | New | Undecided | Unassigned |
Bug Description
We see occasional failures, mostly on older branches, of "ironic_
Upon looking in the node console logs, we can see the agent has been heart-beating.
We can also see where Ironic asks the agent for its deploy steps:
Example:
2024-05-21 17:00:20.169 587 DEBUG eventlet.
2024-05-21 17:00:20.346 587 DEBUG root [-] Executing command: deploy.
2024-05-21 17:00:20.369 587 DEBUG root [-] Getting deploy steps, called with node: {'id': 1, [trim]
2024-05-21 17:00:20.392 587 DEBUG root [-] Deploy steps before deduplication: {'GenericHardwa
Which looks good, but then "iscsistart -f" suddenly gets triggered again:
2024-05-21 17:00:20.421 587 DEBUG oslo_concurrenc
2024-05-21 17:00:20.629 587 DEBUG oslo_concurrenc
2024-05-21 17:00:20.641 587 DEBUG ironic_lib.utils [-] Command not found: "iscsistart -f" execute /usr/local/
2024-05-21 17:00:20.652 587 DEBUG root [-] No iscsi connection detected. Skipping iscsi. Error: [Errno 2] No such file or directory: 'iscsistart' _check_for_iscsi /usr/local/
This is because the ramdisk has triggered a rescan of devices to provide the most up-to-date information, which sort of makes sense, actually.
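One way to make the repeated "Command not found" noise cheaper and quieter is to look the tool up on PATH before shelling out at all. This is a hypothetical sketch, not the actual IPA `_check_for_iscsi` code; the function name and return convention here are invented for illustration:

```python
import shutil


def check_for_iscsi(binary="iscsistart"):
    """Return True if the iSCSI boot-volume check can proceed.

    Hypothetical sketch: resolving the binary on PATH first turns a
    missing tool into a cheap no-op instead of a failed subprocess
    call that gets logged on every device rescan.
    """
    if shutil.which(binary) is None:
        # Mirrors the log line above: no iscsistart, so skip iSCSI.
        return False
    # ... here the real agent would run "iscsistart -f" and act on
    # any firmware-provided iSCSI targets ...
    return True
```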
Where things go sideways is if we are looking for a root device hint:
2024-05-21 17:00:21.991 587 DEBUG root [-] Looking for a device matching root hints {'name': '/dev/md0'} get_os_
2024-05-21 17:00:22.000 587 DEBUG ironic_
And then here is where things start going sideways:
2024-05-21 17:00:24.948 587 WARNING ironic_lib.utils [-] No device found that matches the root device hints {'name': '/dev/md0'}: StopIteration[00m
2024-05-21 17:00:24.956 587 DEBUG root [-] Still waiting for the root device to appear, attempt 1 of 10 wait_for_disks /usr/local/
Note the "attempt 1 of 10": that is startup code intended for ramdisk boot...
2024-05-21 17:01:20.191 587 WARNING ironic_lib.utils [-] No device found that matches the root device hints {'name': '/dev/md0'}: StopIteration[00m
2024-05-21 17:01:20.204 587 DEBUG root [-] Still waiting for the root device to appear, attempt 10 of 10 wait_for_disks /usr/local/
2024-05-21 17:01:20.209 587 WARNING root [-] The root device was not detected in 27 seconds[00m
2024-05-21 17:01:20.222 587 DEBUG root [-] Returning deploy steps: defaultdict(<class 'list'>, [trim]
2024-05-21 17:01:20.251 587 INFO root [-] Synchronous command get_deploy_steps completed: {'deploy_steps': [trim]
2024-05-21 17:01:20.311 587 INFO eventlet.
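The retry pattern visible in the logs (ten attempts, a sleep between each, a warning with the elapsed time at the end) can be sketched as follows. This is an illustrative reconstruction, not the actual `wait_for_disks` implementation; the `find_root_device` callable and the default interval are assumptions:

```python
import time


def wait_for_disks(find_root_device, attempts=10, interval=3):
    """Hypothetical sketch of the startup wait loop seen in the logs:
    retry the root-device lookup a fixed number of times, sleeping
    between attempts, and give up with a warning if it never appears.
    """
    start = time.time()
    for attempt in range(1, attempts + 1):
        device = find_root_device()
        if device is not None:
            return device
        print("Still waiting for the root device to appear, "
              "attempt %d of %d" % (attempt, attempts))
        if attempt < attempts:
            time.sleep(interval)
    print("The root device was not detected in %d seconds"
          % int(time.time() - start))
    return None
```

With the intervals and lookup timing the logs show, ten failed attempts add up to roughly half a minute of blocking inside a synchronous command handler.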
And so the agent finally gets its act together after more than a full minute. Some of this is variable based upon performance, but essentially we are doing work we really don't need to do, on a code path intended for boot, just to assure everything is okay. Because it took longer than sixty seconds and the conductor held the lock the entire time, nothing moved forward: heartbeats are not acted upon while locks are held.
So what is the actual issue:
While the command is generated on the extension:
We call cache_node:
However, we don't proceed past line 3433. The node is not new nor changed as far as we're aware so we don't trigger down that path.
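The short-circuit described here can be sketched roughly as below. This is a simplified illustration, not the real IPA `cache_node`; the module-level cache and the equality comparison are assumptions made for the sketch:

```python
# Hypothetical sketch of the cache_node short-circuit: hardware state is
# only refreshed when there is no cached node yet, or the node the
# conductor sent differs from the cached copy. An unchanged node
# returns early, so the refresh path below is never reached.
_CACHED_NODE = None


def cache_node(node):
    global _CACHED_NODE
    if _CACHED_NODE == node:
        return False  # not new, not changed: no refresh triggered
    _CACHED_NODE = node
    # ... the real agent would kick off the hardware refresh path here ...
    return True
```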
So eventually we get steps, which is a static call across all default hardware managers (their responses are static by default), and then we trigger de-duplication at https:/
And inside de-duplication, the very first action we take calls 'evaluate_
And looking at 'evaluate_
_check_for_iscsi, ipmi module loading, even md device scanning and assembly all take place before triggering self.wait_
So it seems there are a couple different things which need to happen here.
1) We can likely just skip loading the iscsi, ibft, and ipmi stuff after the first initialization, and stop spamming the logs. We already do something similar with multipath. The md device scan and assemble is likely okay to keep in this logic because it is attempting to handle devices which may have changed.
2) We likely also need to short-circuit the wait_for_disks logic so we don't try to scan and re-determine devices multiple times. Letting it run once would at least keep the behavior pattern from breaking and still reach the desired state eventually. One aspect which is important is the logic in get_os_
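Both proposed fixes amount to adding run-once guards around the boot-time work while leaving the per-rescan device handling alone. A minimal sketch, with invented names (`_EARLY_INIT_DONE`, `_DISK_WAIT_DONE`, and the injected callables are not from the actual patch):

```python
# Hypothetical sketch of the two fixes above: one-time environment
# setup (iSCSI/iBFT/IPMI module loading) and the blocking
# wait_for_disks loop each run only on the first evaluation. Later
# rescans skip them, but still rescan and assemble md devices, since
# those may legitimately change between calls.
_EARLY_INIT_DONE = False
_DISK_WAIT_DONE = False


def evaluate_hardware_support(wait_for_disks, scan_md_devices):
    global _EARLY_INIT_DONE, _DISK_WAIT_DONE
    if not _EARLY_INIT_DONE:
        # One-time setup: check for iSCSI, load IPMI modules, etc.
        _EARLY_INIT_DONE = True
    scan_md_devices()  # still runs on every rescan
    if not _DISK_WAIT_DONE:
        wait_for_disks()  # only block waiting for disks on first call
        _DISK_WAIT_DONE = True
```

With guards like these, a rescan triggered from get_deploy_steps would no longer spend up to half a minute re-waiting for a root device that was already found at boot.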
This issue was fixed in the openstack/ ironic- python- agent 9.12.0 release.