ironic-python-agent

[RFE] Feature proposal for "quiet/non-blocking disk cleanup"

Bug #2061362 reported by Adam Rozman on 2024-04-15

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	ironic-python-agent	Incomplete	Undecided	Adam Rozman

Bug Description

The idea is to implement a configuration option that lets IPA suppress errors related to disk metadata cleanup, so that they do not cause a provisioning / cleanup failure.

Instead of causing a failure, IPA would (when the option is activated) log the exception with some additional information, stop cleaning the faulty disk, and continue the cleanup with the next disk.
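
The behavior described above could be sketched roughly as follows. This is only an illustration and not the code from the proposed patch: `erase_metadata_on_all`, `erase_one` and `skip_errors` are hypothetical names, and `erase_one` stands in for whatever routine wipes the metadata of a single device.

```python
import logging

LOG = logging.getLogger(__name__)


def erase_metadata_on_all(devices, erase_one, skip_errors=False):
    """Wipe metadata on every device and return the devices that failed.

    `erase_one` is a hypothetical callable that wipes one device and
    raises an exception on failure. With skip_errors=False the first
    failure aborts the whole cleanup, as it does today.
    """
    failed = []
    for dev in devices:
        try:
            erase_one(dev)
        except Exception:
            if not skip_errors:
                raise
            # Log the traceback, give up on this disk, go on to the next one.
            LOG.exception("Metadata cleanup failed on %s; skipping it", dev)
            failed.append(dev)
    return failed
```

Returning the list of failed devices keeps the information available to the caller, so a skipped disk is still reported and not silently forgotten.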

The configuration option could be provided as part of the IPA config file, cmdline argument or a kernel cmdline parameter.
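
As a rough sketch of the kernel cmdline variant, a boolean flag could be read from the contents of /proc/cmdline as shown below. The parameter name `ipa-skip-cleanup-errors` and the helper `read_bool_from_cmdline` are hypothetical and only serve the illustration; the option name used by the actual implementation may differ.

```python
def read_bool_from_cmdline(cmdline, key, default=False):
    """Return the boolean value of `key` in a kernel command line string.

    A bare `key` counts as true; `key=value` is true for the values
    1/true/yes/on (case-insensitive). The first occurrence wins.
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name != key:
            continue
        if not sep:
            return True
        return value.strip().lower() in ("1", "true", "yes", "on")
    return default


# Hypothetical parameter name, used only for this example.
cmdline = "ro quiet ipa-skip-cleanup-errors=true"
skip_errors = read_bool_from_cmdline(cmdline, "ipa-skip-cleanup-errors")
```

On a real system the string would be read from /proc/cmdline; it is passed in as an argument here so that the parsing can be tried without booting anything.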

Reasoning:

It could be argued that a disk cleanup failure means the machine in question is unreliable and should not be used until the disk issue has been sorted out. The problem is that, without an option to suppress cleanup errors, temporary disk issues become harder to handle, and that could hurt business continuity. A disk having issues does not automatically mean that it has any actual effect on the day-to-day operations of a machine: the disk could be unused, or users could simply avoid it temporarily.

Example scenario:

Imagine a shared hardware lab that is used by multiple development organizations, one of which would like to test a software stack that involves IPA provisioning. The machines in the lab have tens of directly attached disks, and a SAN is attached as well, so from the Linux perspective hundreds of individual disks are detected. Out of all these disks, one disk on one machine goes bad and causes I/O errors during disk metadata cleanup. That delays the testing and verification of the whole stack, which in turn delays software releases, and so on.

Adam Rozman (rozzix) wrote on 2024-04-15:

I have already implemented the feature and I will push it upstream as soon as possible.

Changed in ironic-python-agent:
assignee:	nobody → Adam Rozman (rozzix)
status:	New → In Progress

Adam Rozman (rozzix) on 2024-04-19

description:	updated

Julia Kreger (juliaashleykreger) wrote on 2024-04-27:

I guess this sort of makes sense if you're using a device which is locked, for example a block device shared across multiple systems. That seems like the only way to get past such a case.

Julia Kreger (juliaashleykreger) on 2024-04-29

summary:	- Feature proposal for "quiet/non-blocking disk cleanup"
	+ [RFE] Feature proposal for "quiet/non-blocking disk cleanup"
tags:	added: rfe

Jay Faulkner (jason-oldos) wrote on 2024-05-07:

After discussion in the IRC meeting yesterday, we were a little concerned about ignoring errors being a first-class method for handling edge-case hardware.

Can you provide more detail on the issue you're trying to work around? We already have specific features for excluding devices from cleaning, using root_device_hints or by overriding the erase_devices_metadata command. I am concerned that ignoring failures in these cases could lead to unexpected behavior; one example is machines booting into their OS on provisioning networks when/if PXE fails.

Several ideas and conjectures about your situation were put forward in the meeting; I suggest you read the logs: https://meetings.opendev.org/meetings/ironic/2024/ironic.2024-05-06-15.00.log.html#l-57

Getting a better idea of the specific use case here would help us in trying to find a solution that doesn't require us to ignore errors.

Adam Rozman (rozzix) wrote on 2024-05-08 (last edit on 2024-05-08):

Thanks for the great feedback. Just as a reference for readers, this is the implementation:
https://review.opendev.org/c/openstack/ironic-python-agent/+/915825

In my particular case I indirectly used a set of machines in a lab that were attached to an FCoE SAN, but the faulty "disk" was not actually part of the SAN.
The machines had local disks too, and they also had strange devices whose origin I haven't managed to figure out; my suspicion is that these were simulated USB devices attached by the BMC:
```
model: SPI Flash LUN
name: /dev/disk/by-path/pci-0000:00:14.0-usb-0:4:1.0-scsi-0:0:0:2
```
They actually had no LUN, WWN or serial numbers btw...

These devices were picked up by the host OS on both SLES 15 SP4/SP5 and CentOS Stream 9, but they caused different issues on each OS.
On CentOS the Linux kernel was throwing errors related to the devices (dmesg, journal), but IPA skipped over these devices during cleanup. I guess that because of the kernel errors these devices were not properly presented to IPA via sysfs, but I am not sure what exactly caused the exclusion. With the same IPA version on SLES the kernel managed to handle the bad disks: there were no kernel errors, but accessing such devices caused I/O errors and failed cleanups.

When metadata cleanup was turned off during the deployment, IPA only cleaned the single disk designated by the root device hint, so that worked as expected. However, my
requirements stated that the stakeholder wants cleanup to work for all the disks, and if there is a faulty disk, then it should be "ignored/skipped".

I completely understand your view that, in principle, a faulty disk means a faulty machine, but as I stated in the original issue text, this assumption
is not correct in every case (some users just don't care).

Honestly, I have not tried overriding the cleanup step because I am using Ironic via Metal3, and I don't think it is possible to override IPA/Ironic steps in Metal3 yet. Or at least I haven't figured out how to do it. That is why I was looking for a simpler approach that would work for every type of Ironic deployment.

EDITS:
grammar + typos

Jay Faulkner (jason-oldos) wrote on 2024-05-10:

As mentioned in IRC, please provide full output from lspci, lsusb, and anything else we could use to identify programmatically that the disk is a fake-BMC-disk.

It's my belief exposing a disk this way is /probably/ a firmware bug in the BMC, but if we can get enough details we can skip this and solve the issue for all folks without having to ignore failure.
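
To illustrate what such programmatic identification might look like, here is a minimal sketch built only on the details reported so far in this thread (USB-attached by-path name, no WWN or serial number, model "SPI Flash LUN"). It is an unverified heuristic, not an existing IPA check: the function name and the dictionary keys are hypothetical, and the full lspci/lsusb output may well point to a more reliable criterion.

```python
def looks_like_bmc_virtual_disk(dev):
    """Heuristically flag a device that resembles the reported fake BMC disk.

    `dev` is a hypothetical dict with the optional keys "model",
    "by_path", "wwn" and "serial". All three conditions must hold.
    """
    model = (dev.get("model") or "").strip()
    by_path = dev.get("by_path") or ""
    has_identifiers = any(dev.get(key) for key in ("wwn", "serial"))
    return (
        "-usb-" in by_path              # attached over USB
        and not has_identifiers         # no WWN and no serial number
        and model == "SPI Flash LUN"    # model string reported above
    )


# The device reported in this bug, expressed in the hypothetical format.
reported = {
    "model": "SPI Flash LUN",
    "by_path": "/dev/disk/by-path/pci-0000:00:14.0-usb-0:4:1.0-scsi-0:0:0:2",
}
is_fake = looks_like_bmc_virtual_disk(reported)
```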

Changed in ironic-python-agent:
importance:	Undecided → Wishlist
status:	In Progress → Incomplete
importance:	Wishlist → Undecided
