Ironic python agent cleaning fails from CRC mismatch

Bug #1737556 reported by Doug Szumski
This bug affects 3 people
Affects: ironic-python-agent
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

During node cleaning, the generic hardware manager can fail in the `erasing device metadata` step if the GPT is invalid. Specifically, this can happen when the hardware manager calls `sgdisk -Z /dev/somedrive` to destroy the GPT and MBR data structures.

It isn't clear why sgdisk validates the GPT when the -Z flag (zap all) instructs it to destroy the GPT. However, sgdisk -Z succeeds when retried.

Example failure:

2017-12-11 12:14:47.449 7 ERROR ironic.drivers.modules.agent_base_vendor [-] Agent returned error for clean step {u'priority': 99, u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, u'step': u'erase_devices_metadata'} on node 1b973868-9734-4ecf-9700-c0730e97e031 : {u'message': u'Clean step failed: Error performing clean_step erase_devices_metadata: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/nvme3n1": Unexpected error while running command.\nCommand: sgdisk -Z /dev/nvme3n1\nExit code: 2\nStdout: u"Caution! After loading partitions, the CRC doesn\'t check out!\\nGPT data structures destroyed! You may now partition the disk using fdisk or\\nother utilities.\\n"\nStderr: u"\\x07Caution: invalid main GPT header, but valid backup; regenerating main header\\nfrom backup!\\n\\n\\x07Warning! Main partition table CRC mismatch! Loaded backup partition table\\ninstead of main partition table!\\n\\nWarning! One or more CRCs don\'t match. You should repair the disk!\\n\\nInvalid partition data!\\n"', u'code': 500, u'type': u'CleaningError', u'details': u'Error performing clean_step erase_devices_metadata: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/nvme3n1": Unexpected error while running command.\nCommand: sgdisk -Z /dev/nvme3n1\nExit code: 2\nStdout: u"Caution! After loading partitions, the CRC doesn\'t check out!\\nGPT data structures destroyed! You may now partition the disk using fdisk or\\nother utilities.\\n"\nStderr: u"\\x07Caution: invalid main GPT header, but valid backup; regenerating main header\\nfrom backup!\\n\\n\\x07Warning! Main partition table CRC mismatch! Loaded backup partition table\\ninstead of main partition table!\\n\\nWarning! One or more CRCs don\'t match. You should repair the disk!\\n\\nInvalid partition data!\\n"'}.

Workaround:

Retry the cleaning. For example, move the node to the `manage` state, and then to `provide`.

Doug Szumski (dszumski)
description: updated
Dmitry Tantsur (divius)
Changed in ironic-python-agent:
status: New → Triaged
importance: Undecided → High
John Fulton (jfulton-org) wrote :

I also ran into this. I have run the following four times, but the nodes still fail cleaning.

for node_ident in c05-h17-6048r c05-h21-6048r c05-h25-6048r; do
    echo "$node_ident"
    openstack baremetal node manage "$node_ident"
    openstack baremetal node maintenance unset "$node_ident"
    openstack baremetal node provide "$node_ident"
done

John Fulton (jfulton-org) wrote :

Here's what I think is happening:

ironic runs the command below but encounters the following:

[root@overcloud-compute-0 ~]# sgdisk -Z /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:5:0
Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning! One or more CRCs don't match. You should repair the disk!

Invalid partition data!
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
[root@overcloud-compute-0 ~]#

[root@overcloud-compute-0 ~]# echo $?
2
[root@overcloud-compute-0 ~]#

Because the return code is non-zero, it is registered as an error and disk cleaning fails.
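
To illustrate how the exit status alone turns this into a failure, here is a minimal Python sketch. It is not the agent's actual code: `run_checked` is a hypothetical helper, and a small Python child process stands in for `sgdisk -Z` so the example runs without touching a disk. Any runner that treats a non-zero status as fatal behaves this way, even when the output says the GPT structures were destroyed.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd and raise if it exits non-zero (hypothetical helper)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Stand-in for `sgdisk -Z <device>` on a disk with a bad GPT CRC:
# it reports success on stdout but still exits with status 2.
fake_sgdisk = [
    sys.executable, "-c",
    "import sys; print('GPT data structures destroyed!'); sys.exit(2)",
]

try:
    run_checked(fake_sgdisk)
    failed = False
except subprocess.CalledProcessError as exc:
    # The wipe message is still available, but the caller sees an error.
    failed = True
    exit_code = exc.returncode
    stdout = exc.stdout
```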

One workaround I found was to use gdisk to create a blank GPT:

[root@overcloud-compute-0 ~]# gdisk /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:5:0
GPT fdisk (gdisk) version 0.8.6

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
 1 - Use current GPT
 2 - Create blank GPT

Your answer: 2

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): Y
OK; writing new GUID partition table (GPT) to /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:5:0.
The operation has completed successfully.
[root@overcloud-compute-0 ~]#

After the above, I can run the same command and get a zero return code:

[root@overcloud-compute-0 ~]# sgdisk -Z /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:5:0
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
[root@overcloud-compute-0 ~]# echo $?
0
[root@overcloud-compute-0 ~]#

[root@overcloud-compute-0 ~]# sgdisk --version
GPT fdisk (sgdisk) version 0.8.6

[root@overcloud-compute-0 ~]#

As per https://www.rodsbooks.com/gdisk/whatsgpt.html:

"GPT adds CRC32 checksums to its data structures and stores those structures twice on the disk—once at the start of the disk and again at the end. These measures help protect the system against accidental damage caused by carelessness or disk errors."

It almost seems like an sgdisk bug: why should it care about its own CRC check when we've asked it, with -Z, to destroy the GPT and MBR data structures anyway?
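
For reference, the header CRC check mentioned above can be sketched in a few lines of Python. This is only an illustration and not how sgdisk is implemented: the GPT header stores a CRC32 at byte offset 16, computed over the first `header_size` bytes with that field zeroed. The helper names and the synthetic 92-byte header are made up for the demo; nothing is read from a real disk.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"


def gpt_header_crc_ok(header: bytes) -> bool:
    """Return True if the header's stored CRC32 matches a recomputed one."""
    if len(header) < 20 or header[:8] != GPT_SIGNATURE:
        return False
    # Header size is at offset 12, the stored CRC32 at offset 16.
    header_size, stored_crc = struct.unpack_from("<II", header, 12)
    if header_size < 20 or header_size > len(header):
        return False
    # The CRC is computed over the header with the CRC field itself zeroed.
    zeroed = header[:16] + b"\x00" * 4 + header[20:header_size]
    return zlib.crc32(zeroed) & 0xFFFFFFFF == stored_crc


def build_demo_header() -> bytes:
    """Build a synthetic 92-byte header with a valid CRC (demo data only)."""
    body = bytearray(92)
    body[0:8] = GPT_SIGNATURE
    struct.pack_into("<II", body, 8, 0x00010000, 92)  # revision, header size
    crc = zlib.crc32(bytes(body)) & 0xFFFFFFFF
    struct.pack_into("<I", body, 16, crc)
    return bytes(body)
```

Flipping any byte of the header returned by `build_demo_header()` makes `gpt_header_crc_ok` return False, which is the kind of mismatch sgdisk warns about before it zaps the disk.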

John Fulton (jfulton-org) wrote :

I found a non-interactive workaround, but it requires you to ignore errors from the workaround command.

Reproduce the problem:

[root@overcloud-compute-1 ~]# sgdisk -Z /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:6:0
Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning! One or more CRCs don't match. You should repair the disk!

Invalid partition data!
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
[root@overcloud-compute-1 ~]# echo $?
2
[root@overcloud-compute-1 ~]#

Override the earlier problem with --clear, though this unfortunately also returns an error code:

[root@overcloud-compute-1 ~]# sgdisk --clear /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:6:0
Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning! One or more CRCs don't match. You should repair the disk!

Invalid partition data!
Information: Creating fresh partition table; will override earlier problems!
The operation has completed successfully.
[root@overcloud-compute-1 ~]# echo $?
2
[root@overcloud-compute-1 ~]#

Now -Z does not return an error code:

[root@overcloud-compute-1 ~]# sgdisk -Z /dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:6:0
Found valid GPT with corrupt MBR; using GPT and will write new
protective MBR on save.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
[root@overcloud-compute-1 ~]# echo $?
0
[root@overcloud-compute-1 ~]#
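
The sequence above could be automated roughly as follows. This is a sketch under assumptions, not the fix that was merged: `zap_gpt` is a hypothetical name, and the `run` parameter is injectable only so the flow can be exercised without a real disk. It uses the same two sgdisk invocations shown in the transcript.

```python
import subprocess


def zap_gpt(device, run=subprocess.run):
    """Zap GPT/MBR on device, tolerating one failed attempt (sketch only)."""
    first = run(["sgdisk", "-Z", device], capture_output=True, text=True)
    if first.returncode == 0:
        return 0
    # Write a fresh table. Its exit status is ignored on purpose, because
    # it also reports the pre-existing damage (exit code 2 in the transcript).
    run(["sgdisk", "--clear", device], capture_output=True, text=True)
    # With a valid GPT in place, -Z is expected to succeed this time.
    second = run(["sgdisk", "-Z", device], capture_output=True, text=True)
    return second.returncode
```

A caller would still treat a non-zero result from the second `-Z` as a genuine failure.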

John Fulton (jfulton-org) wrote :

Some feedback from the sgdisk author on this scenario is below. Would the Ironic team be interested in simply changing Ironic's behavior based on the return code of sgdisk -Z?

---------- Forwarded message ---------
From: Rod Smith
Date: Tue, Aug 7, 2018 at 9:38 AM
Subject: Re: Should sgdisk -Z return an error code if the CRC check doesn't pass?
To: John Fulton

On 08/06/2018 09:33 PM, John Fulton wrote:
> Rod,
>
> First of all thank you for sgdisk; it's a great tool. I ran into an
> issue with it and I am curious if you consider it a bug.
>
> I ran `sgdisk -Z /path/to/dev` and though I received a message that
> the GPT data structures were destroyed, there was a warning that the
> CRCs don't match and the command returned an error code as per echo
> $?.
>
> I would think that if I used -Z that it shouldn't bother with a CRC
> check because it's deleting the GPT data structures anyway but I might
> be mistaken. I'm able to work around it by using gdisk with option 2
> to create a blank GPT, but the sgdisk use in this case is part of a
> larger automation project and the error code causes a failure in the
> larger system [1]. I looked for an option to sgdisk to have it ignore
> the CRC check but couldn't find one.
>
> If you think this is a bug I would be happy to file it. I looked for a
> bug tracker but couldn't find one on the project's sourceforge page.

The design of GPT fdisk is such that it ALWAYS tries to read partition
data structures from the disk; the code to read the command line
arguments comes AFTER the program tries to parse the partition table
data. Changing this is not simply a matter of moving a block of code,
either; it's fundamental to the design of the C++ class structures used
by the program.

As to the return code, that could more easily be changed; however, I'm
reluctant to do so because sgdisk is used by a large number of scripts,
some of which may rely on the current behavior. Thus, although changing
sgdisk to not return an error code when you wipe a damaged disk with
"-Z" makes conceptual sense, that change could cause problems for others.

In practice, these issues are both easily overcome -- if you expect your
script will be calling sgdisk on a disk that has no valid partition
table, you can redirect the output to /dev/null and ignore the return code.
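
The suggestion above (discard the output and do not fail on the return code) might look like this from Python. This is a hedged sketch, not Ironic's implementation: `zap_quietly` and `TOLERATED_EXIT_CODES` are hypothetical names, and tolerating only exit code 2 rather than every status is a narrowing I chose, based on the code seen in this report, so that unrelated failures still surface.

```python
import subprocess

# Assumption: 2 is the exit code this report shows for a damaged table.
TOLERATED_EXIT_CODES = {0, 2}


def zap_quietly(device, run=subprocess.run):
    """Run `sgdisk -Z`, discarding output and tolerating exit code 2."""
    result = run(
        ["sgdisk", "-Z", device],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode not in TOLERATED_EXIT_CODES:
        raise RuntimeError(
            "sgdisk -Z %s failed with exit code %d" % (device, result.returncode)
        )
    # Returned for logging only; callers should not treat 2 as fatal.
    return result.returncode
```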

summary: - Ironic python agent cleaning fails with invalid GPT
+ Ironic python agent cleaning fails from CRC mismatch
John Fulton (jfulton-org) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic-python-agent 3.4.0

This issue was fixed in the openstack/ironic-python-agent 3.4.0 release.

guochuanchuan (ruanben) wrote :

I found that the fix for this problem in the ironic-python-agent 3.4.0 release is:

https://github.com/openstack/ironic-python-agent/commit/bc21b5b1404c269d000e1de6f9dce9c046f7d15d

However, I hit the problem again, with an ironic-python-agent/shell/write_image.sh identical to the one in version 3.4.0.

The error when I deploy the node:

2018-10-30 12:25:14.576 986407 ERROR ironic.drivers.modules.agent_base_vendor InstanceDeployFailure: Failed to deploy instance: Failed to start the iSCSI target to deploy the node b9eeb23b-003b-4770-963a-16f8a08f6b4f. Error: {u'message': u'Unexpected error while running command.\nCommand: sgdisk -Z /dev/sda\nExit code: 2\nStdout: u"Caution! After loading partitions, the CRC doesn\'t check out!\\nGPT data structures destroyed! You may now partition the disk using fdisk or\\nother utilities.\\n"\nStderr: u"\\x07Caution: invalid main GPT header, but valid backup; regenerating main header\\nfrom backup!\\n\\n\\x07Warning! Main partition table CRC mismatch! Loaded backup partition table\\ninstead of main partition table!\\n\\nWarning! One or more CRCs don\'t match. You should repair the disk!\\n\\nInvalid partition data!\\n"', u'code': 500, u'type': u'ProcessExecutionError', u'details': u''}

Changed in ironic-python-agent:
status: Triaged → Fix Released