nvme-cli: fguid is printed as binary data and causes MAAS to fail erasing NVME disks

Bug #2051299 reported by maasuser1
This bug affects 2 people
Affects             Status        Importance  Assigned to
MAAS                Fix Released  High        Unassigned
maas-images         Triaged      Medium      Unassigned
nvme-cli (Ubuntu)   Fix Released  Undecided   Unassigned
  Jammy             Fix Released  Medium      Matthew Ruffell

Bug Description

[Impact]

When a user tries to release a system deployed with MAAS that has "erase disks on release" set, erasing NVMe disks fails on Jammy.

Traceback (most recent call last):
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 542, in <module>
    main()
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 522, in main
    disk_info = get_disk_info()
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 165, in get_disk_info
    return {kname: get_disk_security_info(kname) for kname in list_disks()}
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 165, in <dictcomp>
    return {kname: get_disk_security_info(kname) for kname in list_disks()}
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 158, in get_disk_security_info
    return get_nvme_security_info(disk)
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 64, in get_nvme_security_info
    output = output.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 385: invalid start byte

This is due to maas_wipe.py running "nvme id-ctrl <device>" and parsing the results. This should be human-readable data in string format, so utf-8 should be appropriate for MAAS to use.

Instead, the "fguid" field is being printed as binary data, and is not parsable as utf-8.

For example, from comment #8, the user sees:

`fguid : 2.`

On closer inspection, the hex is:

0x32,0x89,0x82,0x2E

Note it is cut off early, likely because the next byte would be 0x00 and is being interpreted as a null terminator.
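A minimal sketch of the decode failure MAAS hits (only the four bytes above are taken from the report; the surrounding line text is illustrative):

```python
# Reproduce the decode failure: 0x89 is a utf-8 continuation byte and
# can never start a sequence, so decoding raises.
raw_line = b"fguid     : 2\x89\x82."  # bytes 0x32,0x89,0x82,0x2E from above

try:
    raw_line.decode()  # maas_wipe.py effectively does output.decode()
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0x89 in position 13 ...
```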

Fix nvme-cli such that we print out the fguid as a correct utf-8 string, so MAAS works as intended.

[Testcase]

Deploy Jammy onto a system that has a NVME device.

$ sudo apt install nvme-cli

Run the 'id-ctrl' command and look at the fguid entry:

$ sudo nvme id-ctrl /dev/nvme1n1 | grep fguid
fguid :

Because this UUID is all zeros, the first byte was interpreted as a null terminator, and the UUID was not printed correctly.
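This matches the effect of printing the raw bytes as a C string, which stops at the first NUL; a rough Python emulation (the %s-style framing is an assumption based on the symptom):

```python
# Emulate printing the raw 128-bit FGUID as a C string: output stops at
# the first NUL byte, so an all-zero FGUID prints as nothing.
fguid = bytes(16)  # all-zero FGUID, as on this test system

printed = fguid.split(b"\x00", 1)[0]  # what a %s-style print would emit
print("fguid     : " + printed.decode())  # prints "fguid     : "
```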

There is a test package available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf387274-test

If you install the test package, the fguid will be printed as a proper string:

$ sudo nvme id-ctrl /dev/nvme1n1 | grep fguid
fguid : 00000000-0000-0000-0000-000000000000

Also check that json output works as expected:

$ sudo nvme id-ctrl -o json /dev/nvme1n1 | grep fguid
  "fguid" : "00000000-0000-0000-0000-000000000000",

Additionally, test that the new package allows a MAAS-deployed system to be released correctly with the erase option enabled; maas_wipe.py should now complete successfully.

[Where problems could occur]

We are changing the output of the 'id-ctrl' subcommand; no other subcommands are changed. Users who, for some reason, rely on the broken, incomplete binary data currently printed might be impacted. For users doing a hard diff of the command output, the output will now change to reflect the actual fguid, and their expected output might need updating. The fguid is now also supplied in the json output for 'id-ctrl', which might affect programs parsing the json object.
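For example, a JSON consumer written against the old package should treat the key as optional, since "fguid" only appears once the fixed package is installed (a hypothetical defensive sketch):

```python
import json
import subprocess

# Parse 'id-ctrl' json output; don't assume the "fguid" key exists,
# because the unpatched 1.16 package omits it entirely.
info = json.loads(subprocess.check_output(
    ["nvme", "id-ctrl", "-o", "json", "/dev/nvme0n1"]))
print("fguid:", info.get("fguid", "<not reported>"))
```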

There are no workarounds, and if a regression were to occur, it would only affect the 'id-ctrl' subcommand, and not change anything else.

[Other info]

Upstream bug:
https://github.com/linux-nvme/nvme-cli/issues/1653

This was fixed in the below commit in version 2.2, found in mantic and later:

commit 78b7ad235507ddd59c75c7fcc74fc6c927811f87
From: Pierre Labat <email address hidden>
Date: Fri, 26 Aug 2022 17:02:08 -0500
Subject: nvme-print: Print fguid as a UUID
Link: https://github.com/linux-nvme/nvme-cli/commit/78b7ad235507ddd59c75c7fcc74fc6c927811f87

The commit required a minor backport. In later versions, a major refactor occurred that changed nvme_uuid_to_string() among numerous other functions, and that refactor is not appropriate to backport. Instead, we just take the current implementation of nvme_uuid_to_string() and move it as the patch suggests, so json output works correctly.
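For illustration, the rendering nvme_uuid_to_string() produces is equivalent to this Python sketch (the helper name is mine; the real implementation is C inside nvme-cli):

```python
# Render a 16-byte FGUID in the canonical 8-4-4-4-12 UUID form that
# the patched 'id-ctrl' prints.
def fguid_to_string(raw: bytes) -> str:
    h = raw.hex()
    return "-".join([h[:8], h[8:12], h[12:16], h[16:20], h[20:]])

print(fguid_to_string(bytes(16)))
# 00000000-0000-0000-0000-000000000000
```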

maasuser1 (maasuser1)
description: updated
Javier Fuentes (javier-fs) wrote :

Hi maasuser,

Can you run the following commands and post the output?

> nvme id-ctrl $disk

> nvme id-ns $disk

maasuser1 (maasuser1) wrote :

Hi Javier,

Where and when should I run these commands? Thanks!

Javier Fuentes (javier-fs) wrote :

You can run them at any time on the Supermicro SYS-6019P-WT.

maasuser1 (maasuser1) wrote :

Hi Javier,

Before proceeding further, I realised I had hit a similar scenario: https://discourse.maas.io/t/can-maas-be-used-to-deploy-ubuntu-in-uefi-boot-mode/7594/12 .

The deployment/wipe issue also happened after I upgraded both BMC and BIOS firmware on Supermicro servers.

Now, I changed the settings as described below
- `Boot mode select` in Supermicro BIOS: from `Dual` to `UEFI Only`
- Boot devices order in Supermicro BIOS: `UEFI Network` -> 1, `UEFI Harddrive` -> 2
- `Power boot type` on MAAS machine setting: from `Automatic` to `EFI boot`.

Now the machine deployment works fine on these Supermicro nodes, but disk erasing during machine release still failed. I will get back to you later.

maasuser1 (maasuser1) wrote :

```
ubuntu@dp1:~$ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid : 0x1344
ssvid : 0x1344
sn : 21393289822E
mn : Micron_7400_MTFDKBA960TDZ
fr : E1MU23BC
rab : 3
ieee : 00a075
cmic : 0
mdts : 10
cntlid : 0
ver : 0x10400
rtd3r : 0x1e8480
rtd3e : 0x1e8480
oaes : 0x300
ctratt : 0x80
rrls : 0
cntrltype : 1
fguid : 2.
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 1
vwci : 255
mec : 3
oacs : 0x5e
acl : 3
aerl : 7
frmw : 0x17
lpa : 0x1e
elpe : 255
npss : 4
avscc : 0x1
apsta : 0
wctemp : 343
cctemp : 358
mtfa : 60
hmpre : 0
hmmin : 0
tnvmcap : 960197124096
unvmcap : 0
rpmbs : 0
edstt : 2
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
mntmt : 343
mxtmt : 358
sanicap : 0xa0000003
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 512
domainid : 0
megcap : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 128
oncs : 0x5f
fuses : 0x1
fna : 0x4
vwc : 0x6
awun : 15
awupf : 15
icsvscc : 1
nwpc : 0
acwu : 15
ocfs : 0
sgls : 0
mnan : 0
maxdna : 0
maxcna : 0
subnqn : nqn.2016-08.com.micron:nvme:nvm-subsystem-sn-21393289822E
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
msdbd : 0
ofcs : 0
ps 0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 1 : mp:7.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 2 : mp:7.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 3 : mp:7.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 4 : mp:5.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
```

```
ubuntu@dp1:~$ sudo nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
nsze : 0x6fc81ab0
ncap : 0x6fc81ab0
nuse : 0x256c358
nsfeat : 0x16
nlbaf : 1
flbas : 0
mc : 0
dpc : 0
dps : 0
nmic : 0
rescap : 0
fpi : 0x80
dlfeat : 9
nawun : 127
nawupf : 127
nacwu : 127
nabsn : 0
nabo : 0
nabspf : 0
noiob : 0
nvmcap : 960197124096
npwg : 7
npwa : 7
npdg : 7
npda : 7
nows : 7
mssrl : 0
mcl : 0
msrc : 0
anagrpid: 0
nsattr : 0
nvmsetid: 0
endgid : 0
nguid : 000000000000000100a075213289822e
eui64 : 00a075013289822e
lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
lbaf 1 : ms:0 lbads:12 rp:0
```

maasuser1 (maasuser1) wrote :

On the same hardware, I deployed Ubuntu 20.04 and then performed a release; the disk was erased successfully.

```
2024-02-01T22:17:40+00:00 dp1 cloud-init[2556]: Setting up nvme-cli (1.9-1ubuntu0.1) ...
2024-02-01T22:17:40+00:00 dp1 cloud-init[2556]: Processing triggers for man-db (2.9.1-1) ...
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: sda, nvme0n1 to be wiped.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: sda: starting quick wipe.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: sda: successfully quickly wiped.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: nvme0n1: starting quick wipe.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: nvme0n1: successfully quickly wiped.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: All disks have been successfully wiped.
```

It may be an issue with `nvme-cli` + `maas-wipe` + NVMe SSD running on Ubuntu 22.04.

maasuser1 (maasuser1) wrote :

I hooked [maas-wipe](https://github.com/maas/maas/blob/deab73792a4fe839a2e84a926a6c728d510fc9ad/src/metadataserver/builtin_scripts/release_scripts/maas_wipe.py#L89) and got the following output.

There are two invalid characters in the "fguid" value.

https://hexed.it/#base64:output.txt;Z3VpZCAgICAgOiAyiYIu

If I run `sudo nvme id-ctrl /dev/nvme0n1` manually, it shows `fguid : 2.` (those two characters are invisible), but the invalid characters are still there if you capture the output to a file and then view it in hex mode.
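A small helper to make those bytes visible without a hex editor (hypothetical; run with root privileges):

```python
# Print the raw fguid line from 'nvme id-ctrl' in hex so the two
# invisible bytes (0x89, 0x82) show up.
import subprocess

raw = subprocess.check_output(["nvme", "id-ctrl", "/dev/nvme0n1"])
for line in raw.splitlines():
    if line.startswith(b"fguid"):
        print(line.hex(" "))
```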

maasuser1 (maasuser1) wrote :

According to the [NVMe Base Specification](https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf), the FRU Globally Unique Identifier (FGUID) is a 128-bit value.

But `nvme-cli` (the installed version is v1.16, the latest available in the Ubuntu 22.04 repo; on GitHub the latest release is [v2.7.1](https://github.com/linux-nvme/nvme-cli/releases)) in this case returns only 4 bytes of binary, `0x32,0x89,0x82,0x2E`, and shows them as a string instead of a hex stream, which directly causes the problem. MAAS did not handle the decode failure on [line 97](https://github.com/maas/maas/blob/deab73792a4fe839a2e84a926a6c728d510fc9ad/src/metadataserver/builtin_scripts/release_scripts/maas_wipe.py#L97), and the machine eventually entered the `Failed disk erasing` state.

More details about the disk:
```
ubuntu@dp1:~$ sudo fwupdmgr get-devices
SYS-6019P-WT

└─Micron 7400 MTFDKBA960TDZ:
      Device ID: 71b677ca0f1bc2c5b804fa1d59e52064ce589293
      Summary: NVM Express solid state drive
      Current version: E1MU23BC
      Vendor: Micron Technology Inc (NVME:0x1344)
      Serial Number: 21393289822E
      GUIDs: 875703a0-d8b5-557b-8523-a6901d424cfe ← NVME\VEN_1344&DEV_51C0&SUBSYS_13442100&REV_02
                          07bf5728-ae3e-581e-96cb-c785e020e4cd ← NVME\VEN_1344&DEV_51C0&SUBSYS_13442100
                          494cf0e1-dca2-5d7d-a865-fa2e25f1bd7a ← NVME\VEN_1344&DEV_51C0&REV_02
                          a026312f-a9f6-5a37-89ca-597a1ec280d3 ← NVME\VEN_1344&DEV_51C0
                          a54d1f62-3d9d-50c0-bf9a-0074f98ae378 ← Micron_7400_MTFDKBA960TDZ
      Device Flags: • Internal device
                          • Updatable
                          • System requires external power source
                          • Needs a reboot after installation
                          • Device is usable for the duration of the update
```

maasuser1 (maasuser1) wrote :

I upgraded the firmware of the SSD from `E1MU23BC` to `E1MU23Y5`, but the problem still exists.

maasuser1 (maasuser1) wrote :

On Ubuntu 20.04, `nvme id-ctrl /dev/nvme0n1` does not return `fguid`, so it does not cause any problems.

Jerzy Husakowski (jhusakowski) wrote :

Thanks maasuser1 for the analysis.

The nmve-cli issue https://github.com/linux-nvme/nvme-cli/issues/1653 has been fixed in https://github.com/linux-nvme/nvme-cli/commit/78b7ad235507ddd59c75c7fcc74fc6c927811f87 and released in https://github.com/linux-nvme/nvme-cli/releases/tag/v2.2

jammy uses nvme-cli version 1.16; mantic has a later version (2.5). I'm not sure if the bugfix can be backported to jammy, so the nvme-cli package was added as affected for further triage.

Björn Tillenius (bjornt) wrote :

We could fix MAAS to not try to decode the output using utf-8.
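A minimal sketch of that option (hypothetical; not the fix that was ultimately shipped in nvme-cli):

```python
import subprocess

# Decode defensively so raw fguid bytes cannot crash maas_wipe.py;
# invalid utf-8 sequences are replaced instead of raising.
output = subprocess.check_output(["nvme", "id-ctrl", "/dev/nvme0n1"])
text = output.decode("utf-8", errors="replace")
```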

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.x
importance: Medium → High
Grant Byrne (grantbyrne) wrote :

We have also experienced the same issue as reported here: https://bugs.launchpad.net/maas/+bug/2065820

We would like to see MAAS updated to handle this. Our code fix is over in the other ticket.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvme-cli (Ubuntu):
status: New → Confirmed
summary: - Failed to wipe Micron 7400 MTFDKBA960TDZ during machine release
+ nvme-cli: fguid is printed as binary data and causes MAAS to fail
+ erasing NVME disks
description: updated
tags: added: sts
description: updated
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for nvme-cli on jammy which fixes this issue.

Changed in nvme-cli (Ubuntu Jammy):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
Changed in nvme-cli (Ubuntu):
status: Confirmed → Fix Released
description: updated
Matthew Ruffell (mruffell) wrote :

Attached is a V2 that corrects a minor omission in the changelog.

Heitor Alves de Siqueira (halves) wrote :

Thanks for the fix, Matthew! I agree with your backport, keeping nvme_uuid_to_string() as is seems to be the right approach.
Given Focal does not seem to support the 'fguid' field, I've sponsored this for Jammy only.

Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello maasuser1, or anyone else affected,

Accepted nvme-cli into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nvme-cli/1.16-3ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in nvme-cli (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Matthew Ruffell (mruffell) wrote :

Performing verification for jammy.

I started a fresh jammy VM that had an attached 50 GB NVMe device:

$ lsblk
nvme0n1 259:0 0 46.6G 0 disk
nvme1n1 259:1 0 8G 0 disk
├─nvme1n1p1 259:2 0 7.9G 0 part /
├─nvme1n1p14 259:3 0 4M 0 part
└─nvme1n1p15 259:4 0 106M 0 part /boot/efi

I installed nvme-cli 1.16-3ubuntu0.1 from -updates, and ran:

$ sudo nvme id-ctrl /dev/nvme0n1 | grep fguid
fguid :

$ sudo nvme id-ctrl -o json /dev/nvme0n1 | grep fguid
$

The fguid on this VM is set to all zeros, so the first byte is being interpreted as a null terminator and we get no output for this field.

I then enabled -proposed and installed nvme-cli 1.16-3ubuntu0.2, and re-ran the following:

$ sudo nvme id-ctrl /dev/nvme0n1 | grep fguid
fguid : 00000000-0000-0000-0000-000000000000

$ sudo nvme id-ctrl -o json /dev/nvme0n1 | grep fguid
  "fguid" : "00000000-0000-0000-0000-000000000000",

Now we are actually getting correct output, and see the actual UUID value being printed.

I also ran the following Python script, which more or less does what maas_wipe.py does:

import subprocess

# Capture 'nvme id-ctrl' output and decode it as utf-8, the same way
# maas_wipe.py does; raw fguid bytes previously made this decode raise.
output = subprocess.check_output(["nvme", "id-ctrl", "/dev/nvme0n1"])
output = output.decode()

It completes successfully under 1.16-3ubuntu0.2, so I think MAAS is good to go now.

The package in -proposed fixes the problem. Happy to mark verified for jammy.

tags: added: verification-done-jammy
removed: verification-needed verification-needed-jammy
Jerzy Husakowski (jhusakowski) wrote :

When nvme-cli 1.16-3ubuntu0.2 makes it to jammy-updates (which currently carries 1.16-3ubuntu0.1), we will mark the MAAS issue as resolved.

Changed in maas-images:
status: New → Triaged
importance: Undecided → Medium
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nvme-cli - 1.16-3ubuntu0.2

---------------
nvme-cli (1.16-3ubuntu0.2) jammy; urgency=medium

  * Fix 'id-ctrl' command to correctly output the value of fguid, as a proper
    UUID instead of binary data. This corrects an issue in MAAS where it
    cannot parse the results of 'id-ctrl' due to the binary data not being
    utf-8 formatted. (LP: #2051299)
    - d/p/lp2051299-nvme-print-Print-fguid-as-a-UUID.patch

 -- Matthew Ruffell <email address hidden> Thu, 13 Jun 2024 15:21:49 +1200

Changed in nvme-cli (Ubuntu Jammy):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for nvme-cli has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in maas:
status: Triaged → Fix Released