nvme-cli: fguid is printed as binary data and causes MAAS to fail erasing NVME disks

Bug #2051299 reported by maasuser1
This bug affects 2 people
Affects             Status        Importance  Assigned to       Milestone
MAAS                Triaged       High        Unassigned
nvme-cli (Ubuntu)   Fix Released  Undecided   Unassigned
Jammy               In Progress   Medium      Matthew Ruffell

Bug Description

[Impact]

When a user tries to release a MAAS-deployed system that has the "erase disks on release" option set, erasing NVMe disks fails on Jammy.

Traceback (most recent call last):
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 542, in <module>
    main()
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 522, in main
    disk_info = get_disk_info()
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 165, in get_disk_info
    return {kname: get_disk_security_info(kname) for kname in list_disks()}
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 165, in <dictcomp>
    return {kname: get_disk_security_info(kname) for kname in list_disks()}
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 158, in get_disk_security_info
    return get_nvme_security_info(disk)
  File "/tmp/user_data.sh.jNE4lC/bin/maas-wipe", line 64, in get_nvme_security_info
    output = output.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 385: invalid start byte

This is due to maas_wipe.py running "nvme id-ctrl <device>" and parsing the result. This output should be human-readable string data, so utf-8 decoding is appropriate for MAAS to use.

Instead, the "fguid" field is being printed as binary data, and is not parsable as utf-8.

For example, from comment #8, the user sees:

`fguid : 2.`

On closer inspection, the hex is:

0x32,0x89,0x82,0x2E

Note that the output is cut off early, likely because the next byte is 0x00 and is being interpreted as a string terminator.
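
The mis-decode is easy to reproduce in isolation with those bytes (a minimal sketch; the byte values are the ones from the hex above):

```python
# The raw fguid bytes observed on this device: 0x89 is not a valid UTF-8
# start byte, so a strict decode raises, matching the maas-wipe traceback.
raw = b"fguid     : 2\x89\x82."

try:
    raw.decode()  # maas_wipe.py calls output.decode(), strict utf-8 by default
except UnicodeDecodeError as exc:
    print(f"decode failed: byte {exc.object[exc.start]:#x} at position {exc.start}")
```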

Fix nvme-cli such that we print out the fguid as a correct utf-8 string, so MAAS works as intended.

[Testcase]

Deploy Jammy onto a system that has an NVMe device.

$ sudo apt install nvme-cli

Run the 'id-ctrl' command and look at the fguid entry:

$ sudo nvme id-ctrl /dev/nvme1n1 | grep fguid
fguid :

Because this drive's UUID is all zeros, the first byte was interpreted as a string terminator, and the UUID was not printed at all.

There is a test package available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf387274-test

If you install the test package, the fguid will be printed as a proper string:

$ sudo nvme id-ctrl /dev/nvme1n1 | grep fguid
fguid : 00000000-0000-0000-0000-000000000000

Also check that json output works as expected:

$ sudo nvme id-ctrl -o json /dev/nvme1n1 | grep fguid
  "fguid" : "00000000-0000-0000-0000-000000000000",

Additionally, also test that the new package allows a MAAS deployed system to
be released correctly with the erase option enabled, as maas_wipe.py should now
complete successfully.

[Where problems could occur]

We are changing the output of the 'id-ctrl' subcommand; no other subcommands are changed. Users who for some reason rely on the broken, incomplete binary data currently printed might be impacted. For users doing a hard diff of the command output, the output will now reflect the actual fguid, and their diffs may need updating. The fguid is now also supplied in the json output of 'id-ctrl', which might affect programs parsing the json object.

There are no workarounds, and if a regression were to occur, it would only affect the 'id-ctrl' subcommand, and not change anything else.

[Other info]

Upstream bug:
https://github.com/linux-nvme/nvme-cli/issues/1653

This was fixed in the below commit in version 2.2, found in mantic and later:

commit 78b7ad235507ddd59c75c7fcc74fc6c927811f87
From: Pierre Labat <email address hidden>
Date: Fri, 26 Aug 2022 17:02:08 -0500
Subject: nvme-print: Print fguid as a UUID
Link: https://github.com/linux-nvme/nvme-cli/commit/78b7ad235507ddd59c75c7fcc74fc6c927811f87

The commit required a minor backport. In later versions, a major refactor changed nvme_uuid_to_string() among numerous other functions, and that refactor is not appropriate to backport. Instead, we keep the current implementation of nvme_uuid_to_string() and move it as the patch suggests, so that json output works correctly.

Tags: sts
maasuser1 (maasuser1)
description: updated
Revision history for this message
Javier Fuentes (javier-fs) wrote :

Hi maasuser,

Can you run the next commands and post the output?

> nvme id-ctrl $disk

> nvme id-ns $disk

Revision history for this message
maasuser1 (maasuser1) wrote :

Hi Javier,

Where and when to run these commands? Thanks!

Revision history for this message
Javier Fuentes (javier-fs) wrote :

You can run them at anytime in the Supermicro SYS-6019P-WT.

Revision history for this message
maasuser1 (maasuser1) wrote :

Hi Javier,

Before proceeding further, I realised I had encountered a similar scenario in https://discourse.maas.io/t/can-maas-be-used-to-deploy-ubuntu-in-uefi-boot-mode/7594/12 .

The deployment/wipe issue also happened after I upgraded both BMC and BIOS firmware on Supermicro servers.

Now, I changed the settings as described below
- `Boot mode select` in Supermicro BIOS: from `Dual` to `UEFI Only`
- Boot devices order in Supermicro BIOS: `UEFI Network` -> 1, `UEFI Harddrive` -> 2
- `Power boot type` on MAAS machine setting: from `Automatic` to `EFI boot`.

Now machine deployment works fine on these Supermicro nodes, but disk erasing during machine release still fails. I will get back to you later.

Revision history for this message
maasuser1 (maasuser1) wrote :

```
ubuntu@dp1:~$ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid : 0x1344
ssvid : 0x1344
sn : 21393289822E
mn : Micron_7400_MTFDKBA960TDZ
fr : E1MU23BC
rab : 3
ieee : 00a075
cmic : 0
mdts : 10
cntlid : 0
ver : 0x10400
rtd3r : 0x1e8480
rtd3e : 0x1e8480
oaes : 0x300
ctratt : 0x80
rrls : 0
cntrltype : 1
fguid : 2.
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 1
vwci : 255
mec : 3
oacs : 0x5e
acl : 3
aerl : 7
frmw : 0x17
lpa : 0x1e
elpe : 255
npss : 4
avscc : 0x1
apsta : 0
wctemp : 343
cctemp : 358
mtfa : 60
hmpre : 0
hmmin : 0
tnvmcap : 960197124096
unvmcap : 0
rpmbs : 0
edstt : 2
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
mntmt : 343
mxtmt : 358
sanicap : 0xa0000003
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 512
domainid : 0
megcap : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 128
oncs : 0x5f
fuses : 0x1
fna : 0x4
vwc : 0x6
awun : 15
awupf : 15
icsvscc : 1
nwpc : 0
acwu : 15
ocfs : 0
sgls : 0
mnan : 0
maxdna : 0
maxcna : 0
subnqn : nqn.2016-08.com.micron:nvme:nvm-subsystem-sn-21393289822E
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
msdbd : 0
ofcs : 0
ps 0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 1 : mp:7.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 2 : mp:7.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 3 : mp:7.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
ps 4 : mp:5.50W operational enlat:10 exlat:10 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:3.10W active_power:-
```

```
ubuntu@dp1:~$ sudo nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
nsze : 0x6fc81ab0
ncap : 0x6fc81ab0
nuse : 0x256c358
nsfeat : 0x16
nlbaf : 1
flbas : 0
mc : 0
dpc : 0
dps : 0
nmic : 0
rescap : 0
fpi : 0x80
dlfeat : 9
nawun : 127
nawupf : 127
nacwu : 127
nabsn : 0
nabo : 0
nabspf : 0
noiob : 0
nvmcap : 960197124096
npwg : 7
npwa : 7
npdg : 7
npda : 7
nows : 7
mssrl : 0
mcl : 0
msrc : 0
anagrpid: 0
nsattr : 0
nvmsetid: 0
endgid : 0
nguid : 000000000000000100a075213289822e
eui64 : 00a075013289822e
lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
lbaf 1 : ms:0 lbads:12 rp:0
```

Revision history for this message
maasuser1 (maasuser1) wrote :

On the same hardware, I deployed Ubuntu 20.04 and then released it; the disk was erased successfully.

```
2024-02-01T22:17:40+00:00 dp1 cloud-init[2556]: Setting up nvme-cli (1.9-1ubuntu0.1) ...
2024-02-01T22:17:40+00:00 dp1 cloud-init[2556]: Processing triggers for man-db (2.9.1-1) ...
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: sda, nvme0n1 to be wiped.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: sda: starting quick wipe.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: sda: successfully quickly wiped.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: nvme0n1: starting quick wipe.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: nvme0n1: successfully quickly wiped.
2024-02-01T22:17:43+00:00 dp1 cloud-init[2556]: All disks have been successfully wiped.
```

It may be an issue with `nvme-cli` + `maas-wipe` + NVMe SSD running on Ubuntu 22.04.

Revision history for this message
maasuser1 (maasuser1) wrote :

I hooked maas-wipe and captured the following [output](https://github.com/maas/maas/blob/deab73792a4fe839a2e84a926a6c728d510fc9ad/src/metadataserver/builtin_scripts/release_scripts/maas_wipe.py#L89).

There are two invalid characters in the "fguid" value.

https://hexed.it/#base64:output.txt;Z3VpZCAgICAgOiAyiYIu

If I run `sudo nvme id-ctrl /dev/nvme0n1` manually, it shows `fguid : 2.` (the two characters are invisible), but the invalid characters are still there if you capture the output to a file and view it in hex mode.

Revision history for this message
maasuser1 (maasuser1) wrote :

According to the [NVMe Base Specification](https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf), the FRU Globally Unique Identifier (FGUID) is a 128-bit value.

But `nvme-cli` (the installed version is v1.16, which is the latest available in the Ubuntu 22.04 repo; the latest upstream release on GitHub is [v2.7.1](https://github.com/linux-nvme/nvme-cli/releases)) in this case only returns 4 bytes of binary, `0x32,0x89,0x82,0x2E`, and shows them as a string instead of a hex stream, which directly causes the problem. MAAS did not handle the decode failure on [line 97](https://github.com/maas/maas/blob/deab73792a4fe839a2e84a926a6c728d510fc9ad/src/metadataserver/builtin_scripts/release_scripts/maas_wipe.py#L97), and eventually entered the `Failed disk erasing` state.

More details about the disk:
```
ubuntu@dp1:~$ sudo fwupdmgr get-devices
SYS-6019P-WT

└─Micron 7400 MTFDKBA960TDZ:
      Device ID: 71b677ca0f1bc2c5b804fa1d59e52064ce589293
      Summary: NVM Express solid state drive
      Current version: E1MU23BC
      Vendor: Micron Technology Inc (NVME:0x1344)
      Serial Number: 21393289822E
      GUIDs: 875703a0-d8b5-557b-8523-a6901d424cfe ← NVME\VEN_1344&DEV_51C0&SUBSYS_13442100&REV_02
                          07bf5728-ae3e-581e-96cb-c785e020e4cd ← NVME\VEN_1344&DEV_51C0&SUBSYS_13442100
                          494cf0e1-dca2-5d7d-a865-fa2e25f1bd7a ← NVME\VEN_1344&DEV_51C0&REV_02
                          a026312f-a9f6-5a37-89ca-597a1ec280d3 ← NVME\VEN_1344&DEV_51C0
                          a54d1f62-3d9d-50c0-bf9a-0074f98ae378 ← Micron_7400_MTFDKBA960TDZ
      Device Flags: • Internal device
                          • Updatable
                          • System requires external power source
                          • Needs a reboot after installation
                          • Device is usable for the duration of the update
```

Revision history for this message
maasuser1 (maasuser1) wrote :

I upgraded the firmware of the SSD from `E1MU23BC` to `E1MU23Y5`, but the problem still exists.

Revision history for this message
maasuser1 (maasuser1) wrote :

On Ubuntu 20.04, `nvme id-ctrl /dev/nvme0n1` does not return `fguid`, so it does not cause any problems.

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Thanks maasuser1 for the analysis.

The nvme-cli issue https://github.com/linux-nvme/nvme-cli/issues/1653 has been fixed in https://github.com/linux-nvme/nvme-cli/commit/78b7ad235507ddd59c75c7fcc74fc6c927811f87 and released in https://github.com/linux-nvme/nvme-cli/releases/tag/v2.2

Jammy uses nvme-cli version 1.16; mantic has a later version (2.5). I'm not sure if the bugfix can be backported to Jammy, so the nvme-cli package was added as affected for further triage.

Revision history for this message
Björn Tillenius (bjornt) wrote :

We could fix MAAS to not try to decode the output using utf-8.
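
A sketch of that approach (illustrative, not the actual MAAS change — the helper name is hypothetical): decode leniently so a single binary field cannot abort the whole wipe.

```python
def decode_nvme_output(raw: bytes) -> str:
    """Decode nvme-cli output leniently: invalid bytes become U+FFFD
    instead of raising UnicodeDecodeError and aborting the wipe."""
    return raw.decode("utf-8", errors="replace")

# The problematic fguid bytes from this bug no longer raise:
print(decode_nvme_output(b"fguid     : 2\x89\x82."))
```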

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.x
importance: Medium → High
Revision history for this message
Grant Byrne (grantbyrne) wrote :

We have also experienced the same issue as reported here: https://bugs.launchpad.net/maas/+bug/2065820

Would like to see MAAS updated to support this. Our code fix is over in the other ticket.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvme-cli (Ubuntu):
status: New → Confirmed
summary: - Failed to wipe Micron 7400 MTFDKBA960TDZ during machine release
+ nvme-cli: fguid is printed as binary data and causes MAAS to fail
+ erasing NVME disks
description: updated
tags: added: sts
description: updated
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for nvme-cli on jammy which fixes this issue.

Changed in nvme-cli (Ubuntu Jammy):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
Changed in nvme-cli (Ubuntu):
status: Confirmed → Fix Released
description: updated
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a V2 that corrects a minor omission in the changelog.

Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

Thanks for the fix, Matthew! I agree with your backport, keeping nvme_uuid_to_string() as is seems to be the right approach.
Given Focal does not seem to support the 'fguid' field, I've sponsored this for Jammy only.

