Support for NVME secure erase

Bug #1835954 reported by Gabriel Ramirez on 2019-07-09
This bug affects 4 people
Affects: MAAS
- Milestone 2.6: Importance High, assigned to Guilherme G. Piccoli
- Milestone 2.7: Importance High, assigned to Lee Trager
- Milestone 2.8: Importance High, assigned to Lee Trager

Bug Description

Doing a secure erase on NVMe drives results in falling back to zeroing out the drive, because the secure_erase() function uses hdparm, which doesn't support NVMe devices:

https://github.com/maas/maas/blob/cf12161dd8bd575422cbc2dec2524a1efcf31d6e/src/metadataserver/user_data/templates/snippets/maas_wipe.py#L145

Tags: sts


Changed in maas:
milestone: none → 2.7.0alpha1
Lee Trager (ltrager) on 2019-07-09
Changed in maas:
status: New → Triaged
Gabriel Ramirez (gabriel1109) wrote :

## This is in relation to customer case 00234532 ##

The case concerns the "release" API call (releasing a device with secure erase), whose current behavior does not fit our requirements.

We have an issue with the fallback from secure erase to zero fill.
Our expectation is that secure erase is mandatory: if it cannot be done, the device should be marked as broken
or the release call should fail.

An additional reason is that zero-filling an SSD/NVMe drive may not actually wipe the data
(the controller may cache the writes and simply mark blocks as free to avoid wear and tear).

In addition, I suspect the current wipe script fails to properly verify that secure erase was performed:
- written data is not random
- write buffers are not flushed
- data read after the hdparm ioctl secure erase call may come from a cache which is not invalidated (bug?) (echo 3 > /proc/sys/vm/drop_caches helped)
- comparing to all zeros can fail the check for drives that were initially empty
- flushing caches before and after the hdparm commands helped with sporadic failures of the secure wipe
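The points above suggest verifying with a recognizable non-zero marker rather than comparing against zeros, and flushing caches before reading back. A minimal sketch of that idea (the function names and marker are mine, not MAAS code; drop_caches needs root):

```python
import os

MARKER = b'MAAS-WIPE-TEST' * 8  # recognizable non-zero pattern, so an initially
                                # empty (all-zero) drive can't pass by accident

def drop_caches():
    """Invalidate the page cache so the read-back hits the device (needs root)."""
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3\n')

def plant_marker(path, offset=0):
    """Write the marker before the wipe and force it to stable storage."""
    with open(path, 'r+b') as f:
        f.seek(offset)
        f.write(MARKER)
        f.flush()
        os.fsync(f.fileno())

def marker_gone(path, offset=0, flush=None):
    """True if the marker is no longer readable at offset; pass flush=drop_caches."""
    if flush:
        flush()
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(len(MARKER)) != MARKER
```

Planting the marker before the erase and checking it afterwards sidesteps the "drive was already all zeros" false positive.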

Feature request - wishlist:
- ability to replace or augment default drive wipe scripts with custom release scripts
(commissioning/testing scripts way is great, want same thing for release api call)

Biased opinion:
- hdparm did not work for any drive named /dev/nvme*, so those should be ignored by the current hdparm wiping script
(I am aware there are M.2 drives which internally use a SATA interface; not sure whether they support the ioctl hdparm uses, haven't tested)
- for /dev/nvme*, a separate script is needed, using the nvme-cli tool to perform the secure erase.

NVMe snippet:

import subprocess

NVME_LBAF_512b = 0      # LBA format 0 - 512 byte sectors (legacy compatibility?)
NVME_LBAF_4k = 1        # LBA format 1 - 4k sectors
NVME_FORMAT_FORGET = 0  # no secure erase
NVME_FORMAT_USER = 1    # user data erase
NVME_FORMAT_CRYPTO = 2  # cryptographic erase

def wipe(device, ses, lbaf):
    # shell out to nvme-cli's format command
    output = subprocess.check_output(
        'nvme format %s --ses=%d --lbaf=%d' % (device, ses, lbaf), shell=True)

# this would wipe just one block device (NVMe namespace) with the crypto method (replace encryption key)
wipe("/dev/nvme0n1", NVME_FORMAT_CRYPTO, NVME_LBAF_512b)
# and this would wipe the user data on the drive
wipe("/dev/nvme0n1", NVME_FORMAT_USER, NVME_LBAF_512b)

NOTE: some older Intel NVMe drives have a firmware limit of 100 total user-data and/or crypto secure wipe cycles. Only SES=0 is supported an unlimited number of times (but it is less secure).
https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/ssd-technical-advisory.pdf

Igor Gnip (igorgnip) wrote :

Can we please have this flagged as a security issue, since the current implementation silently falls back to zero-filling when secure wipe cannot be performed (a security/privacy risk)?

Dan Streetman (ddstreet) wrote :

> Can we please have this flagged as security issue

I subscribed ubuntu-security as a FYI to their team.

Blake Rouse (blake-rouse) wrote :

Unless you selected quick erase, MAAS will zero the entire drive. That is not a security issue; a wipe of all zeros to an entire drive is a secure erase.

tags: added: sts
Changed in maas:
importance: Undecided → High
Changed in maas:
milestone: 2.7.0b1 → 2.7.0b2
Changed in maas:
milestone: 2.7.0b2 → none
Changed in maas:
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
status: Triaged → In Progress
Guilherme G. Piccoli (gpiccoli) wrote :

I've managed to reproduce the behavior in my MAAS setup with a VM using an NVMe device. It was delayed a bit by LP bug #1873662 (SeaBIOS seems unable to iPXE-local-boot a guest with an NVMe device); eventually I got the guest setup working with OVMF.

Guilherme G. Piccoli (gpiccoli) wrote :

Following the NVMe spec 1.2.1, we must use id-ctrl to gather some drive info:

(a) In id-ctrl, the field Optional Admin Command Support (oacs - bytes 257:256) provides optional capabilities, like support for the Format NVM command and namespace management, as per the spec:

Bit 3 if set to ‘1’ then the controller supports the Namespace Management and Namespace Attachment commands. If cleared to ‘0’ then the controller does not support the Namespace Management and Namespace Attachment commands.

Bit 2 if set to ‘1’ then the controller supports the Firmware Commit and Firmware Image Download commands.

Bit 1 if set to ‘1’ then the controller supports the Format NVM command. If cleared to ‘0’ then the controller does not support the Format NVM command.

Bit 0 if set to ‘1’ then the controller supports the Security Send and Security Receive commands.

(b) In id-ctrl, the field "Number of Namespaces" (nn - bytes 519:516 in the struct) shows the number of namespaces currently set in the adapter; important to keep note of this.

(c) In id-ctrl, the field "Format NVM Attributes" (fna - byte 524 in the struct) determines the secure erase/format capabilities.
From spec 1.2.1:

Bit 2 indicates whether cryptographic erase is supported as part of the secure erase functionality. If set to ‘1’, then cryptographic erase is supported. If cleared to ‘0’, then cryptographic erase is not supported.

Bit 1 indicates whether cryptographic erase and user data erase functionality apply to all namespaces or is specific to a particular namespace. If set to’1’, then a cryptographic erase of a particular namespace as part of a format results in a cryptographic erase of all namespaces, and a user data erase of a particular namespace as part of a format results in a user data erase of all namespaces. If cleared to ‘0’, then a cryptographic erase or user data erase as part of a format is performed on a per namespace basis.

Bit 0 indicates whether the format operation applies to all namespaces or is specific
to a particular namespace.

(d) In id-ns, the field "Formatted LBA Size" (flbas - byte 26 in the struct) determines the LBA data/metadata size the namespace has been formatted with.
From the spec:

Bit 4 if set to ‘1’ indicates that the metadata is transferred at the end of the data LBA, creating an extended data LBA. Bit 4 if cleared to ‘0’ indicates that all of the metadata for a command is transferred as a separate contiguous buffer of data. Bit 4 is not applicable when there is no metadata.

Bits 3:0 indicates one of the 16 supported LBA Formats indicated in this data structure.

(e) In id-ns, there's a "sub-table" with LBA and MS (Metadata Size) information, example:

lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
lbaf 1 : ms:8 lbads:9 rp:0x2
lbaf 2 : ms:16 lbads:9 rp:0x2
lbaf 3 : ms:0 lbads:12 rp:0
lbaf 4 : ms:8 lbads:12 rp:0
lbaf 5 : ms:64 lbads:12 rp:0
lbaf 6 : ms:128 lbads:12 rp:0

So, the algorithm outline would be something like this:

(1) Check id-ctrl for "oacs" - if bit 1 is not set, ABORT.

(2a) Check id-ctrl for oacs bit 3 (namespace management support).
(2b) Check id-ns for "nn".
(2c) Check id-ctrl for "fna" bit 1 ( *per-namespace* secure erasing support ).
-> If "fna" bit 1 is set *and* ("oacs" bit 3 is set *and* "nn" > 1), ABORT
(unsafe, risking to erase all user's namespaces - see [0] below).

(3) Check "fna" bit 2: if set, we're going to use crypto erase, faster and less
degrading to device; if not set, we're going to use regular user erase.

(4) Check id-ns "flbas" to determine the previously used LBA setting.

(5) Execute "nvme format" command with the previously gathered "--ses" option (2 to
crypto erase or 1 to regular user erase), "--lbaf" option (from id-ns flbas) and
with a timeout to be determined (see [1] below).

(6) If any step (1)-(5) fails, fallback to zeroing the nvme device.
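The capability checks in this outline can be sketched in Python against `nvme id-ctrl` output. The parsing assumes nvme-cli's default human-readable `field : value` layout; treat this as illustrative, not MAAS code:

```python
def parse_id_ctrl(output):
    """Parse `nvme id-ctrl /dev/nvmeX` text output into a {field: int} dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, val = line.partition(':')
        if sep:
            try:
                # int(..., 0) handles both '0x17' hex and plain decimal values
                fields[key.strip()] = int(val.split()[0], 0)
            except (ValueError, IndexError):
                pass  # skip lines that aren't simple field: value pairs
    return fields

def pick_format_ses(fields):
    """Return the `nvme format --ses` value to use, or None to abort (step 1)."""
    if not fields.get('oacs', 0) & (1 << 1):   # oacs bit 1: Format NVM supported?
        return None                            # ABORT: use another erase method
    if fields.get('fna', 0) & (1 << 2):        # fna bit 2: crypto erase supported
        return 2                               # --ses=2: cryptographic erase
    return 1                                   # --ses=1: regular user data erase
```

A caller would feed it `subprocess.check_output(['nvme', 'id-ctrl', dev], text=True)` and fall back to zeroing when it returns None.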

Igor Gnip (igorgnip) wrote :

Hello Guilherme,

I agree with your assessment and proposed algorithm outline.

However, as long as MAAS silently falls back to zero-filling - that will not be something we can use for our use-case:

Issue #1 : Silent fallback to zero filling does not fit our use case
Issue #2 : NVMe not supported properly.

This will solve Issue #2 which is good but Issue #1 will remain unresolved.

A good compromise would be if the NVMe erase script is implemented as MAAS builtin destructive storage testing script. In such case, I would expect from script to fail on error and not perform any automatic fallback to zero-filling.

Regards,
Igor

Guilherme G. Piccoli (gpiccoli) wrote :

Igor, thanks a lot for your feedback! I think your idea is good and feasible, so let's fix issue #2 as you called the NVMe secure erase support on MAAS, and then we can have it in a builtin script that fails if secure erase is not working. For the regular use case, I think it is a benefit to fallback to zeroing the device if secure erase fails, it's better (i.e, safer) than do nothing and keep the data there.

Cheers,

Guilherme

Igor Gnip (igorgnip) wrote :

Guilherme,

I included the less-proper/less features nvme erase implementation example script you could use as a starting point. You can find it in the mentioned support case attached.

Cheers,

Igor

Dan Streetman (ddstreet) wrote :

> So, the algorithm outline would be something like this:

To clarify my understanding: the 'ABORT' just means not to use the nvme 'format' command, and to fall back to some other erase method.

Also for reference I'm looking at this nvme spec:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

>
> (1) Check id-ctrl for "oacs" - if bit 1 is not set, ABORT.

yep

>
> (2a) Check id-ctrl for oacs bit 3 (namespace management support).
> (2b) Check id-ns for "nn".

You mean check id-ctrl for nn, but this field doesn't actually refer to the count of namespaces; it refers to the current maximum namespace number. So if you have just a single namespace, but its NSID is 0x10, then this field would show 0x10.

In sec 6.1.3 "Allocated and Unallocated NSID Types" (and a few following sections), it shows how you can have multiple namespaces, but only the 'allocated' and 'active' ones actually are available in the system (and have anything in them).

At least, that's my reading of the spec...

> (2c) Check id-ctrl for "fna" bit 1 ( *per-namespace* secure erasing support ).
> -> If "fna" bit 1 is set *and* ("oacs" bit 3 is set *and* "nn" > 1), ABORT

well, i think it would be more like:

if (oacs(bit3) == 1)
  if (fna(bit1) == 1)
    if (count(nvme list-ns /dev/nvmeX) > 1)
      ABORT

the nvme 'list-ns' command issues the identify command with CNS set to 0x02 (active namespace id list). You can also use --all to send CNS 0x10 (allocated namespace id list), but I think active is probably what we care about here.

See spec section 5.15.1, specifically figure 244.
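The nested check above could be coded roughly as follows. The `[   0]:0x1` line shape is what current nvme-cli prints for `list-ns`, but verify against your version; the function names are mine:

```python
def count_active_namespaces(list_ns_output):
    """Count namespace IDs in `nvme list-ns` output (lines like '[   0]:0x1')."""
    return sum(1 for line in list_ns_output.splitlines() if ':0x' in line)

def per_ns_format_is_safe(oacs, fna, active_ns_count):
    """Dan's nested check: formatting one namespace must not wipe the others.

    oacs bit 3: namespace management supported; fna bit 1: a per-namespace
    format/erase actually applies to ALL namespaces.
    """
    if oacs & (1 << 3) and fna & (1 << 1) and active_ns_count > 1:
        return False  # ABORT: a per-namespace format would erase all namespaces
    return True
```

The count would come from `subprocess.check_output(['nvme', 'list-ns', dev], text=True)`, i.e. the active (CNS 0x02) list as Dan suggests.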

And also, I think this situation - a nvme controller that *does* support namespaces, and *does* support the format command, but *doesn't* support per-namespace formatting - seems *really* unlikely. But yeah, per the spec it's possible.

> (unsafe, risking to erase all user's namespaces - see [0] below).
>
> (3) Check "fna" bit 2: if set, we're going to use crypto erase, faster and less
> degrading to device; if not set, we're going to use regular user erase.

yep.

>
> (4) Check id-ns "flbas" to determine the previously used LBA setting.
>
> (5) Execute "nvme format" command with the previously gathered "--ses" option (2 to
> crypto erase or 1 to regular user erase), "--lbaf" option (from id-ns flbas) and
> with a timeout to be determined (see [1] below).

yep, sounds correct.

>
> (6) If any step (1)-(5) fails, fallback to zeroing the nvme device.

There is also the "sanitize" operation, see spec section 8.15, although this is also optional.

And, if neither format nor sanitize are available, there is also the "write zeros" command, spec section 6.17 (unfortunately, this is *also* optional, bit 3 in the ONCS field...seems like everything is optional in the spec...). If write zeros is supported, it also (again, optionally) supports discard. So the 'nvme write-zeroes' command, if the drive supports it, will likely be much faster than actually writing zero-blocks to the entire drive (and, probably much better for the drive, too). And using the -d param (to deallocate/discard the blocks) should help restore nvme performance by 'freeing up' all the drive's blocks...
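This preference order (format, then Write Zeroes, then plain zero-fill) could be wired up as below. The `runner` parameter is injected for testability; `blkdiscard -z` is used on the assumption that the kernel services its BLKZEROOUT ioctl with the NVMe Write Zeroes command when the drive supports it:

```python
import subprocess

def erase_nvme(device, runner=subprocess.call):
    """Try erase methods in preference order; return the name of the one that worked.

    `runner` behaves like subprocess.call (returns 0 on success) and is
    injectable so the logic can be tested without a real device.
    """
    attempts = [
        ('crypto-erase', ['nvme', 'format', device, '--ses=2']),
        ('user-data-erase', ['nvme', 'format', device, '--ses=1']),
        # blkdiscard -z issues BLKZEROOUT, which the kernel may translate to
        # the NVMe Write Zeroes command if the drive supports it
        ('write-zeroes', ['blkdiscard', '-z', device]),
    ]
    for name, cmd in attempts:
        if runner(cmd) == 0:
            return name
    return None  # caller falls back to dd-style zero fill (or marks the device broken)
```

Trying `--ses=2` before `--ses=1` matches the earlier point that crypto erase is faster and kinder to the flash.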


Dan Streetman (ddstreet) wrote :

Oh, and I should correct myself: "sanitize" might not be fast; the spec seems to indicate it may actually take a long time. Also, I'm not sure whether the nvme command-line program supports sanitize...

Igor Gnip (igorgnip) wrote :

Any basic working implementation would be good for start.
Not sure about multiple namespaces - personally I don't think it's needed for first version.

Also, don't forget there is a related bug in the Linux kernel which gives different device names to namespaces between reboots, so /dev/nvme0n1 might on the next reboot be called /dev/nvme1n1.

Wish I could add the nvme.multipath=0 kernel parameter to MAAS permanently, but CentOS breaks with this parameter.

Related references:
https://github.com/linux-nvme/nvme-cli/issues/455
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1778844/comments/35
http://ubuntu.5.x6.nabble.com/About-nvme-char-block-naming-mismatch-td5186325.html
https://bugs.launchpad.net/curtin/+bug/1830913

Guilherme G. Piccoli (gpiccoli) wrote :

Thanks Igor! The namespace "issue" needs to be taken into account to prevent data loss - imagine, for example, if we secure erase one namespace but, given the device's (lack of) capabilities, it ends up clearing all namespaces, deleting user data not meant to be deleted!

About the Linux kernel "bug" you mentioned, it's not really a bug, but a behavior change that we must cope with - I tried to make it a bit "softer" in the past [0], but it was rejected (and the maintainer explained why it couldn't change). Soon after, other people tried the same approach as mine, and I bet that if you check the list regularly, you'll see more such attempts with a certain cadence hehe.
A lot of people disliked the new naming scheme that causes the char/block device mismatch, but it seems this is something that won't change anymore.

Cheers,

Guilherme

[0] <email address hidden>/

Igor Gnip (igorgnip) wrote :

Maybe we could just have nvme multipath disabled in maas ephemeral ?

Cheers,
Igor

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Dan, thanks *a lot* for your input. Very good points, I'll address some of them below, sorry for my delay in responding:

"to clarify my understanding, the 'ABORT' just means to not use the nvme 'format' command"
-> Exactly

"you mean check id-ctrl for nn, but this field doesn't actually refer to the count of namespaces"
-> Agreed, it was my bad understanding of the spec! Thanks for pointing that.

"well, i think it would be more like:
if (oacs(bit3) == 1)
  if (fna(bit1) == 1)
    if (count(nvme list-ns /dev/nvmeX) > 1)
      ABORT"
"if maas is managing the entire box, then i'm not sure why maas would ever want to erase *only* specific namespace(s), not the entire nvme (i.e. all namespaces)."
"if that's the only user-visible erase config choices, then I guess it depends if "erase disks..."
does erase *all* system disks on release, or *only* the disks in the "used" section of the system storage config. If only "used" disks are erased, then it does matter if the secure erase wipes other namespaces...but if maas erases *all* disks, then wiping all namespaces is correct."

-> After your comments, Igor's comments, and checking the MAAS code in detail, I'd say we should forget about namespaces heheh.
MAAS basically checks lsblk output and erases all disks, so a namespace would show up as a disk there; no need to be over-careful with that, let's just erase all namespaces.

"(side note: not sure why it's possible to select *both* secure *and* quick erase...)"
-> Odd right? There's some documentation in MAAS code:

' If both --secure-erase and --quick-erase are specified and the drive does NOT have a secure erase feature, maas-wipe will behave as if only --quick-erase was specified. If --secure-erase is specified and --quick-erase is NOT specified and the drive does NOT have a secure erase feature, maas-wipe will behave as if --secure-erase was NOT specified, i.e. will overwrite the whole disk with null bytes. This can be very slow.'
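That documented flag behavior reduces to a small decision function; the names here are mine, for illustration only, not the actual maas-wipe code:

```python
def choose_wipe_method(secure_erase, quick_erase, drive_has_secure_erase):
    """Mirror the documented maas-wipe flag semantics (sketch, not MAAS code)."""
    if secure_erase and drive_has_secure_erase:
        return 'secure-erase'
    if quick_erase:
        # with both flags set and no secure erase feature,
        # behave as if only --quick-erase was specified
        return 'quick-erase'
    # --secure-erase alone with no secure erase feature (or no flags at all):
    # overwrite the whole disk with null bytes
    return 'zero-fill'
```

This makes the silent fallback Igor objects to easy to see: the `zero-fill` branch is reached without any error being raised.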

"There is also the "sanitize" operation, see spec section 8.15, although this is also optional."
-> Optional and pretty *rare* to find; it may also be slow, as you commented later, and not all nvme-cli versions support it. I'd rather not mess with sanitize if possible.

"[...] there is also the "write zeros" command, spec section 6.17 (unfortunately, this is *also* optional, bit 3 in the ONCS field...seems like everything is optional in the spec...). If write zeros is supported, it also (again, optionally) supports discard. So the 'nvme write-zeroes' command, if the drive supports it, will likely be much faster than actually writing zero-blocks to the entire drive (and, probably much better for the drive, too). And using the -d param (to deallocate/discard the blocks) should help restore nvme performance by 'freeing up' all the drive's blocks for the firmware to use (e.g. like the 'fstrim' command)."

-> Awesome idea! We can implement that in the zeroing function in MAAS, so if the NVMe device does not support secure erase, we fallback to write-zeroes, much faster and HW-healthier. Also, on the same topic, "quick erase" writes 2MB in the beginning/end of the disk; I might change that, I guess what is more important for q...


Igor Gnip (igorgnip) wrote :

Hello,

Recent discovery:

nvme-cli prefers drives to have no partition table.

If a partition table was present before the wipe, after the wipe is done it might trigger an error or warning for failing to update the kernel with new partition table data.

It could be beneficial to clear partition tables and flush write buffers before using nvme-cli.
Please note that GPT partition tables also keep a backup copy at the end of the drive.

Some older NVMe devices have hard-coded limits on the number of secure erase rounds, and when that count is used up the device can only do a non-secure erase.

Found this while looking for the limit - hope it helps:
https://www.nvmedeveloperdays.com/English/Collaterals/Proceedings/2018/20181204_PRECON2_Hands.pdf

Regards,
Igor

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Igor, thanks for the heads-up. What do you mean by "wipe"? Is it the "wipefs" command? We can run partprobe to update the kernel partition table after "wipefs", but I don't feel it's necessary in the scenario we're working on, since it's MAAS wiping (after it finishes, the machine is rebooted).

About the GPT at the end of the disk, you're right (at least wipefs tells me there's a copy at the end of the disk). So we could run wipefs plus erasing the first/final 4M of the disk; I think this is more than enough as a quick erase.

And finally, about the limit on the number of secure-erase operations, I agree, but we are going to try cryptographic erase first, which is pretty fast and doesn't wear the device down at all; if the device doesn't support it, we go with the regular secure erase.

Cheers,

Guilherme
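The wipefs-plus-ends approach discussed here (covering both the primary GPT at the start and the backup GPT at the end) might look like the sketch below; the 4 MiB span and the function name are illustrative, not the merged MAAS code:

```python
import os

WIPE_SPAN = 4 * 1024 * 1024  # 4 MiB comfortably covers the primary GPT at the
                             # start and the backup GPT at the end of the disk

def quick_erase_ends(path):
    """Zero the start and end of a block device (works on a plain file for testing)."""
    with open(path, 'r+b') as f:
        size = f.seek(0, os.SEEK_END)   # seek() returns the new absolute offset
        span = min(WIPE_SPAN, size)
        f.seek(0)
        f.write(b'\0' * span)           # primary GPT and MBR live here
        f.seek(size - span)
        f.write(b'\0' * span)           # backup GPT lives at the very end
        f.flush()
        os.fsync(f.fileno())            # force the zeros to stable storage
```

Running `wipefs -a` first would additionally clear any other signatures wipefs knows about and notify the kernel of the change.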

Igor Gnip (igorgnip) wrote :

Hi Guilherme,

Yes, executing wipefs -a /dev/nvme?n? should do what is required and inform the kernel of the change.
I haven't tried it out yet.

Also, I would try to rescan the controller before invoking nvme-cli, to play it safe.

    @classmethod
    def rescan(cls, kname):
        # writing '1' asks the kernel to rescan the NVMe controller
        path = '/sys/block/%s/device/rescan_controller' % kname.decode('utf-8')
        with open(path, "wb", 0) as fp:
            fp.write(b'1\n')

I was not able to confirm there actually is any difference (yet).

Guilherme G. Piccoli (gpiccoli) wrote :

Hi all, I've proposed a merge request [0] with the NVMe secure erase code, plus one more patch to add wipefs to quick erase (and to fix an exception not handled there).
Cheers,

Guilherme

[0] https://code.launchpad.net/~gpiccoli/maas/+git/maas/+merge/386617

Changed in maas:
milestone: none → next
status: In Progress → Fix Committed

Thanks Lee, for backporting to all supported MAAS series =)

Igor Gnip (igorgnip) wrote :

Thank you all for contributing to this fix.
I will be testing the change on real hardware as time permits and will post comments here.

Changed in maas:
milestone: next → 2.9.0b1
Lee Trager (ltrager) on 2020-09-08
Changed in maas:
status: Fix Committed → Fix Released