smartctl-validate is borked in a recent release

Bug #1869116 reported by cees
This bug affects 10 people
Affects  Status        Importance  Assigned to  Milestone
MAAS     Fix Released  Critical    Lee Trager
2.7      New           Undecided   Unassigned
2.8      Fix Released  Critical    Lee Trager
lxd      Fix Released  Unknown

Bug Description

Bug (maybe?) details first, diatribe second.

Bug Summary: multi-hdd / raid with multiple drives / multiple devices or something along those lines cannot be commissioned anymore: 2.4.x worked fine, 2.7.0 does not.

Here is the script output of smartctl-validate:

-----
# /dev/sda (Model: PERC 6/i, Serial: 6842b2b0740e9900260e66c9220df4ac)

Unable to run 'smartctl-validate': Storage device 'PERC 6/i' with serial '6842b2b0740e9900260e66c9220df4ac' not found!
This indicates the storage device has been removed or the OS is unable to find it due to a hardware failure. Please re-commission this node to re-discover the storage devices, or delete this device manually.
Given parameters:
{'storage': {'argument_format': '{path}', 'type': 'storage', 'value': {'id_path': '/dev/disk/by-id/wwn-0x6842b2b0740e9900260e66c9220df4ac', 'model': 'PERC 6/i', 'name': 'sda', 'physical_blockdevice_id': 33, 'serial': '6842b2b0740e9900260e66c9220df4ac'}}}
Discovered storage devices:
[{'NAME': 'sda', 'MODEL': 'PERC_6/i', 'SERIAL': '6842b2b0740e9900260e66c9220df4ac'}, {'NAME': 'sdb', 'MODEL': 'PERC_6/i', 'SERIAL': '6842b2b0740e9900260e66f924ecece0'}, {'NAME': 'sr0', 'MODEL': 'TEAC_DVD-ROM_DV-28SW', 'SERIAL': '10092013112645'}]
Discovered interfaces:
{'xx:xx:xx:xx:xx:xx': 'eno1'}
-----
-----
# /dev/sdb (Model: PERC 6/i, Serial: 6842b2b0740e9900260e66f924ecece0)
Unable to run 'smartctl-validate': Storage device 'PERC 6/i' with serial '6842b2b0740e9900260e66f924ecece0' not found!
This indicates the storage device has been removed or the OS is unable to find it due to a hardware failure. Please re-commission this node to re-discover the storage devices, or delete this device manually.
Given parameters:
{'storage': {'argument_format': '{path}', 'type': 'storage', 'value': {'id_path': '/dev/disk/by-id/wwn-0x6842b2b0740e9900260e66f924ecece0', 'model': 'PERC 6/i', 'name': 'sdb', 'physical_blockdevice_id': 34, 'serial': '6842b2b0740e9900260e66f924ecece0'}}}
Discovered storage devices:
[{'NAME': 'sda', 'MODEL': 'PERC_6/i', 'SERIAL': '6842b2b0740e9900260e66c9220df4ac'}, {'NAME': 'sdb', 'MODEL': 'PERC_6/i', 'SERIAL': '6842b2b0740e9900260e66f924ecece0'}, {'NAME': 'sr0', 'MODEL': 'TEAC_DVD-ROM_DV-28SW', 'SERIAL': '10092013112645'}]
Discovered interfaces:
{'xx:xx:xx:xx:xx:xx': 'eno1'}
-----

You can see that it reports the storage device as not found and then immediately lists it as a discovered device. It does this for both tests (one per drive) and on both servers.

Bug Details:
I ran MAAS 2.4.x for the longest time over my journey (see the diatribe below) and never had any problems re-commissioning (or deleting and re-discovering over PXE boot) two of my servers (r610, r710).

r610 has a PERC 6/i with four 10K X00GB drives configured in a RAID10, 1 virtual disk.
r710 has a PERC 6/i with six 2TB drives configured in a RAID10, 2 virtual disks.

So commission after commission while trying to get through my journey, 0 problems. After I finally got everything figured out on the juju, network/vlan, and quad-nic end, I went to re-commission and could not. smartctl-validate fails on both servers, over and over again. I even destroyed and re-created the RAID/VDs, and it still fails.

After spending so much time on it, I remembered that this was the first time I had tried to re-commission these two servers since upgrading from 2.4.x to 2.7 in an effort to use the updated KVM integration to add a couple more guests. Once I got everything figured out I went to re-commission everything and boom.

[Upgrade path notes]
In full disclosure, in case this matters: I was on an apt install of 2.4.x and tried the snap for 2.7, but it didn't work. So I read up on how to install 2.7 via apt and did that, and have not uninstalled the 2.7 snap yet. I wanted to migrate from apt to snap but do not know how to do it without losing all MAAS data and could not find docs on it, so that is a problem for another day. But in case that is part of the problem for some odd reason, I wanted to share.

[Diatribe]
My journey to get maas+juju+openstack+kubernetes working has been less than stellar. I have run into problem after problem, albeit some of which were my own doing. I am so close, after spending the last 6 months on and off when I had time, and really hardcore the last 4 days, the last half day of which has gone to this little gem. MAAS has been pretty fun to work with, but some things have been the biggest pain in the a-hole to understand. Un/managed subnets come to mind: "Managed: we're going to use IPs, even with DHCP off. Unmanaged: We're still going to use IPs, but be different". Anyway, this doesn't belong here; if it gets modded out that's fine. It makes me feel a little better typing it, knowing that I *think* my last problem was solved to get this up and running; I'm just trying to contribute something back that I can.

I did want to say thanks to those who made/maintain MAAS. Despite the problems I somehow always run into, I have enjoyed figuring it out.

-Red

cees (red-f-unique-usernames) wrote :
Alberto Donato (ack) wrote :

Hi, can you please attach the output of the "50-maas-01-commissioning" commissioning script?

Changed in maas:
status: New → Incomplete
Alberto Donato (ack) wrote :

Also, can you please try commissioning using the "hwe-18.04-edge" kernel, see if it makes a difference?

cees (red-f-unique-usernames) wrote :

Attached script as requested.

I'm not sure how to commission with that kernel. If it is super important to try then I will try to help you out, but I have just enough to make juju work with 4 nodes, so taking one out of service is rough after getting past this by just overriding the errors. :)

Alberto Donato (ack) wrote :

It seems the issue is that MAAS is detecting /dev/sd{a,b} as normal disks, but they're actually RAID arrays.
It shouldn't try to run smartctl-validate on those devices.

Changed in maas:
status: Incomplete → Triaged
Lee Trager (ltrager) wrote :

MAAS supports running SMART against RAID arrays, as many pass through SMART information. Some RAID systems require additional tools to be installed; we support MegaRAID through this script[1].

The issue is how the model name is being detected. Before MAAS 2.7, the MAAS commissioning scripts, MAAS hardware tests, and Curtin all used lsblk. As of MAAS 2.7, MAAS gathers commissioning data from LXD, while hardware testing and Curtin still use lsblk.

lsblk model name: PERC_6/i
LXD model name: PERC 6/i

Notice that lsblk uses an underscore while LXD uses a space. LXD needs to be consistent with lsblk to ensure compatibility.
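To make the mismatch concrete, here is a minimal Python sketch using the two model strings from this bug. The `normalize_model` helper is hypothetical and only illustrates one way two spellings could be compared; it is not the patch that was applied to LXD or MAAS.

```python
import re


def normalize_model(model: str) -> str:
    """Hypothetical helper: turn runs of whitespace into a single underscore,
    so a space-separated model name compares equal to the lsblk-style one."""
    return re.sub(r"\s+", "_", model.strip())


lxd_model = "PERC 6/i"    # model name reported by LXD in this bug
lsblk_model = "PERC_6/i"  # model name reported by lsblk in this bug

# A plain string comparison fails, which is why the device is "not found".
print(lxd_model == lsblk_model)  # False

# After normalizing both sides the names agree.
print(normalize_model(lxd_model) == normalize_model(lsblk_model))  # True
```

The same helper maps 'PERC H330 Mini' to 'PERC_H330_Mini', the other pair of names reported later in this thread.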

[1] https://discourse.maas.io/t/running-smart-tests-against-megaraid-controllers/185

Lee Trager (ltrager) wrote :

I spoke with the LXD team and they're passing through the name they get from the kernel. It seems like this may be a bug introduced into util-linux that affects RAID devices. I tried running lsblk on Focal against two NVMe drives and I do not see an underscore used in model names.
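For anyone who wants to check what lsblk reports on their own machine, the following is a rough Python sketch. It assumes an lsblk new enough to support `--json`; the function names are made up for illustration, and `SAMPLE` is a hand-written string built from values in this report, not captured output.

```python
import json
import subprocess


def parse_lsblk_models(lsblk_json: str) -> dict:
    """Map device name to model from `lsblk --json` output (model may be None)."""
    data = json.loads(lsblk_json)
    return {dev["name"]: dev.get("model") for dev in data["blockdevices"]}


def lsblk_models() -> dict:
    """Run lsblk on the local machine and return {name: model}."""
    result = subprocess.run(
        ["lsblk", "--json", "--nodeps", "--output", "NAME,MODEL,SERIAL"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_lsblk_models(result.stdout)


def models_with_underscores(models: dict) -> dict:
    """Keep only the devices whose reported model contains an underscore."""
    return {name: model for name, model in models.items() if model and "_" in model}


# Hand-written sample (assumption), using names and models from this report.
SAMPLE = json.dumps({
    "blockdevices": [
        {"name": "sda", "model": "PERC_6/i", "serial": "6842b2b0740e9900260e66c9220df4ac"},
        {"name": "nvme0n1", "model": None, "serial": None},
    ]
})

print(models_with_underscores(parse_lsblk_models(SAMPLE)))
```

On a real node, `models_with_underscores(lsblk_models())` would list the devices whose lsblk model name cannot match a space-separated name from another source.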

@cees - Can you try using a different commissioning operating system and see if the problem is resolved? You can download 16.04 and 20.04 on the images page and then change the commissioning operating system on the settings page.

no longer affects: util-linux-ng
cees (red-f-unique-usernames) wrote :

When running the previous (18.xx LTS and 16.xx LTS) options in MAAS for commissioning, it skipped the scripts because it said the drives do not support RAID. Perhaps that ties in with Alberto's comment. I'm not too familiar with SMART, but I'm not sure why the drives don't support it; I thought they did. Or is it that, because the drives sit behind a hardware RAID controller, SMART cannot be exposed/run for testing?

Maybe that answer is less relevant to this discussion, but I wanted to report back that running the other releases as requested passed, but due to SKIPPING the scripts.

Thanks and hope that helps,

-cees

Lee Trager (ltrager) wrote :

That makes sense based on the other information in this bug.

MAAS sends the test runner a list of storage devices to run hardware tests on. The test runner uses the model and serial to map to a block device name. This block device name is passed to the smartctl-validate script. smartctl-validate uses smartctl to determine if SMART data is available for the device.

Because the model name differs between lsblk and LXD, the device can't be looked up and the test is marked as a failure. The mapping needs to work so MAAS can pass the block device to smartctl-validate, which knows the test can be skipped.
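The lookup described above can be sketched roughly as follows. This is not the MAAS test runner's code: `find_block_device` and its `strict` flag are hypothetical, and the data is copied from the script output in the bug description. With an exact model comparison the lookup returns nothing, which corresponds to the "not found" failure; comparing with whitespace mapped to underscores finds the device.

```python
import re
from typing import List, Optional


def _norm(model: str) -> str:
    # Hypothetical normalization: whitespace runs become one underscore.
    return re.sub(r"\s+", "_", model.strip())


def find_block_device(
    model: str, serial: str, discovered: List[dict], strict: bool = False
) -> Optional[str]:
    """Return the /dev path of the discovered device matching model and serial.

    strict=True compares model names exactly; strict=False compares them
    after normalization. Returns None when nothing matches.
    """
    for dev in discovered:
        if dev["SERIAL"] != serial:
            continue
        if strict:
            same_model = dev["MODEL"] == model
        else:
            same_model = _norm(dev["MODEL"]) == _norm(model)
        if same_model:
            return "/dev/" + dev["NAME"]
    return None


# "Discovered storage devices" as printed in the bug description.
discovered = [
    {"NAME": "sda", "MODEL": "PERC_6/i", "SERIAL": "6842b2b0740e9900260e66c9220df4ac"},
    {"NAME": "sdb", "MODEL": "PERC_6/i", "SERIAL": "6842b2b0740e9900260e66f924ecece0"},
    {"NAME": "sr0", "MODEL": "TEAC_DVD-ROM_DV-28SW", "SERIAL": "10092013112645"},
]

serial = "6842b2b0740e9900260e66c9220df4ac"
print(find_block_device("PERC 6/i", serial, discovered, strict=True))  # None
print(find_block_device("PERC 6/i", serial, discovered))  # /dev/sda
```

Once a path such as /dev/sda is resolved, it can be handed to smartctl-validate, which then decides whether SMART data is available or the test should be skipped.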

tags: added: champagne
cees (red-f-unique-usernames) wrote :

Rereading my comment above I had a huge typo, I meant to say that the script output for 18/16 LTS said "the drivers do not support SMART", rather than use the word RAID, sorry about that confusion!

-cees

Alberto Donato (ack)
tags: removed: champagne
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in util-linux (Ubuntu):
status: New → Confirmed
Changed in lxd:
status: Unknown → Fix Released
Lee Trager (ltrager) wrote :

I was able to reproduce this bug by emulating an IDE device in QEMU and running the smartctl-validate test. I have filed an upstream bug as well.

https://github.com/karelzak/util-linux/issues/1098

Lee Trager (ltrager)
Changed in lxd:
status: Fix Released → Unknown
Changed in maas:
importance: Undecided → High
importance: High → Critical
milestone: none → 2.9.0b1
assignee: nobody → Lee Trager (ltrager)
Changed in lxd:
status: Unknown → New
Changed in maas:
status: Triaged → Fix Committed
Lee Trager (ltrager)
no longer affects: util-linux (Ubuntu)
Lee Trager (ltrager) wrote :

The LXD patch which has now been applied to 2.7, 2.8, and master should fix this for Xenial and Bionic. The bug may still be seen in Focal. I've opened LP:1888021 to fix that separately.

Changed in lxd:
status: New → Fix Released
Lee Trager (ltrager)
Changed in maas:
status: Fix Committed → Fix Released
micah (micahliu) wrote :

Unable to run 'smartctl-validate': Storage device 'PERC H330 Mini' with serial '61866da05ab07e002788b69103f5305c' not found!
This indicates the storage device has been removed or the OS is unable to find it due to a hardware failure. Please re-commission this node to re-discover the storage devices, or delete this device manually.
Given parameters:
{'storage': {'argument_format': '{path}', 'type': 'storage', 'value': {'id_path': '/dev/disk/by-id/wwn-0x61866da05ab07e002788b69103f5305c', 'model': 'PERC H330 Mini', 'name': 'sda', 'physical_blockdevice_id': 7, 'serial': '61866da05ab07e002788b69103f5305c'}}}
Discovered storage devices:
[{'NAME': 'sda', 'MODEL': 'PERC_H330_Mini', 'SERIAL': '61866da05ab07e002788b69103f5305c'}]
Discovered interfaces:
{'44:a8:42:18:4b:9c': 'eno1'}

Tolga Kaprol (tolgakaprol) wrote :

We are also experiencing the same issue on MAAS 2.7, even on identical servers which have identical RAID cards.

Approximately 1 in 5 servers pass the smartctl-validate tests.
