Commission scripts select the wrong nvme device link, then fails to report any storage

Bug #1665143 reported by John George
This bug affects 3 people
Affects           Status        Importance  Assigned to       Milestone
MAAS              Fix Released  High        Unassigned
2.1               Fix Released  High        Andres Rodriguez
systemd (Ubuntu)  Invalid       Undecided   Unassigned

Bug Description

The udev package provides /lib/udev/rules.d/60-persistent-storage.rules, which creates two symlinks for each nvme device under /dev/disk/by-id/. The first link name includes the device wwid and the second includes the device model/serial. The commission script selects the first link it discovers and then attempts to store it in a FilePath field, which allows at most 100 characters. Since the wwid link is longer than 100 characters, an exception is thrown, and as a result neither the nvme device nor any other storage device is registered. Although commissioning completes, no storage is assigned, which makes deployment of the node impossible.

This issue has blocked all test runs performed by the CDO-QA test infrastructure, since every run installs MAAS on a fresh machine and commissions new nodes. The failure is seen when installing from either ppa:maas/next (2.2.0~beta2) or ppa:maas/stable (2.1.3+bzr5573).

ubuntu@meowth:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-=================================================
ii maas 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.2.0~beta2+bzr5717-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

After re-commissioning one of the servers with ssh enabled, the attached log files were collected. Please note that from the shell it can be seen that block devices are discovered, and even the commissioning output found in /tmp/user_data.sh.IK9yVp/out/00-maas-07-block-devices lists devices (see attached), whereas this file is shown as a 0 byte file in the GUI (see screenshot).

There are 'HTTP Error 500: INTERNAL SERVER ERROR' errors in cloud-init-output.log

ubuntu@azurill:~$ uname -a
Linux azurill 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:55:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@azurill:~$ sudo lsblk --exclude 1,2,7 -d -P -o NAME,RO,RM,MODEL,ROTA
NAME="sdb" RO="0" RM="0" MODEL="LOGICAL VOLUME " ROTA="1"
NAME="sdc" RO="1" RM="0" MODEL="VIRTUAL-DISK " ROTA="1"
NAME="sda" RO="0" RM="0" MODEL="LOGICAL VOLUME " ROTA="1"
NAME="nvme0n1" RO="0" RM="0" MODEL="INTEL SSDPEDME400G4

Tags: cdo-qa sts

Andres Rodriguez (andreserl) wrote :

From the logs above:

2017-02-15 20:22:04 maasserver: [error] ################################ Exception: value too long for type character varying(100)
 ################################
2017-02-15 20:22:04 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
psycopg2.DataError: value too long for type character varying(100)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 177, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 59, in __call__
    response = upcall(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 21, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/piston3/resource.py", line 190, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3/dist-packages/piston3/resource.py", line 188, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 298, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 704, in signal
    target_status = process(node, request, status)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 567, in _process_commissioning
    node, node.current_commissioning_script_set, request, status)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 532, in _store_results
    script_result.store_result(**args)
  File "/usr/lib/python3/dist-packages/metadataserver/models/scriptresult.py", line 144, in store_result
    exit_status=self.exit_status)
  File "/usr/lib/python3/dist-packages/metadataserver/models/commissioningscript.py", line 348, in update_node_physical_block_devices
    serial=serial,
  File "/usr/lib/python3/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/db/models/query.py", line 348, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/lib/python3/dist-packages/maasserver/models/cleansave.py", line 29, in save
    return super(CleanSave, self).save(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/models/timestampedmodel.py", line 72, in save
    return super(TimestampedModel, self).save(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 734, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/lib/python3/dist-packages/django/db/models/base.py"...

Andres Rodriguez (andreserl) wrote :

OK, so the reason for this is that MAAS stores the path in a FilePath field with a maximum of 100 characters. The one found here is much longer than that.

The interesting thing is that Ubuntu has changed the length of this file path, and this could cause regressions not just in MAAS but in other places. IMHO, this needs to be looked at in Ubuntu as well.

Changed in maas:
milestone: none → 2.2.0
status: New → Confirmed
Lee Trager (ltrager) wrote :

In 00-maas-07-block-devices the Intel NVME device has an ID_PATH which is 115 characters long. We use a Django FilePathField on the BlockDevice model to store the device file path. Django limits the FilePathField to 100 characters by default. The result is being rejected because the file path is currently too long for the field.

https://docs.djangoproject.com/en/1.10/ref/models/fields/#filepathfield
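
To make the numbers concrete, here is a small Python check of the two by-id paths reported for nvme0n1 in this bug against the 100-character default. It is only an illustration: MAX_LENGTH and fits() are names made up for this sketch and are not MAAS or Django code.

```python
import os.path

# Default max_length of Django's FilePathField, per the docs linked above.
MAX_LENGTH = 100

BY_ID = "/dev/disk/by-id"

# The two symlink names udev created for nvme0n1 on the affected machine.
wwid_link = os.path.join(
    BY_ID,
    "nvme-nvme.8086-43564d44343334353030424e34303041474e"
    "-494e54454c205353445045444d453430304734-00000001",
)
serial_link = os.path.join(BY_ID, "nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN")


def fits(path, limit=MAX_LENGTH):
    """Return True if the path can be stored in a field of `limit` characters."""
    return len(path) <= limit


for path in (wwid_link, serial_link):
    print(len(path), fits(path))
```

The wwid-based path comes out at 115 characters, matching the figure above, while the model/serial path is 59 characters and fits.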

Chris Gregan (cgregan)
tags: added: cdo-qa cdo-qa-blocker
removed: cdoqa-blocker
John George (jog) wrote :

When commissioning with the HWE kernel the ID_PATH is shorter:

{
  "MODEL": "INTEL SSDPEDME400G4",
  "SERIAL": "CVMD434500BN400AGN",
  "ROTA": "0",
  "ID_PATH": "/dev/disk/by-id/nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN",
  "PATH": "/dev/nvme0n1",
  "NAME": "nvme0n1",
  "BLOCK_SIZE": "4096",
  "RO": "0",
  "RM": "0",
  "SIZE": "400088457216"
 }

Changed in maas:
status: Confirmed → Fix Committed
John George (jog) wrote :

I'm seeing the same issue in 2.1.3+bzr5573, with both the default and HWE commissioning kernels.

Andres Rodriguez (andreserl) wrote :

We cannot fix this in 2.1, given that it requires a migration and we cannot backport migrations. It would be interesting to see what caused this and how we can address it.

The issue is that something changed in the OS as this was previously working.

John George (jog) wrote :

When ssh'ing into the commissioning machine I can see that there are two links to nvme0n1. It seems like the first is the one that MAAS really needs to use. Does the commissioning script need to filter?

ubuntu@azurill:~$ ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 13 Feb 22 22:15 nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Feb 22 22:15 nvme-nvme.8086-43564d44343334353030424e34303041474e-494e54454c205353445045444d453430304734-00000001 -> ../../nvme0n1
lrwxrwxrwx 1 root root 9 Feb 22 22:15 scsi-360000000000000000e000000000f0001 -> ../../sdc
lrwxrwxrwx 1 root root 9 Feb 22 22:15 scsi-3600508b1001c29aef10ecd96f5df2342 -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 22 22:15 scsi-3600508b1001cd63f5d20ab8562112349 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 22 22:15 scsi-3600508b1001cd63f5d20ab8562112349-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Feb 22 22:15 scsi-3600508b1001cd63f5d20ab8562112349-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Feb 22 22:15 wwn-0x60000000000000000e000000000f0001 -> ../../sdc
lrwxrwxrwx 1 root root 9 Feb 22 22:15 wwn-0x600508b1001c29aef10ecd96f5df2342 -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 22 22:15 wwn-0x600508b1001cd63f5d20ab8562112349 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 22 22:15 wwn-0x600508b1001cd63f5d20ab8562112349-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Feb 22 22:15 wwn-0x600508b1001cd63f5d20ab8562112349-part2 -> ../../sda2

Andres Rodriguez (andreserl) wrote :

This is the script that gathers the block devices. I looked into it and it seems that it is doing the right thing as it is getting the by-id link based on the /dev/nvme* path.

http://paste.ubuntu.com/24049424/

John George (jog) wrote :

Here is the output from running that script.
http://paste.ubuntu.com/24049463/

Andres Rodriguez (andreserl) wrote :

I've attached a workaround patch to the bug, but it is an ugly workaround. That said, I noticed some strange behavior: the link format for nvme devices is completely different from the link format used for scsi devices.

For example, for 'sdb' we see two links:

  Checking that path [/dev/sdb] and is same as link [wwn-0x600508b1001c29aef10ecd96f5df2342]
  Checking that path [/dev/sdb] and is same as link [scsi-3600508b1001c29aef10ecd96f5df2342]

For 'nvme' we also see two, with a completely different format:

  Checking that path [/dev/nvme0n1] and is same as link [nvme-nvme.8086-43564d44343334353030424e34303041474e-494e54454c205353445045444d453430304734-00000001]
  Checking that path [/dev/nvme0n1] and is same as link [nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN]

Dan Streetman (ddstreet) wrote :

> Checking that path [/dev/nvme0n1] and is same as link [nvme-nvme.8086-43564d44343334353030424e34303041474e-494e54454c205353445045444d453430304734-00000001]

this is the wwid symlink, created by:
SYMLINK+="disk/by-id/nvme-$attr{wwid}"

> Checking that path [/dev/nvme0n1] and is same as link [nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN]

this is the model/serial symlink, created by:
ENV{ID_SERIAL}="$attr{model}_$env{ID_SERIAL_SHORT}", SYMLINK+="disk/by-id/nvme-$env{ID_SERIAL}"
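
As an aside, the hex fields in the wwid link name for this device decode to the same serial and model strings that the second link uses. A quick Python sketch; the field layout is inferred from this one device's link name rather than from a spec, and decode_wwid_link() is a hypothetical helper, not part of udev or MAAS:

```python
def decode_wwid_link(name):
    """Split an 'nvme-nvme.<vendor>-<hex>-<hex>-<nsid>' link name.

    Assumes the layout seen on this machine; other devices may differ.
    """
    prefix = "nvme-nvme."
    if not name.startswith(prefix):
        raise ValueError("not a wwid-style nvme link: %r" % name)
    vendor, serial_hex, model_hex, nsid = name[len(prefix):].split("-")
    return {
        "vendor": vendor,
        # On this device the first hex field is the serial, the second the model.
        "serial": bytes.fromhex(serial_hex).decode("ascii"),
        "model": bytes.fromhex(model_hex).decode("ascii"),
        "nsid": nsid,
    }


link = ("nvme-nvme.8086-43564d44343334353030424e34303041474e"
        "-494e54454c205353445045444d453430304734-00000001")
info = decode_wwid_link(link)
print(info)
```

So both links carry the model and serial; the wwid one simply hex-encodes them, which is what makes it so much longer.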

> This is the script that gathers the block devices. I looked into it and it seems that it is
> doing the right thing as it is getting the by-id link based on the /dev/nvme* path.

    def _path_to_idpath(path):
        """Searches dev_disk_byid for a device symlinked to /dev/[path]"""
        if os.path.exists(dev_disk_byid):
            for link in os.listdir(dev_disk_byid):
                if os.path.exists(path) and os.path.samefile(
                        os.path.join(dev_disk_byid, link), path):
                    return os.path.join(dev_disk_byid, link)
        return None

this just searches all the symlinks in /dev/disk/by-id/ and uses the first one that points to the target device or partition. That's fine if you don't actually care about which of the symlinks you use.
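
If the choice of symlink does matter, one option is to collect every matching link and then pick one by an explicit rule. The following is only a sketch of that idea, not the patch attached to this bug: candidate_idpaths(), pick_idpath(), the "shortest, then alphabetical" rule and the max_length parameter are all assumptions made for illustration.

```python
import os


def candidate_idpaths(path, byid_dir="/dev/disk/by-id"):
    """Return every link in byid_dir that resolves to the same file as path."""
    if not (os.path.isdir(byid_dir) and os.path.exists(path)):
        return []
    links = (os.path.join(byid_dir, name) for name in os.listdir(byid_dir))
    # samefile() raises on dangling symlinks, so skip links that do not resolve.
    return [
        link for link in links
        if os.path.exists(link) and os.path.samefile(link, path)
    ]


def pick_idpath(path, byid_dir="/dev/disk/by-id", max_length=100):
    """Pick one link deterministically: shortest first, then alphabetical.

    Links longer than max_length are ignored; returns None if nothing fits.
    """
    fitting = [
        link for link in candidate_idpaths(path, byid_dir)
        if len(link) <= max_length
    ]
    return min(fitting, key=lambda link: (len(link), link), default=None)
```

With the two links listed above for /dev/nvme0n1, a rule like this would return the model/serial link regardless of the order in which os.listdir() happens to yield the entries.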

> The result is being rejected because the file path is currently too long for the field.

well the max path size needs to be increased then.

Andres Rodriguez (andreserl) wrote : Re: [Bug 1665143] Re: commissioning does not discover block devices on HP ProLiant DL360 Gen9 servers

Hi Dan,

Thanks for the explanation. Question inline:

Andres Rodriguez (andreserl) wrote :

On Thu, Feb 23, 2017 at 9:26 AM, Dan Streetman <email address hidden> wrote:

> > Checking that path [/dev/nvme0n1] and is same as link [nvme-nvme.8086-43564d44343334353030424e34303041474e-494e54454c205353445045444d453430304734-00000001]
>
> this is the wwid symlink, created by:
> SYMLINK+="disk/by-id/nvme-$attr{wwid}"
>
> > Checking that path [/dev/nvme0n1] and is same as link [nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN]
>
> this is the model/serial symlink, created by:
> ENV{ID_SERIAL}="$attr{model}_$env{ID_SERIAL_SHORT}", SYMLINK+="disk/by-id/nvme-$env{ID_SERIAL}"
>
What seems strange to me is that both links above differ completely in
format from those of a 'scsi' device, which is why I was comparing them.

For example:

sdb's [wwn-0x600508b1001c29aef10ecd96f5df2342] vs nvme's [nvme-nvme.8086-43564d44343334353030424e34303041474e-494e54454c205353445045444d453430304734-00000001]

or

sdb's [scsi-3600508b1001c29aef10ecd96f5df2342] vs nvme's [nvme-INTEL_SSDPEDME400G4_CVMD434500BN400AGN].

This seems a bit inconsistent to me. For example, the symlinks for sdb do
not include Model/Serial, whereas the ones for nvme do. That alone is a
difference in how the symlinks are being created, which is why I was
asking, as it seems weird to me.

Thanks.


John George (jog)
description: updated
description: updated
summary: - commissioning does not discover block devices on HP ProLiant DL360 Gen9
- servers
+ Commission scripts select the wrong nvme device link, then fail to
+ report any storage
John George (jog) wrote : Re: Commission scripts select the wrong nvme device link, then fail to report any storage

Please mark this bug critical, as it blocks all integration testing.

Dan Streetman (ddstreet) wrote :

> This seems a bit inconsistent to me, for example the symlink for sdb's does
> not include Model/Serial, where as they do for nvme's. That alone is a
> format difference on how the symlinks are being created and which is why I
> was asking, as it seems weird to me.

there's no required format for any of the /dev/disk/by-id/ symlinks that I'm aware of.

tags: added: sts
John George (jog)
summary: - Commission scripts select the wrong nvme device link, then fail to
+ Commission scripts select the wrong nvme device link, then fails to
report any storage
Andres Rodriguez (andreserl) wrote :

I've attached a patch that can be applied in 2.1. https://code.launchpad.net/~andreserl/maas/lp1665143 This should fix the issue. I'd request everybody having this issue to please test it.

John George (jog) wrote :

I applied the patch to /usr/lib/python3/dist-packages/provisioningserver/refresh/node_info_scripts.py on our failing maas and restarted maas-regiond.service (sudo systemctl status maas-regiond.service).

Storage was reported as expected after re-commissioning existing nodes; I also removed a node, re-enlisted and commissioned it. The test was done with both the default and HWE 16.04 commissioning kernels.

Changed in maas:
importance: Undecided → High
Changed in maas:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in udev (Ubuntu):
status: New → Confirmed
Chris Gregan (cgregan)
tags: removed: cdo-qa-blocker
affects: udev (Ubuntu) → systemd (Ubuntu)
Dan Streetman (ddstreet) wrote :

marking as invalid for systemd as i believe the bug was in maas, per previous comments.

Changed in systemd (Ubuntu):
status: Confirmed → Invalid