1.9 new disks not discovered by maas during recommissioning

Bug #1536233 reported by Larry Michel
This bug affects 5 people
Affects | Status  | Importance | Assigned to | Milestone
MAAS    | Triaged | Critical   | Unassigned  |

Bug Description

We have had a number of build failures due to systems failing to deploy. It turns out that the commissioning data does not reflect new hardware, even after recommissioning the system.

On one system to which new hardware was added, I tried recommissioning multiple times, but the UI node view never showed the new disks. In fact, the old disk was replaced and 2 new disks were added, and MAAS still showed the old disks after recommissioning. Deleting the node from MAAS, re-enlisting it, and then recommissioning worked, and I can now see the new disks. I am attaching the old node view from the deleted system and will attach the new node view later.

And here's the new disk info:

CPU: 16 cores
RAM: 32GiB
Storage: 1200.0GB over 2 disks
Operating System: Ubuntu 14.04 LTS "Trusty Tahr"
Kernel: trusty (hwe-t)

I did verify that we hit bug 1534942 on this system, but since we also hit that other bug on systems that did not have new hardware, I am opening a new bug. Nonetheless, the two issues could be related.

Tags: oil
Larry Michel (lmic) wrote :
summary: - 1.9 new disk not discovered by maas during recommissioning
+ 1.9 new disks not discovered by maas during recommissioning
Larry Michel (lmic) wrote :

Note that I have recreated this on multiple systems and on two MAAS servers running 1.9.0. Attached is the node view page for the system that was deleted and then re-enlisted and recommissioned.

Andres Rodriguez (andreserl) wrote :

Hi Larry,

Did you, by any chance, click on "Retain Storage Configuration" when trying to recommission the machine?

Changed in maas:
milestone: none → 1.9.1
importance: Undecided → High
status: New → Incomplete
Larry Michel (lmic) wrote :

Hi Andres, No, the retain storage configuration was not selected.

Changed in maas:
status: Incomplete → New
Gavin Panella (allenap)
Changed in maas:
status: New → Triaged
Changed in maas:
importance: High → Critical
JuanJo Ciarlante (jjo) wrote :

FYI I seem to be facing the same issue (maas 1.9.0+bzr4533-0ubuntu1~trusty1).

Yesterday we swapped the 1TB sda drive (unfortunately I don't have the previous one available to check its serial#). I recommissioned the node, but deployment fails with:

An error occured handling 'sda': ValueError - no disk with serial 'WD-WCATRC567667' found
no disk with serial 'WD-WCATRC567667' found
Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3

, but the node has no such model:

$ maas maas-root node details node-5d6b4822-51cb-11e5-a82c-00e0815b167e|egrep --text WD |xargs -l2
<product>WDC WD1002FAEX-0</product> <serial>WD-WCATRC568491</serial>
<product>WDC WD20EZRX-00D</product> <serial>WD-WMC4M3038307</serial>

(I'm positive this is the node id)
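
For anyone repeating this check on several nodes, the grep/xargs pipeline above can be approximated in Python. This is only a sketch: it assumes the raw bytes of the `maas ... node details` output have already been captured (for example by redirecting the command to a file), and the function names `lines_matching`, `pair_up` and `disks_from_details` are made up for illustration. Like the shell version, it relies on the product line and the serial line both containing the search string.

```python
import re

# Matches one <product>...</product> or <serial>...</serial> element.
TAG_RE = re.compile(r"<(product|serial)>([^<]*)</\1>")


def lines_matching(raw: bytes, needle: bytes = b"WD"):
    """Return stripped text lines containing `needle` (like `egrep --text`)."""
    return [
        line.strip().decode("utf-8", errors="replace")
        for line in raw.splitlines()
        if needle in line
    ]


def pair_up(lines):
    """Group lines two at a time (like `xargs -l2`)."""
    return [tuple(lines[i:i + 2]) for i in range(0, len(lines), 2)]


def disks_from_details(raw: bytes, needle: bytes = b"WD"):
    """Return (product, serial) tuples found in the captured output."""
    disks = []
    for pair in pair_up(lines_matching(raw, needle)):
        fields = dict(m.groups() for line in pair for m in TAG_RE.finditer(line))
        disks.append((fields.get("product"), fields.get("serial")))
    return disks


# Hypothetical captured output: binary noise plus the two disks shown above.
SAMPLE = (
    b"\x00\x13binary\x00prefix\n"
    b"      <product>WDC WD1002FAEX-0</product>\n"
    b"      <vendor>Western Digital</vendor>\n"
    b"      <serial>WD-WCATRC568491</serial>\n"
    b"      <product>WDC WD20EZRX-00D</product>\n"
    b"      <serial>WD-WMC4M3038307</serial>\n"
)

for product, serial in disks_from_details(SAMPLE):
    print(product, serial)
```

The pairing is as fragile as the original pipeline: a matching line without a partner shifts every later pair, so the result is worth eyeballing.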

JuanJo Ciarlante (jjo) wrote :

Follow-up to comment #5 (btw erratum: s/such model/such serial): it's perplexing that the serial returned by maas-cli for the node, 'WD-WCATRC568491', is not in the DB anywhere; instead the DB has the bogus one, 'WD-WCATRC567667':

$ pg_dump maasdb |grep WD-
11 WDC WD1002FAEX-0 WD-WCATRC567667
12 WDC WD20EZRX-00D WD-WMC4M3038307
15 WDC WD1002FAEX-0 WD-WCATRC387589
16 WDC WD1002FAEX-0 WD-WCATRC368653
5 WDC WD1002FAEX-0 WD-WCATRC568481
6 WDC WD1002FAEX-0 WD-WCATRC425546
13 WDC WD1002FAEX-0 WD-WCATRC585555
14 WDC WD20EZRX-00D WD-WMC4M3042500

, which is indeed confirmed by the following db query:

maasdb=# select n.id, d.id, d.name, p.model, p.serial from
  maasserver_node as n, maasserver_blockdevice as d, maasserver_physicalblockdevice as p
  where n.system_id='node-5d6b4822-51cb-11e5-a82c-00e0815b167e' and
  n.id=d.node_id and d.id=p.blockdevice_ptr_id;

 id | id | name | model | serial
----+----+------+------------------+-----------------
  1 | 11 | sda | WDC WD1002FAEX-0 | WD-WCATRC567667
  1 | 12 | sdb | WDC WD20EZRX-00D | WD-WMC4M3038307
(2 rows)

-> compare these two rows with the maas-cli output at #5

I.e., it looks like maas-cli is hitting some kind of cache(?), while the deploy scriptery is somehow getting the actual DB values.

Blake Rouse (blake-rouse) wrote :

We do not pull the disk information from lshw; lshw is very unreliable and normally wrong. Check the output of the block device commissioning script. It will contain a JSON object holding all of the storage information that MAAS injects into the database.
Check that the JSON object matches what is in the database. This file gets generated every time you commission the node; MAAS parses that JSON and updates the database.

So the first step is to determine if that actually ran and got the correct results. Next is to see if that information updated the database correctly.
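
The two checks described above could be scripted roughly as follows. This is a sketch under assumptions, not MAAS code: it assumes the script output is a JSON list of objects with "NAME" and "SERIAL" keys (the exact key names should be confirmed against the real output), that the database rows have been fetched as (name, model, serial) tuples by a query like the one in the earlier comment, and `compare_block_devices` is a hypothetical helper name. The JSON sample is invented for illustration; only the database rows come from this thread.

```python
import json


def compare_block_devices(script_output: str, db_rows):
    """Return (name, commissioned_serial, db_serial) for every mismatch.

    script_output: JSON text from the block device commissioning script,
        ASSUMED to be a list of objects with "NAME" and "SERIAL" keys.
    db_rows: iterable of (name, model, serial) tuples from the database.
    """
    commissioned = {d["NAME"]: d.get("SERIAL") for d in json.loads(script_output)}
    stored = {name: serial for name, _model, serial in db_rows}
    mismatches = []
    # Walk the union so devices present on only one side are reported too.
    for name in sorted(commissioned.keys() | stored.keys()):
        if commissioned.get(name) != stored.get(name):
            mismatches.append((name, commissioned.get(name), stored.get(name)))
    return mismatches


# Hypothetical script output (the real file was not attached to this bug).
EXAMPLE_JSON = json.dumps([
    {"NAME": "sda", "MODEL": "WDC WD1002FAEX-0", "SERIAL": "WD-WCATRC568491"},
    {"NAME": "sdb", "MODEL": "WDC WD20EZRX-00D", "SERIAL": "WD-WMC4M3038307"},
])

# Rows as returned by the database query earlier in this thread.
EXAMPLE_DB_ROWS = [
    ("sda", "WDC WD1002FAEX-0", "WD-WCATRC567667"),
    ("sdb", "WDC WD20EZRX-00D", "WD-WMC4M3038307"),
]

for name, seen, stored in compare_block_devices(EXAMPLE_JSON, EXAMPLE_DB_ROWS):
    print(f"{name}: commissioning saw {seen}, database has {stored}")
```

An empty result would suggest the database was updated from the JSON; any mismatch points at the update step rather than at the commissioning script.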

Changed in maas:
status: Triaged → Incomplete
JuanJo Ciarlante (jjo) wrote :

FYI, in the "Commissioning Output" (UI dropdown) file list there's no file named anything like 'block device'; the only listed files are: http://paste.ubuntu.com/15208922/ . Also FYI, the "Commissioning Summary" YAML does show the correct serial#s.

Changed in maas:
status: Incomplete → Confirmed
Changed in maas:
milestone: 1.9.1 → 1.9.2
Jill Rouleau (jillrouleau) wrote :

I've had a similar issue, but without a hardware change. MAAS was upgraded from 1.8 to 1.9 with a running, juju-deployed environment. That environment was torn down and the nodes were recommissioned as-is in MAAS. 11 of 16 nodes failed to deploy on at least one disk, with the UUID in question being found by the CLI.

https://pastebin.canonical.com/151020/

After deleting all nodes and recommissioning, everything is deployed and on its way to production, so unfortunately I can't test further, but I can provide logs, db dumps, etc.

Gavin Panella (allenap)
Changed in maas:
status: Confirmed → Triaged
Changed in maas:
milestone: 1.9.2 → 1.9.3
Changed in maas:
milestone: 1.9.3 → 1.9.4
Changed in maas:
milestone: 1.9.4 → 1.9.5
Andres Rodriguez (andreserl) wrote :

11:13 < Beret> roaksoax, the bug is that if you swap a usb disk, recommissioning doesn't pick up the change
11:13 < Beret> but if you remove the usb disk, then commission, then insert, then re-commission, it's picked up

Andres Rodriguez (andreserl) wrote :

11:15 < roaksoax> Beret: ok, is this the exact same disk model and type ?
11:15 < roaksoax> Beret: or is it just a different disk altogether ?
11:15 < Beret> in the case where I just experienced it, it was the same disk model, type, and size

Mike Pontillo (mpontillo) wrote :

This issue seems to be a duplicate of bug #1575567, which was fixed in MAAS 1.9.3. We attempted to recreate the bug in the 2.1.1 branch and were not successful.

I see that the MAAS servers mentioned in this bug were on MAAS 1.9.0. Please update them to MAAS 1.9.3 or greater and try again.
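
To check quickly whether a given server predates that fix, a small version comparison like the following may help. It is a sketch only: it assumes the package version string starts with a MAJOR.MINOR.PATCH triple, as in the "1.9.0+bzr4533-0ubuntu1~trusty1" string quoted earlier, and the names `parse_maas_version` and `needs_upgrade` are hypothetical.

```python
import re

# First release reported in this thread as containing the fix.
FIXED_IN = (1, 9, 3)


def parse_maas_version(version: str):
    """Extract the leading MAJOR.MINOR.PATCH triple as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version: str) -> bool:
    """True if the version is older than the release carrying the fix."""
    return parse_maas_version(version) < FIXED_IN


for candidate in ("1.9.0+bzr4533-0ubuntu1~trusty1", "1.9.3", "2.1.1"):
    print(candidate, "needs upgrade:", needs_upgrade(candidate))
```

Everything after the third number (the bzr revision and Ubuntu packaging suffix) is ignored, which is enough for this comparison but not a general Debian version ordering.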
