Commissioning fails on NUCs previously loaded with CoreOS

Bug #1645872 reported by Bob Wise
This bug affects 1 person
Affects: MAAS
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We have a bunch of NUCs in our lab. Some of them always fail to commission, despite otherwise identical BIOS, settings, and AMT configuration, all in the same rack, on the same controller, on the same switch.

The fix that works 100% of the time is to live-boot Ubuntu 16.04 from USB and then wipe the start of the SSD with "dd if=/dev/zero of=/dev/sda bs=1M count=1", although I'm pretty sure a much smaller block size would get the job done.

The only common thread among these NUCs is they previously had coreos installed on them.

Once the SSD is wiped, commissioning works from that point onward.
Seems like this is something the initial MAAS boot image should or could do by way of hygiene.
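
As a rough illustration of the hygiene step suggested here, the following is a minimal Python sketch of the same wipe the dd command performs. It is not MAAS code. The function name `wipe_start` and the 1 MiB default are assumptions taken from the workaround above; the report does not establish which on-disk structure left by CoreOS actually causes the failure, so the amount that really needs zeroing is unknown.

```python
import os


def wipe_start(target: str, length: int = 1024 * 1024) -> int:
    """Overwrite the first `length` bytes of `target` with zeros.

    `target` may be a block device such as /dev/sda (requires root) or a
    regular file. Hypothetical helper mirroring the manual dd workaround;
    a regular file shorter than `length` would be extended.
    """
    chunk = bytes(64 * 1024)  # 64 KiB of zeros, written repeatedly
    written = 0
    # "r+b" opens the existing target for writing without truncating it.
    with open(target, "r+b", buffering=0) as dev:
        while written < length:
            remaining = length - written
            # Unbuffered writes may be partial, so count what was written.
            written += dev.write(chunk[:remaining])
        os.fsync(dev.fileno())  # make sure the zeros reach the device
    return written
```

From a live environment this would be called as `wipe_start("/dev/sda")`, which destroys the partition table on that disk, so the device path should be double-checked first.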

The MAAS UI provides very little debug information about the actual commissioning failure.
=================
/var/log during the failed commissioning (nuc-17 is our "coreos was installed last" test case):

Nov 29 13:36:52 nuc-20 maas.node: [info] nuc-17: Status transition from FAILED_COMMISSIONING to COMMISSIONING
Nov 29 13:36:52 nuc-20 maas.power: [info] Changing power state (on) of node: nuc-17 (6nhkyc)
Nov 29 13:36:52 nuc-20 maas.node: [info] nuc-17: Commissioning started
Nov 29 13:38:06 nuc-20 maas.power: [info] Changed power state (on) of node: nuc-17 (6nhkyc)
Nov 29 13:39:12 nuc-20 maas.power: [info] nuc-17: Power state has changed from on to off.
Nov 29 13:50:10 nuc-20 maas.import-images: [info] Started importing boot images.
Nov 29 13:50:10 nuc-20 maas.import-images: [info] Downloading image descriptions from http://localhost:5240/MAAS/images-stream/streams/v1/index.json
Nov 29 13:50:11 nuc-20 maas.import-images: [info] Updating boot image iSCSI targets.
Nov 29 13:50:11 nuc-20 maas.import-images: [info] Finished importing boot images, the region does not have any new images.
Nov 29 13:57:55 nuc-20 maas.node: [error] nuc-17: Marking node failed: Machine operation 'Commissioning' timed out after 20 minutes.
Nov 29 13:57:55 nuc-20 maas.node: [info] nuc-17: Status transition from COMMISSIONING to FAILED_COMMISSIONING
=================
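
To make the timeline in a log excerpt like the one above easier to read, here is a small Python sketch that pulls out the lines mentioning one node and parses their timestamps. The line layout is inferred only from this excerpt, the function name `node_events` is hypothetical, and because syslog timestamps carry no year, the `year` parameter is a placeholder assumption that only matters for computing differences.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt: "<Mon> <day> <time> <host> maas.<x>: [<level>] <msg>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<logger>maas\.[\w-]+): \[(?P<level>\w+)\] (?P<msg>.*)$"
)


def node_events(lines, node, year=2000):
    """Return (timestamp, level, message) tuples for lines mentioning `node`.

    `year` is a placeholder: the log lines do not record one.
    """
    node_re = re.compile(rf"\b{re.escape(node)}\b")
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Only the message part is searched, not the logging host's name.
        if not m or not node_re.search(m["msg"]):
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["level"], m["msg"]))
    return events
```

Applied to the excerpt, subtracting the timestamp of the first nuc-17 event from that of the timeout error gives the wall-clock time the node spent in COMMISSIONING before MAAS marked it failed.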

dpkg output...
------------
ii maas 2.1.1+bzr5544-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.1.1+bzr5544-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.1.1+bzr5544-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.1.1+bzr5544-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.1.1+bzr5544-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Bob,

Could you please attempt a commissioning and attach:

1. The rsyslog messages for the specific commissioning action you started:
/var/log/maas/rsyslog/<machine-name>/<date>/messages

2. Enable the SSH option during commissioning, SSH into the machine, and grab:
/var/log/cloud-init{-output}.log

3. Provide the machine's node event log.
Go to the MAAS WebUI, go to the machine's details page, go to "Latest node events", click on "View full history".

Thanks.

Changed in maas:
status: New → Incomplete

Bob Wise (countspongebob) wrote :

MAAS UI log from failed provisioning attached.

Bob Wise (countspongebob) wrote :

messages logfile from MAAS server, as requested

Bob Wise (countspongebob) wrote :

Multiple attempts to capture the log from the instance have failed. In each case I check the "enable SSH and prevent shutdown" option and can observe the "booting under MAAS direction" output on the console. The system then boots with lots of console output and, after a couple of minutes, shuts down again.

Bob Wise (countspongebob) wrote :

I can see one "failed" line flashing past near the end of the cycle. Using slow-motion capture (yeah, not kidding!) from my phone, I can see that the "FAILED" message is "Failed unmounting /lib/modules."

Bob Wise (countspongebob) wrote :

I could probably provide a video of the console output as an attachment if there is interest/need.
