maas commisioning fails

Bug #1534013 reported by Iben Rodriguez
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Won't Fix
Undecided
Unassigned

Bug Description

Steps to reproduce...
1 - setup a maas server and a empty node to pxeboot.
2 - start the commission process
3 - the empty node pxeboots and os installs
4 - once os installs ssh from maas controller to node works fine
5 - the node has internet access through maas as gateway and using maas dns services
6 - the node shows "commission failed" status

how can we troubleshoot this?

we tried with both maas version 1.8 and 1.9.

the environment is live on the internet if you can help us login and troubleshoot

I am working with Narinder Gupta from Canonical on this issue - he asked me to submit this issue.

ubuntu@maas:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===================================-================================-============-===============================================================================
ii maas 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server provisioning libraries
ubuntu@maas:~$ zip -r ~/maas.zip /var/log/maas/*
  adding: var/log/maas/apache2/ (stored 0%)
  adding: var/log/maas/clusterd.log (deflated 86%)
  adding: var/log/maas/maas.log (stored 0%)
  adding: var/log/maas/maas.log.1 (deflated 94%)
  adding: var/log/maas/maas.log.2.gz (deflated 5%)
  adding: var/log/maas/maas.log.3.gz (deflated 7%)
  adding: var/log/maas/maas.log.4.gz (deflated 1%)
  adding: var/log/maas/maas.log.5.gz (deflated 4%)
  adding: var/log/maas/maas.log.6.gz (deflated 1%)
  adding: var/log/maas/proxy/ (stored 0%)
  adding: var/log/maas/regiond.log (deflated 96%)
  adding: var/log/maas/rsyslog/ (stored 0%)
ubuntu@maas:~$ ls -al /var/log/maas
total 9708
drwxr-xr-x 4 root root 4096 Jan 14 06:47 .
drwxrwxr-x 15 root syslog 4096 Jan 14 06:47 ..
lrwxrwxrwx 1 root root 16 Jan 6 17:38 apache2 -> /var/log/apache2
-rw-r--r-- 1 maas maas 8515 Jan 14 06:14 clusterd.log
-rw-r--r-- 1 syslog syslog 0 Jan 14 06:47 maas.log
-rw-r--r-- 1 syslog syslog 31775 Jan 14 06:44 maas.log.1
-rw-r--r-- 1 syslog syslog 1708 Jan 13 06:40 maas.log.2.gz
-rw-r--r-- 1 syslog syslog 1345 Jan 10 06:45 maas.log.3.gz
-rw-r--r-- 1 syslog syslog 1729 Jan 9 06:45 maas.log.4.gz
-rw-r--r-- 1 syslog syslog 1792 Jan 8 06:45 maas.log.5.gz
-rw-r--r-- 1 syslog syslog 6388 Jan 7 06:19 maas.log.6.gz
drwxr-x--- 2 proxy proxy 4096 Jan 14 06:47 proxy
-rw-r--r-- 1 maas maas 9848693 Jan 14 06:47 regiond.log
drwxr-xr-x 2 syslog syslog 4096 Jan 5 17:45 rsyslog
ubuntu@maas:~$ ubuntu@maas:~$

Revision history for this message
Iben Rodriguez (78luphr0rnk2nuqimstywepozxn9kl19tqh0tx66b5dki1xxsh5mkz9gl21a5rlwfnr8jn6-launchpad) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Iben,

In the regiond.log I see:

Jan 14 06:13:29 maas maas.node: [INFO] bootstrap: Status transition from FAILED_COMMISSIONING to COMMISSIONING
Jan 14 06:13:29 maas maas.power: [INFO] Changing power state (on) of node: bootstrap (node-689fa9b8-b490-11e5-a57b-2cc260100009)
Jan 14 06:13:29 maas maas.node: [INFO] bootstrap: Commissioning started
[...]
Jan 14 06:33:29 maas maas.node: [INFO] bootstrap: Status transition from COMMISSIONING to FAILED_COMMISSIONING
Jan 14 06:33:29 maas maas.node: [ERROR] bootstrap: Marking node failed: Node operation 'Commissioning' timed out after 0:20:00.

In the clusterd.log I see:

2016-01-14 06:13:41+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 19705): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2016-01-14 06:13:41+0000 [ClusterClient,client] RemoteOriginReadSession starting on 39738
2016-01-14 06:13:41+0000 [ClusterClient,client] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f7a3874fb90>
2016-01-14 06:13:41+0000 [RemoteOriginReadSession (UDP)] Final ACK received, transfer successful
2016-01-14 06:13:41+0000 [-] (UDP Port 39738 Closed)
2016-01-14 06:13:41+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f7a3874fb90>
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49152): <RRQDatagram(filename=pxelinux.cfg/00000000-0000-0000-0000-000000000000, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49153): <RRQDatagram(filename=pxelinux.cfg/01-2c-c2-60-20-00-13, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49153): <RRQDatagram(filename=pxelinux.cfg/01-2c-c2-60-20-00-13, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49153): <RRQDatagram(filename=pxelinux.cfg/01-2c-c2-60-20-00-13, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2016-01-14 06:13:42+0000 [HTTPPageGetter,client] RemoteOriginReadSession starting on 42410
[...]
2016-01-14 06:13:42+0000 [RemoteOriginReadSession (UDP)] Final ACK received, transfer successful
2016-01-14 06:13:42+0000 [-] (UDP Port 42410 Closed)
2016-01-14 06:13:42+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f7a397ba878>
2016-01-14 06:13:42+0000 [TFTP (UDP)] Datagram received from ('10.20.1.101', 49154): <RRQDatagram(filename=ubuntu/amd64/generic/trusty/daily/boot-kernel, mode=octet, options={'tsize': '0', 'blksize': '1408'})>

Which leads me to believe the machine, after set to COMMISSIONING, it tried to PXE boot, however, the commissioning process timeout out after 20 mins.

Things to consider:

Are you working behind a proxy or behind a firewall that might be preventing the macjhine from accessing the internet?

What's needed from you:

 - Console log, to see whether it is booting fine or where it is failing in the boot process/commissioning process.
 - Full node event log.

Changed in maas:
status: New → Incomplete
milestone: none → 2.0.0
milestone: 2.0.0 → 1.9.1
Changed in maas:
milestone: 1.9.1 → 1.9.2
Changed in maas:
milestone: 1.9.2 → 1.9.3
Changed in maas:
milestone: 1.9.3 → 1.9.4
Changed in maas:
milestone: 1.9.4 → 1.9.5
Revision history for this message
Andres Rodriguez (andreserl) wrote :

We believe that this is not longer an issue in the latest releases of MAAS. If you believe this is still an issue, please re-open this bug report and target it accordingly.

Changed in maas:
status: Incomplete → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.