Commissioning x86_64 node never completes, sitting at grub prompt, pserv py tbs
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| MAAS | | Critical | Raphaël Badin | |
| 1.8 | | Critical | Raphaël Badin | |
| Trunk | | Critical | Raphaël Badin | |
| python-tx-tftp | New | Undecided | Gavin Panella | |
| python-tx-tftp (Ubuntu) | | Undecided | Unassigned | |
| Trusty | | Undecided | Unassigned | |
| Utopic | | Undecided | Unassigned | |
| Vivid | | Undecided | Unassigned | |
Bug Description
[Impact]
When TFTP booting with UEFI, the TFTP server would raise a stack trace when a transfer was terminated. This led to boot failures on UEFI machines.
[Test Case]
1. Install MAAS
2. Set up UEFI on the machine to PXE boot from MAAS.
3. UEFI boot the machine; it will fail as tftp crashes.
4. With the fix, UEFI boot the machine; it will succeed as tftp doesn't crash.
[Regression Potential]
Minimal. This has been tested and QA'd and proven to work as expected.
ubuntu 14.04LTS + MaaS 1.5 on x86_64
Controller:
esxi vm xeon + vmnet3/ixgbe
Nodes:
supermicro twinblades
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
128GB RAM
2@ ige
2@ ixgbe <<< used for PXE booting
Trying to add physical nodes configured for Trusty Tahr amd64. IPMI powerctl cycles the node, tftp's two boot files, then commissioning goes out to lunch:
15:12:11.465976 IP 0.0.0.0.bootpc > 255.255.
15:12:11.468982 IP 172.30.
15:12:11.475270 IP 172.30.255.101.1294 > 172.30.193.38.tftp: 41 RRQ "bootx64.efi" octet tsize 0 blksize 1468
15:12:11.535326 IP 172.30.255.101.1295 > 172.30.193.38.tftp: 33 RRQ "bootx64.efi" octet blksize 1468
15:12:12.024716 IP 172.30.255.101.1296 > 172.30.193.38.tftp: 33 RRQ "/grubx64.efi" octet blksize 512
These tb's coincide with above traffic and node sitting at the grub prompt indefinitely:
2014-05-08 15:12:11-0700 [-] Starting protocol <tftp.bootstrap
2014-05-08 15:12:11-0700 [RemoteOriginRe
Traceback (most recent call last):
File "/usr/lib/
return context.
File "/usr/lib/
return self.currentCon
File "/usr/lib/
return func(*args,**kw)
File "/usr/lib/
why = selectable.doRead()
--- <exception caught here> ---
File "/usr/lib/
self.
File "/usr/lib/
datagram = TFTPDatagramFac
File "/usr/lib/
return datagram_
File "/usr/lib/
raise InvalidErrorcod
tftp.errors.
2014-05-08 15:12:11-0700 [RemoteOriginRe
2014-05-08 15:12:11-0700 [TFTP (UDP)] Datagram received from ('172.30.255.101', 1295): <RRQDatagram(
2014-05-08 15:12:11-0700 [TFTP (UDP)] Datagram received from ('172.30.255.101', 1295): <RRQDatagram(
2014-05-08 15:12:11-0700 [-] RemoteOriginRea
2014-05-08 15:12:11-0700 [-] RemoteOriginRea
2014-05-08 15:12:11-0700 [-] Starting protocol <tftp.bootstrap
2014-05-08 15:12:11-0700 [-] Starting protocol <tftp.bootstrap
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [-] (UDP Port 43143 Closed)
2014-05-08 15:12:12-0700 [-] (UDP Port 43143 Closed)
2014-05-08 15:12:12-0700 [-] Stopping protocol <tftp.bootstrap
2014-05-08 15:12:12-0700 [-] Stopping protocol <tftp.bootstrap
2014-05-08 15:12:12-0700 [TFTP (UDP)] Datagram received from ('172.30.255.101', 1296): <RRQDatagram(
2014-05-08 15:12:12-0700 [TFTP (UDP)] Datagram received from ('172.30.255.101', 1296): <RRQDatagram(
2014-05-08 15:12:12-0700 [-] RemoteOriginRea
2014-05-08 15:12:12-0700 [-] RemoteOriginRea
2014-05-08 15:12:12-0700 [-] Starting protocol <tftp.bootstrap
2014-05-08 15:12:12-0700 [-] Starting protocol <tftp.bootstrap
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [RemoteOriginRe
2014-05-08 15:12:12-0700 [-] (UDP Port 56400 Closed)
2014-05-08 15:12:12-0700 [-] (UDP Port 56400 Closed)
2014-05-08 15:12:12-0700 [-] Stopping protocol <tftp.bootstrap
2014-05-08 15:12:12-0700 [-] Stopping protocol <tftp.bootstrap
2014-05-08 15:12:13-0700 [-] Unhandled Error
Traceback (most recent call last):
File "/usr/lib/
self.config, oldstdout, oldstderr, self.profiler, reactor)
File "/usr/lib/
reactor.run()
File "/usr/lib/
self.
File "/usr/lib/
self.
--- <exception caught here> ---
File "/usr/lib/
call.
File "/usr/lib/
self.
File "/usr/lib/
return self.socket.
exceptions.
2014-05-08 15:12:13-0700 [-] Logged OOPS id OOPS-4ad4c1419556eb88cc72311fd54f737b: AttributeError: 'Port' object has no attribute 'socket'
Nodes and controller are on the same untagged subnet but there is an lldp'd link between the bladeserver's onboard xgb switches and the controller's connected xgb Arista.
root@pre-
Desired=
| Status=
|/ Err?=(none)
||/ Name Version Architecture Description
+++-===
ii maas 1.5+bzr2252-
ii maas-cli 1.5+bzr2252-
ii maas-cluster-
ii maas-common 1.5+bzr2252-
ii maas-dhcp 1.5+bzr2252-
ii maas-dns 1.5+bzr2252-
ii maas-region-
ii maas-region-
ii python-django-maas 1.5+bzr2252-
ii python-maas-client 1.5+bzr2252-
ii python-
Repro:
This is a pretty standard initial configuration afaict, following the provided instructions. I notice there are no grub.cfg-* anywhere, only the grub.cfg template. Could that be why none of the nodes are doing anything once they're in the grub shell?
root@pre-
# MAAS GRUB2 pre-loader configuration file
# Load based on MAC address first.
configfile (pxe)/grub/
# Failed to load based on MAC address.
# Load amd64 by default, UEFI only supported by 64-bit
configfile (pxe)/grub/
root@pre-
total 4
-rw-r--r-- 1 root root 270 May 6 18:23 grub.cfg
root@pre-
/boot/grub/grub.cfg
/usr/share/
/var/lib/
The controller VM is connected to an unrouted internal private network and to the external lab network, which is not used by MaaS. Nodes are only connected to the private n/w. The controller manages tftp, dhcp and dns, and the ip helper points to its private IP.
Nodes are configured for 'Default Ubuntu Release' Trusty Tahr. Boot images:
4 trusty amd64 generic commissioning release May 6, 2014, 6:23 p.m.
7 trusty amd64 generic install release May 6, 2014, 6:23 p.m.
3 trusty amd64 generic xinstall release May 6, 2014, 6:23 p.m.
5 trusty i386 generic commissioning release May 6, 2014, 6:23 p.m.
12 trusty i386 generic install release May 6, 2014, 6:23 p.m.
9 trusty i386 generic xinstall release May 6, 2014, 6:23 p.m.
6 precise amd64 generic commissioning release May 6, 2014, 6:23 p.m.
11 precise amd64 generic install release May 6, 2014, 6:23 p.m.
10 precise amd64 generic xinstall release May 6, 2014, 6:23 p.m.
2 precise i386 generic commissioning release May 6, 2014, 6:23 p.m.
8 precise i386 generic install release May 6, 2014, 6:23 p.m.
1 precise i386 generic xinstall release May 6, 2014, 6:23 p.m.
Related branches
- Gavin Panella (community): Approve on 2015-06-15
  Diff: 110 lines (+49/-2), 4 files modified
  src/provisioningserver/monkey.py (+13/-0)
  src/provisioningserver/plugin.py (+5/-1)
  src/provisioningserver/tests/test_monkey.py (+22/-1)
  src/provisioningserver/tests/test_plugin.py (+9/-0)
- Raphaël Badin (community): Approve on 2015-06-15
  Diff: 110 lines (+49/-2), 4 files modified
  src/provisioningserver/monkey.py (+13/-0)
  src/provisioningserver/plugin.py (+5/-1)
  src/provisioningserver/tests/test_monkey.py (+22/-1)
  src/provisioningserver/tests/test_plugin.py (+9/-0)
| Jason Brink (jason-brink) wrote : | #1 |
| Julian Edwards (julian-edwards) wrote : | #2 |
| tags: | added: server-hwe |
| Changed in maas: | |
| assignee: | nobody → Andres Rodriguez (andreserl) |
| status: | New → Triaged |
| importance: | Undecided → Critical |
| Blake Rouse (blake-rouse) wrote : | #3 |
Looks like grub is having an issue with either your network card or a misrouting on the network. Would it be possible to try booting the machine using PXELINUX instead of UEFI? If that works then we can confirm that it is a grubnetx64.efi issue, and can link it to that package.
| Changed in maas: | |
| status: | Triaged → Incomplete |
| assignee: | Andres Rodriguez (andreserl) → nobody |
| Launchpad Janitor (janitor) wrote : | #4 |
[Expired for MAAS because there has been no activity for 60 days.]
| Changed in maas: | |
| status: | Incomplete → Expired |
| Changed in maas: | |
| status: | Expired → Confirmed |
| Raphaël Badin (rvb) wrote : | #5 |
The stacktrace is there because MAAS' tftp implementation doesn't know about this error (it seems it's something new). But the original problem is that the TFTP request contains an option that isn't supported by the server. We need to figure out what this option is (a full tcpdump would be helpful here).
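For readers unfamiliar with the wire format, here is a minimal sketch of decoding a TFTP ERROR datagram, including the code 8 that RFC 2347 adds on top of the RFC 1350 codes 0-7. It is not the python-tx-tftp implementation; the names `TFTP_ERROR_CODES` and `parse_error_datagram` are made up for illustration.

```python
import struct

# Error codes 0-7 come from RFC 1350; code 8 was added by RFC 2347.
TFTP_ERROR_CODES = {
    0: "Not defined, see error message",
    1: "File not found",
    2: "Access violation",
    3: "Disk full or allocation exceeded",
    4: "Illegal TFTP operation",
    5: "Unknown transfer ID",
    6: "File already exists",
    7: "No such user",
    8: "Terminate transfer due to option negotiation",
}

OP_ERROR = 5  # TFTP opcode for ERROR packets


def parse_error_datagram(payload):
    """Return (code, standard meaning, message sent by the peer).

    Hypothetical helper, not part of python-tx-tftp.
    """
    if len(payload) < 5:
        raise ValueError("ERROR datagram too short")
    opcode, code = struct.unpack("!HH", payload[:4])
    if opcode != OP_ERROR:
        raise ValueError("not an ERROR datagram (opcode %d)" % opcode)
    # The message is a NUL-terminated string; tolerate a missing terminator.
    message = payload[4:].split(b"\x00", 1)[0].decode("ascii", "replace")
    # Unknown codes are reported rather than raised, so a new code from a
    # client cannot turn into a traceback in the server.
    meaning = TFTP_ERROR_CODES.get(code, "unknown error code")
    return code, meaning, message
```

A lookup that falls back to "unknown error code" instead of raising is one way to keep an unfamiliar code out of the traceback path; whether the real fix takes that approach or simply registers code 8 is not shown in this thread.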
| Gavin Panella (allenap) wrote : | #6 |
In absence of a traffic dump, try applying e8.patch to the cluster:
cd /usr/share/
sudo patch -2 < .../e8.patch
sudo restart maas-clusterd
This should give us more information about the error being sent from the remote system.
| Gavin Panella (allenap) wrote : | #7 |
> cd /usr/share/
> sudo patch -2 < .../e8.patch
> sudo restart maas-clusterd
That should read:
cd /usr/share/
sudo patch -p2 < .../e8.patch
sudo restart maas-clusterd
| Changed in maas: | |
| milestone: | none → 1.7.2 |
| Patrick Mullaney (pm-mullaney) wrote : | #8 |
Attached a log showing the original error and then a boot attempt with the patch applied (starting at 2015-02-03 17:40:17).
| Raphaël Badin (rvb) wrote : | #9 |
It seems the patch worked as expected:
2015-02-03 17:40:17+0000 [TFTP (UDP)] Datagram received from ('10.61.163.200', 1161): <RRQDatagram(
2015-02-03 17:40:17+0000 [TFTP (UDP)] Datagram received from ('10.61.163.200', 1161): <RRQDatagram(
2015-02-03 17:40:17+0000 [-] RemoteOriginRea
2015-02-03 17:40:17+0000 [-] RemoteOriginRea
2015-02-03 17:40:17+0000 [-] Starting protocol <tftp.bootstrap
2015-02-03 17:40:17+0000 [-] Starting protocol <tftp.bootstrap
2015-02-03 17:40:17+0000 [RemoteOriginRe
2015-02-03 17:40:17+0000 [RemoteOriginRe
First attempt to transfer bootx64.efi, fails with the "Terminate transfer due to option negotiation" error (no stacktrace thanks to the patch)
2015-02-03 17:40:17+0000 [-] (UDP Port 51426 Closed)
2015-02-03 17:40:17+0000 [-] (UDP Port 51426 Closed)
2015-02-03 17:40:17+0000 [-] Stopping protocol <tftp.bootstrap
2015-02-03 17:40:17+0000 [-] Stopping protocol <tftp.bootstrap
2015-02-03 17:40:17+0000 [TFTP (UDP)] Datagram received from ('10.61.163.200', 1162): <RRQDatagram(
2015-02-03 17:40:17+0000 [TFTP (UDP)] Datagram received from ('10.61.163.200', 1162): <RRQDatagram(
2015-02-03 17:40:17+0000 [-] RemoteOriginRea
2015-02-03 17:40:17+0000 [-] RemoteOriginRea
2015-02-03 17:40:17+0000 [-] Starting protocol <tftp.bootstrap
2015-02-03 17:40:17+0000 [-] Starting protocol <tftp.bootstrap
2015-02-03 17:40:17+0000 [RemoteOriginRe
2015-02-03 17:40:17+0000 [RemoteOriginRe
Second attempt to transfer the same file (bootx64.efi); this time the transfer is successful (note how the original request didn't contain the 'tsize' this time, probably because the value was '0' the first time, and that's why the first request failed)
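To make the difference between the two requests concrete, here is a rough sketch that parses a read request (RRQ) into its filename, mode and options. The two sample payloads are rebuilt from the fields tcpdump printed in the original report; their lengths come out at 41 and 33 bytes, matching that capture. `parse_rrq` is a hypothetical helper, not python-tx-tftp code. For context, RFC 2349 describes a client sending `tsize` with a value of 0 in a read request so that the server reports the real file size in its OACK, so the 0 may be a size query rather than a bad value; the thread does not settle this.

```python
OP_RRQ = 1  # TFTP opcode for read requests


def parse_rrq(payload):
    """Split a TFTP read request into (filename, mode, options)."""
    if len(payload) < 4 or int.from_bytes(payload[:2], "big") != OP_RRQ:
        raise ValueError("not an RRQ datagram")
    # Everything after the opcode is a run of NUL-terminated strings.
    fields = payload[2:].split(b"\x00")
    if fields[-1] != b"":
        raise ValueError("RRQ is not NUL-terminated")
    fields = [f.decode("ascii") for f in fields[:-1]]
    if len(fields) < 2 or len(fields) % 2:
        raise ValueError("malformed RRQ")
    filename, mode = fields[0], fields[1].lower()
    # Option names are case-insensitive (RFC 2347); values stay as strings.
    options = {k.lower(): v for k, v in zip(fields[2::2], fields[3::2])}
    return filename, mode, options


def build_sample(*strings):
    """Assemble an RRQ payload from its string fields (illustration only)."""
    return b"\x00\x01" + b"".join(s.encode("ascii") + b"\x00" for s in strings)


first = build_sample("bootx64.efi", "octet", "tsize", "0", "blksize", "1468")
retry = build_sample("bootx64.efi", "octet", "blksize", "1468")

print(len(first), parse_rrq(first))
print(len(retry), parse_rrq(retry))
```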
| Gavin Panella (allenap) wrote : | #10 |
It's a shame we're not getting any details from the remote system as to why it doesn't like the negotiated options. Anyway, this seems to work (right?) so I'll get it landed.
| Gavin Panella (allenap) wrote : | #11 |
Filed upstream as https:/
(I can't set this as the bug URL for the python-tx-tftp task.)
| no longer affects: | python-tx-tftp |
| Changed in python-tx-tftp: | |
| assignee: | nobody → Gavin Panella (allenap) |
| Changed in maas: | |
| assignee: | nobody → Gavin Panella (allenap) |
| status: | Confirmed → In Progress |
| Gavin Panella (allenap) wrote : | #12 |
A fix has landed upstream (https:/
The attachment "Add error code 8 to python-tx-tftp." seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.
[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]
| tags: | added: patch |
| tags: | removed: patch |
| Changed in maas: | |
| assignee: | Gavin Panella (allenap) → Andres Rodriguez (andreserl) |
| Changed in maas: | |
| milestone: | 1.7.2 → 1.7.3 |
| Chirayu Patel (chirayup) wrote : | #14 |
I have the exact same issue when the controller is a VMWare ESX VM. Has this been fixed yet?
| Chirayu Patel (chirayup) wrote : | #15 |
I do not understand how some files are transferred and the pxelinux file fails.
2015-06-08 22:33:31-0700 [TFTP (UDP)] Datagram received from ('192.168.2.100', 1436): <RRQDatagram(
2015-06-08 22:33:31-0700 [-] RemoteOriginRea
2015-06-08 22:33:31-0700 [-] Starting protocol <tftp.bootstrap
2015-06-08 22:33:32-0700 [RemoteOriginRe
2015-06-08 22:33:32-0700 [-] (UDP Port 34727 Closed)
2015-06-08 22:33:32-0700 [-] Stopping protocol <tftp.bootstrap
2015-06-08 22:33:33-0700 [TFTP (UDP)] Datagram received from ('192.168.2.100', 1437): <RRQDatagram(
2015-06-08 22:33:33-0700 [-] RemoteOriginRea
2015-06-08 22:33:33-0700 [-] Starting protocol <tftp.bootstrap
2015-06-08 22:33:33-0700 [RemoteOriginRe
2015-06-08 22:33:33-0700 [RemoteOriginRe
2015-06-08 22:33:34-0700 [RemoteOriginRe
2015-06-08 22:33:34-0700 [-] (UDP Port 44836 Closed)
2015-06-08 22:33:34-0700 [-] Stopping protocol <tftp.bootstrap
2015-06-08 22:33:34-0700 [-] Unhandled Error
Traceback (most recent call last):
File "/usr/lib/
File "/usr/lib/
File "/usr/lib/
File "/usr/lib/
--- <exception caught here> ---
File "/usr/lib/
File "/usr/lib/
File "/usr/lib/
return self.socket.
Jun 8 22:34:17 maas-poc maas.lease_
| Launchpad Janitor (janitor) wrote : | #16 |
Status changed to 'Confirmed' because the bug affects multiple users.
| Changed in python-tx-tftp (Ubuntu): | |
| status: | New → Confirmed |
| Launchpad Janitor (janitor) wrote : | #17 |
This bug was fixed in the package python-tx-tftp - 0.1~bzr38-0ubuntu4
---------------
python-tx-tftp (0.1~bzr38-
* debian/
used to terminate a transfer due to option negotiation.
See RFC 2347, "TFTP Option Extension". (LP: #1317705)
-- Andres Rodriguez <email address hidden> Mon, 22 Jun 2015 12:33:26 -0400
| Changed in python-tx-tftp (Ubuntu): | |
| status: | Confirmed → Fix Released |
| description: | updated |
Hello Jason, or anyone else affected,
Accepted python-tx-tftp into vivid-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-
Further information regarding the verification process can be found at https:/
| Changed in python-tx-tftp (Ubuntu Utopic): | |
| status: | New → Won't Fix |
| Changed in python-tx-tftp (Ubuntu Vivid): | |
| status: | New → Fix Committed |
| tags: | added: verification-needed |
| Changed in python-tx-tftp (Ubuntu Trusty): | |
| status: | New → Fix Committed |
| Chris J Arges (arges) wrote : | #19 |
Hello Jason, or anyone else affected,
Accepted python-tx-tftp into trusty-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-
Further information regarding the verification process can be found at https:/
| Andres Rodriguez (andreserl) wrote : | #20 |
This has been tested and verified. It works as expected. Marking it verification-done.
| tags: | added: verification-done |
| tags: | removed: verification-needed |
| Launchpad Janitor (janitor) wrote : | #21 |
This bug was fixed in the package python-tx-tftp - 0.1~bzr38-
---------------
python-tx-tftp (0.1~bzr38-
* debian/
used to terminate a transfer due to option negotiation.
See RFC 2347, "TFTP Option Extension". (LP: #1317705)
-- Andres Rodriguez <email address hidden> Mon, 22 Jun 2015 12:33:26 -0400
| Changed in python-tx-tftp (Ubuntu Vivid): | |
| status: | Fix Committed → Fix Released |
The verification of the Stable Release Update for python-tx-tftp has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
| Paul Gear (paulgear) wrote : | #23 |
@brian-murray, it appears this was not released into trusty-updates as mentioned above:
root@myserver:~$ apt-cache policy python-txtftp
python-txtftp:
Installed: 0.1~bzr38-
Candidate: 0.1~bzr38-
Version table:
*** 0.1~bzr38-
500 http://
100 /var/lib/
0.
500 http://
I've installed it from -proposed and confirmed that it fixes the oops on one system.
| Brian Murray (brian-murray) wrote : | #24 |
@paulgear - it wasn't released because of the other bug, LP: #1476175, which still needed verification at the time I made my comment.
| Launchpad Janitor (janitor) wrote : | #25 |
This bug was fixed in the package python-tx-tftp - 0.1~bzr38-
---------------
python-tx-tftp (0.1~bzr38-
* debian/
used to terminate a transfer due to option negotiation.
See RFC 2347, "TFTP Option Extension". (LP: #1317705)
python-tx-tftp (0.1~bzr38-
* debian/
counter back to 0 after it reaches 2^16. (LP: #1476175)
-- Andres Rodriguez <email address hidden> Mon, 22 Jun 2015 12:33:26 -0400
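As background for the second changelog entry (LP: #1476175): TFTP carries the block number in a 16-bit field, so a transfer longer than 65535 blocks has to roll the counter over. Below is a minimal sketch of that rollover, assuming the wrap-to-0 behaviour the changelog describes; the actual patch is not shown in this thread, and `next_block` and `data_header` are illustrative names.

```python
import struct

OP_DATA = 3               # TFTP opcode for DATA packets
BLOCK_MODULUS = 1 << 16   # block numbers travel in a 16-bit field


def next_block(block):
    """Advance the block counter, rolling over to 0 after 65535."""
    return (block + 1) % BLOCK_MODULUS


def data_header(block):
    """Pack the 4-byte DATA header: opcode 3, then the block number."""
    return struct.pack("!HH", OP_DATA, block)


# At the boundary the counter wraps instead of overflowing the field.
print(next_block(65535), data_header(next_block(65535)))
```

Without the modulo, the counter would reach 65536 and `struct.pack("!HH", ...)` would raise `struct.error`, since the value no longer fits in 16 bits. At the blksize of 1468 seen in the capture, that boundary is reached after roughly 96 MB of data.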
| Changed in python-tx-tftp (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
| no longer affects: | maas/1.7 |


Looks related to the recent UEFI work, passing over to Andres.