HP ProLiant DL380 G7 fetches the kernel over TFTP, but the initrd request causes a traceback in the TFTP server. DL380 G6 succeeds.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Invalid | High | Unassigned |
python-tx-tftp (Ubuntu) | Fix Released | Undecided | Andres Rodriguez |
Bug Description
I've had a MAAS running with the Precise SRU PPA for a short while now, and it successfully controlled an HP ProLiant DL380 G6 consistently and repeatedly. Unfortunately, when I tried to replace it with a DL380 G7 that has more resources, I found that the commissioning process hangs in precisely the same spot nearly every time.
I'm running my own DHCP server, as well as DNS.
The G7 has successfully booted the commissioning instance a grand total of twice in the past 48 hours, and I've left it to boot-loop for extended periods (though not the whole 48 hours, as that's a bit rough on the hardware). It nearly always successfully TFTPs the kernel, but the request for the initrd times out, and pxelinux reboots with the "Boot failed: press a key to retry, or wait for reset..." message.
I have tried this with precise and quantal commissioning images, with no noticeable difference.
Running tcpdump shows that the kernel RRQ proceeds fine, although tcpdump is convinced that the checksums are mostly wrong on the sent packets. The tftp ACK-like packets make it back, though, and the next chunk seems to be sent, and pxelinux seems to be fine with it.
But when the initrd is read, we get this:
10:23:27.195148 IP (tos 0x0, ttl 64, id 3701, offset 0, flags [none], proto UDP (17), length 103)
10.
10:23:27.196704 IP (tos 0x0, ttl 64, id 12499, offset 0, flags [DF], proto UDP (17), length 58)
10.
10:23:28.196744 IP (tos 0x0, ttl 64, id 12500, offset 0, flags [DF], proto UDP (17), length 58)
10.
10:23:31.198995 IP (tos 0x0, ttl 64, id 12501, offset 0, flags [DF], proto UDP (17), length 58)
10.
10:23:31.971823 IP (tos 0x0, ttl 64, id 3702, offset 0, flags [none], proto UDP (17), length 32)
10.
This is pretty much always what happens. My node requests the initrd (quantal, in this case), and instead of the usual full-frame 1404-byte packet we get a few 30-byte messages with the wrong checksum (and tcpdump's checksum shows that these are identical payloads), and then the whole thing times out.
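The "length" values in the tcpdump output are IP total lengths, so the UDP payload size can be recovered by subtracting the headers. A minimal sketch of that arithmetic, assuming 20-byte IPv4 headers with no options (the helper name is illustrative only):

```python
IPV4_HEADER = 20  # bytes; assumes the IP header carries no options
UDP_HEADER = 8    # bytes; the UDP header has a fixed size


def udp_payload_len(ip_total_length):
    """Return the UDP payload size for a tcpdump 'length N' IP total length."""
    return ip_total_length - IPV4_HEADER - UDP_HEADER


# Lengths taken from the captures in this report.
for ip_len in (58, 32, 1432):
    print(ip_len, "->", udp_payload_len(ip_len))
```

Under that assumption, the length-58 packets are the 30-byte messages, length 1432 is a full 1404-byte packet, and length 32 leaves 4 bytes, which is the size of a TFTP ACK (2-byte opcode plus 2-byte block number).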
This of course corresponds with an OOPS:
-rw-r--r-- 1 maas maas 1574 Mar 15 10:23 OOPS-a085f6f73404bd8d66b7029e10e42081
Traceback (most recent call last):
File "/usr/lib/
return context.
File "/usr/lib/
return self.currentCon
File "/usr/lib/
return func(*args,**kw)
File "/usr/lib/
why = selectable.doRead()
--- <exception caught here> ---
File "/usr/lib/
self.
File "/usr/lib/
return self._datagramR
File "/usr/lib/
return self.tftp_
File "/usr/lib/
self.
File "/usr/lib/
raise Spent("This SequentialCall has already timed out")
tftp.util.Spent: This SequentialCall has already timed out
(apologies if any important binary data is lost here. I can attach it if the format is significant. I figured the traceback alone ought to be enough)
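For readers unfamiliar with the error, here is a minimal standalone sketch of the kind of state the message describes: a timeout schedule that can be walked only once and refuses to be restarted afterwards. This is not python-tx-tftp's code. `OneShotSchedule`, its methods and the timeout values are hypothetical; only the `Spent` name and the message text are taken from the traceback above.

```python
class Spent(Exception):
    """Raised when an exhausted schedule is asked to run again."""


class OneShotSchedule:
    """Hypothetical stand-in: walks a list of timeouts exactly once."""

    def __init__(self, timeouts):
        self._remaining = list(timeouts)
        self.spent = not self._remaining

    def tick(self):
        """Consume one timeout; mark the schedule spent when none are left."""
        self.reschedule()
        delay = self._remaining.pop(0)
        if not self._remaining:
            self.spent = True
        return delay

    def reschedule(self):
        """Called when a datagram arrives and wants the timing to continue."""
        if self.spent:
            raise Spent("This SequentialCall has already timed out")


schedule = OneShotSchedule([1, 3])  # illustrative values, not the real ones
schedule.tick()
schedule.tick()  # last timeout consumed: the schedule is now spent

try:
    schedule.reschedule()  # a late datagram shows up after the final timeout
except Spent as exc:
    print("Spent:", exc)
```

The point of the sketch is only that a datagram handled after the final timeout has nothing left to reschedule, which would be consistent with an exception raised from the datagram-received path.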
As I watch the logs continue now, I notice that tcpdump is unhappy with the checksums as a whole. Here's the request for the cpu-arch testing program and the start of the kernel download:
10:35:22.521068 IP (tos 0x0, ttl 64, id 31, offset 0, flags [none], proto UDP (17), length 69)
10.
10:35:22.522721 IP (tos 0x0, ttl 64, id 60258, offset 0, flags [DF], proto UDP (17), length 54)
10.
10:35:22.522822 IP (tos 0x0, ttl 64, id 32, offset 0, flags [none], proto UDP (17), length 32)
10.
10:35:22.523083 IP (tos 0x0, ttl 64, id 60259, offset 0, flags [DF], proto UDP (17), length 1344)
10.
10:35:22.523235 IP (tos 0x0, ttl 64, id 33, offset 0, flags [none], proto UDP (17), length 32)
10.
10:35:22.523253 IP (tos 0x0, ttl 64, id 34, offset 0, flags [none], proto UDP (17), length 99)
10.
10:35:22.525739 IP (tos 0x0, ttl 64, id 60259, offset 0, flags [DF], proto UDP (17), length 57)
10.
10:35:22.525812 IP (tos 0x0, ttl 64, id 35, offset 0, flags [none], proto UDP (17), length 32)
10.
10:35:22.526046 IP (tos 0x0, ttl 64, id 60260, offset 0, flags [DF], proto UDP (17), length 1432)
10.
10:35:22.526203 IP (tos 0x0, ttl 64, id 36, offset 0, flags [none], proto UDP (17), length 32)
10.
Is it possible that the TFTP server in the new MAAS is doing something funny with the checksums, but that most of the time nobody verifies them?
Changed in maas:
status: New → Triaged
importance: Undecided → High
Changed in python-tx-tftp (Ubuntu):
assignee: nobody → Andres Rodriguez (andreserl)
Changed in maas:
status: Triaged → Invalid
TIL: tcpdumping TSO-enabled hardware will always show bogus checksums, because the real checksumming is done in hardware in fun zero-copy ways.
The OOPS and the 30-byte replies from the TFTP server still seem significant, however.
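For reference, the check tcpdump applies is the standard Internet checksum (RFC 1071). For UDP it is computed over a pseudo-header plus the UDP header and payload. A minimal sketch of the core sum only, with an illustrative function name and the pseudo-header construction omitted:

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum over 16-bit big-endian words, per RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Sample bytes from the worked example in RFC 1071.
sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
print(hex(checksum))

# A receiver verifies by summing data plus checksum and expecting zero.
print(internet_checksum(sample + struct.pack("!H", checksum)))
```

When checksumming is offloaded, a capture taken on the sending host sees the packet before the NIC has filled in the final value, so this verification fails in the capture even though the packet on the wire is correct.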