ironic-ipxe looking for wrong file

Bug #1990028 reported by Cristian Le
This bug affects 1 person
| Affects | Status  | Importance | Assigned to | Milestone |
|---------|---------|------------|-------------|-----------|
| Ironic  | Invalid | Undecided  | Unassigned  |           |
| tripleo | New     | Undecided  | Unassigned  |           |

Bug Description

This is a very perplexing error. Ironic-pxe-tftboot for some reason is asking for the wrong files; for example, the log file can look like this:
```
Sep 17 10:32:24 dnsmasq-tftp[2]: sent /var/lib/ironic/tftpboot/snponly.efi to 192.168.5.203
Sep 17 10:45:13 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/snponly.efiIy d not found
Sep 17 11:55:15 dnsmasq-tftp[2]: error 8 User aborted the transfer received from 192.168.5.201
Sep 17 11:55:15 dnsmasq-tftp[2]: sent /var/lib/ironic/tftpboot/snponly.efi to 192.168.5.201
Sep 17 11:55:15 dnsmasq-tftp[2]: sent /var/lib/ironic/tftpboot/snponly.efi to 192.168.5.201
Sep 17 13:01:04 dnsmasq[2]: exiting on receipt of SIGTERM
Sep 17 13:01:09 dnsmasq[2]: started, version 2.85 DNS disabled
Sep 17 13:01:09 dnsmasq[2]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Sep 17 13:01:09 dnsmasq-tftp[2]: TFTP root is /var/lib/ironic/tftpboot
Sep 17 13:50:46 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/snponly.efiIy e not found
Sep 17 13:52:46 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/snponly.efiIy e not found
Sep 17 14:37:38 dnsmasq[2]: exiting on receipt of SIGTERM
Sep 17 14:37:39 dnsmasq[2]: started, version 2.85 DNS disabled
Sep 17 14:37:39 dnsmasq[2]: compile time options: IPv6 GNU-getopt DBus no-UBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth cryptohash DNSSEC loop-detect inotify dumpfile
Sep 17 14:37:39 dnsmasq-tftp[2]: TFTP root is /var/lib/ironic/tftpboot
Sep 17 14:38:41 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/ipxe.efiIy e not found
Sep 17 14:42:34 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/snponly.efiIy e not found
Sep 17 14:46:31 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/snponly.efiIy e not found
```
The first section is from when ironic-inspector was booting the nodes, and it sometimes requested the correct file. The later section is from `cleaning`, which fails consistently. Notice the additional `Iy e` appended to the file names it looks for. Creating a file with that name does not help, though.
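To spot these corrupted requests quickly, the log can be filtered for `not found` entries whose name starts with a known boot file but carries extra trailing characters. A minimal Python sketch, assuming the dnsmasq-tftp log format shown above; the `EXPECTED` set and the function name are illustrative and not part of Ironic:

```python
import re

# Boot file names the TFTP root is expected to serve (assumption for illustration).
EXPECTED = {"snponly.efi", "ipxe.efi", "undionly.kpxe"}

# Matches dnsmasq-tftp "file <path> not found" lines as they appear in the log above.
NOT_FOUND = re.compile(r"dnsmasq-tftp\[\d+\]: file (?P<path>.+) not found$")


def corrupted_requests(lines, tftp_root="/var/lib/ironic/tftpboot"):
    """Yield (expected_name, trailing_garbage) for requests that begin with
    a known boot file name but have extra characters appended."""
    prefix = tftp_root.rstrip("/") + "/"
    for line in lines:
        match = NOT_FOUND.search(line.rstrip("\n"))
        if not match or not match["path"].startswith(prefix):
            continue
        requested = match["path"][len(prefix):]
        for name in EXPECTED:
            if requested.startswith(name) and requested != name:
                yield name, requested[len(name):]


sample = [
    "Sep 17 10:32:24 dnsmasq-tftp[2]: sent /var/lib/ironic/tftpboot/snponly.efi to 192.168.5.203",
    "Sep 17 14:38:41 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/ipxe.efiIy e not found",
    "Sep 17 14:42:34 dnsmasq-tftp[2]: file /var/lib/ironic/tftpboot/snponly.efiIy e not found",
]
hits = list(corrupted_requests(sample))
for name, garbage in hits:
    print(f"{name}: unexpected suffix {garbage!r}")
```

The suffix printed here is whatever dnsmasq wrote to its log, which may not be the raw bytes the client sent.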

Note:
- Running with tls-e and ironic-inspector

Cristian Le (lecris) wrote:

More debug data: here is the tcpdump. The top part is with ironic-inspector, and the bottom part, which is broken, is cleaning:
```
sudo tcpdump -i vlan50 port 67 or port 68 or port 69 or port 4011
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vlan50, link-type EN10MB (Ethernet), snapshot length 262144 bytes
02:44:22.791779 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 347
02:44:25.812748 IP controller-0.provisioning.openstack.lab.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300
02:44:26.646082 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 359
02:44:26.664178 IP controller-0.provisioning.openstack.lab.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 300
02:44:26.670826 IP 192.168.5.201.remote-as > saruman-0.provisioning.openstack.lab.tftp: TFTP, length 41, RRQ "snponly.efi" octet tsize 0 blksize 1468
02:44:26.763110 IP 192.168.5.201.brvread > saruman-0.provisioning.openstack.lab.tftp: TFTP, length 33, RRQ "snponly.efi" octet blksize 1468
02:44:27.592279 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:18 (oui Unknown), length 347
02:44:30.677202 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:18 (oui Unknown), length 359
02:44:30.685257 IP 192.168.5.181.imgames > saruman-0.provisioning.openstack.lab.tftp: TFTP, length 58, RRQ "snponly.efiM-}^DM-@M-(^ELy^N M-)M-~M-)M-~M-@M-(^Ed" octet tsize 0 blksize 1468
02:44:52.267802 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 391
02:44:52.268835 IP controller-0.provisioning.openstack.lab.bootps > 192.168.5.201.bootpc: BOOTP/DHCP, Reply, length 328
02:44:59.270320 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 391
02:44:59.271237 IP saruman-0.provisioning.openstack.lab.bootps > 192.168.5.201.bootpc: BOOTP/DHCP, Reply, length 328
02:45:13.331022 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 403
02:45:13.334344 IP controller-0.provisioning.openstack.lab.bootps > 192.168.5.201.bootpc: BOOTP/DHCP, Reply, length 328
```

Cristian Le (lecris) wrote:

Curiously, I tried newer hardware and it does not hit this issue, so what could be causing it on this old hardware?

affects: neutron → ironic

Cristian Le (lecris) wrote (last edit):

More verbose tcpdump data:
```
[tripleo-admin@controller-0 ~]$ sudo tcpdump -vvvv -i vlan50 port 67 or port 68 or port 69 or port 4011
dropped privs to tcpdump
tcpdump: listening on vlan50, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:41:39.857818 IP (tos 0x0, ttl 64, id 51868, offset 0, flags [none], proto UDP (17), length 375)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 347, xid 0x4371427b, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address 00:25:90:d5:bf:28 (oui Unknown)
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message (53), length 1: Discover
            MSZ (57), length 2: 1464
            Parameter-Request (55), length 35:
              Subnet-Mask (1), Time-Zone (2), Default-Gateway (3), Time-Server (4)
              IEN-Name-Server (5), Domain-Name-Server (6), Hostname (12), BS (13)
              Domain-Name (15), RP (17), EP (18), RSZ (22)
              TTL (23), BR (28), YD (40), YS (41)
              NTP (42), Vendor-Option (43), Requested-IP (50), Lease-Time (51)
              Server-ID (54), RN (58), RB (59), Vendor-Class (60)
              TFTP (66), BF (67), GUID (97), Unknown (128)
              Unknown (129), Unknown (130), Unknown (131), Unknown (132)
              Unknown (133), Unknown (134), Unknown (135)
            GUID (97), length 17: 0.0.0.0.0.0.0.0.0.0.0.0.37.144.213.191.40
            NDI (94), length 3: 1.3.16
            ARCH (93), length 2: 7
            Vendor-Class (60), length 32: "PXEClient:Arch:00007:UNDI:003016"
            END (255), length 0
10:41:43.602882 IP (tos 0x0, ttl 64, id 51869, offset 0, flags [none], proto UDP (17), length 387)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:25:90:d5:bf:28 (oui Unknown), length 359, xid 0x4371427b, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address 00:25:90:d5:bf:28 (oui Unknown)
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message (53), length 1: Request
            Server-ID (54), length 4: 192.168.5.1
            Requested-IP (50), length 4: 192.168.5.127
            MSZ (57), length 2: 65280
            Parameter-Request (55), length 35:
              Subnet-Mask (1), Time-Zone (2), Default-Gateway (3), Time-Server (4)
              IEN-Name-Server (5), Domain-Name-Server (6), Hostname (12), BS (13)
              Domain-Name (15), RP (17), EP (18), RSZ (22)
              TTL (23), BR (28), YD (40), YS (41)
              NTP (42), Vendor-Option (43), Requested-IP (50), Lease-Time (51)
              Server-ID (54), RN (58), RB (59), Vendor-Class (60)
              TFTP (66), BF (67), GUID (97), Unknown (128)
              Unknown (129), Unknown (130), Unknown (131), Unknown (132)
              Unknown (133), Unknown (134), Unknown (135)
            GUID (97), length 17: 0.0.0.0.0.0.0.0.0.0.0.0.37.144.213.191.40
            NDI (94), length 3: 1.3.16
            ARCH (93), length 2: 7
            Vendor-Class (60), length 32: "PXEClient:Arch:00007:UNDI:003016"
            END (255), length 0
10:41:43.610692 IP (tos 0...
```

Cristian Le (lecris) wrote:

The previous dump was missing the offer and ack. I had to run two tcpdumps to capture both; here they are, manually combined:
```
13:15:22.049067 00:25:90:d5:bf:28 > Broadcast, ethertype IPv4 (0x0800), length 389: (tos 0x0, ttl 64, id 36734, offset 0, flags [none], proto UDP (17), length 375)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:25:90:d5:bf:28, length 347, xid 0xe3b3075d, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address 00:25:90:d5:bf:28
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message (53), length 1: Discover
            MSZ (57), length 2: 1464
            Parameter-Request (55), length 35:
              Subnet-Mask (1), Time-Zone (2), Default-Gateway (3), Time-Server (4)
              IEN-Name-Server (5), Domain-Name-Server (6), Hostname (12), BS (13)
              Domain-Name (15), RP (17), EP (18), RSZ (22)
              TTL (23), BR (28), YD (40), YS (41)
              NTP (42), Vendor-Option (43), Requested-IP (50), Lease-Time (51)
              Server-ID (54), RN (58), RB (59), Vendor-Class (60)
              TFTP (66), BF (67), GUID (97), Unknown (128)
              Unknown (129), Unknown (130), Unknown (131), Unknown (132)
              Unknown (133), Unknown (134), Unknown (135)
            GUID (97), length 17: 0.0.0.0.0.0.0.0.0.0.0.0.37.144.213.191.40
            NDI (94), length 3: 1.3.16
            ARCH (93), length 2: 7
            Vendor-Class (60), length 32: "PXEClient:Arch:00007:UNDI:003016"
            END (255), length 0
13:15:22.051339 00:25:90:d5:bf:28 > Broadcast, ethertype IPv4 (0x0800), length 383: (tos 0x0, ttl 64, id 36734, offset 0, flags [none], proto UDP (17), length 369)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [no cksum] BOOTP/DHCP, Reply, length 341, xid 0xe3b3075d, Flags [Broadcast] (0x8000)
          Your-IP 192.168.5.139
          Server-IP 192.168.5.76
          Client-Ethernet-Address 00:25:90:d5:bf:28
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message (53), length 1: Offer
            BF (67), length 11: "snponly.efi"
            Unknown (253), length 4: 3232236876
            Classless-Static-Route (121), length 14: (169.254.169.254/32:192.168.5.100),(default:192.168.5.1)
            Domain-Name-Server (6), length 4: 10.0.1.1
            Lease-Time (51), length 4: 43200
            MTU (26), length 2: 1500
            Subnet-Mask (1), length 4: 255.255.255.0
            Default-Gateway (3), length 4: 192.168.5.1
            Server-ID (54), length 4: 192.168.5.1
            TFTP (66), length 12: "192.168.5.76"
            TFTP-Server-Address (150), length 4: 192.168.5.76
            PAD (0), length 0, occurs 4
            END (255), length 0
            PAD (0), length 0, occurs 4
13:15:25.574485 00:25:90:d5:bf:28 > Broadcast, ethertype IPv4 (0x0800), length 401: (tos 0x0, ttl 64, id 36735, offset 0, flags [none], proto UDP (17), length 387)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:25:90:d5:bf:28, length 359, xid 0xe3b3075d, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address 00:25:90:d5:bf:28
    ...
```

Cristian Le (lecris) wrote (last edit):

Here is the tcpdump on inspector which actually works.

Noteworthy: DHCP passes the boot file as "snponly.efi^@" rather than "snponly.efi", and that seems to make it bootable.

Cristian Le (lecris) wrote:

Comparing the dnsmasq configuration between neutron:
```
tag:subnet-bd64fd35-9235-448d-a1c8-69e7500b3160,option:dns-server,10.0.1.1
tag:subnet-bd64fd35-9235-448d-a1c8-69e7500b3160,option:router,192.168.5.1
tag:port-e86d7ad1-0350-4d95-9dcc-a475e13de7e3,tag:ipxe,67,http://192.168.5.76:8088/boot.ipxe
tag:port-e86d7ad1-0350-4d95-9dcc-a475e13de7e3,150,192.168.5.76
tag:port-e86d7ad1-0350-4d95-9dcc-a475e13de7e3,tag:!ipxe,67,snponly.efi
tag:port-e86d7ad1-0350-4d95-9dcc-a475e13de7e3,66,192.168.5.76
tag:port-e86d7ad1-0350-4d95-9dcc-a475e13de7e3,option:server-ip-address,192.168.5.76
tag:port-9e2f125b-2538-4dd6-8260-2f34d3f9023d,tag:!ipxe,67,snponly.efi
tag:port-9e2f125b-2538-4dd6-8260-2f34d3f9023d,tag:ipxe,67,http://192.168.5.76:8088/boot.ipxe
tag:port-9e2f125b-2538-4dd6-8260-2f34d3f9023d,option:server-ip-address,192.168.5.76
tag:port-9e2f125b-2538-4dd6-8260-2f34d3f9023d,150,192.168.5.76
tag:port-9e2f125b-2538-4dd6-8260-2f34d3f9023d,66,192.168.5.76
```
And ironic:
```
port=0
interface=vlan50

dhcp-range=192.168.5.201,192.168.5.250,10m
dhcp-option=option:router,192.168.5.1
dhcp-sequential-ip
dhcp-match=ipxe,175
dhcp-match=set:efi,option:client-arch,7
dhcp-match=set:efi,option:client-arch,9
dhcp-match=set:efi,option:client-arch,11
# dhcpv6s for Client System Architecture Type (61)
dhcp-match=set:efi6,option6:61,0007
dhcp-match=set:efi6,option6:61,0009
dhcp-match=set:efi6,option6:61,0011
dhcp-userclass=set:ipxe6,iPXE
# Client is already running iPXE; move to next stage of chainloading
dhcp-boot=tag:ipxe,http://192.168.5.76:8088/inspector.ipxe
dhcp-option=tag:ipxe6,option6:bootfile-url,http://192.168.5.76:8088/inspector.ipxe
# Client is PXE booting over EFI without iPXE ROM; send EFI version of iPXE chainloader
dhcp-boot=tag:efi,tag:!ipxe,snponly.efi
dhcp-option=tag:efi6,tag:!ipxe6,option6:bootfile-url,tftp://192.168.5.76/snponly.efi
# Client is running PXE over BIOS; send BIOS version of iPXE chainloader
dhcp-boot=undionly.kpxe,localhost.localdomain,192.168.5.76

dhcp-hostsdir=/var/lib/ironic-inspector/dhcp-hostsdir
```

Julia Kreger (juliaashleykreger) wrote:

so the tl;dr at this point:

* The static DHCP configuration via inspector *is* sending null-terminated strings, i.e. 12 chars, "snponly.efi^@", whereas the configuration coming out of neutron sends just "snponly.efi" with an eleven-character field length, which is ultimately what we tend to expect.

* The card appears to expect null-terminated fields, and the null is ultimately what prevents the firmware from requesting an invalid file. When the null is missing, the firmware doesn't find it when it goes to read the value from its own memory, presumably because it never added one itself to safely delimit the field value, so it retrieves some extra chunk of memory and sends it over the wire.

* Ultimately, this is likely a firmware issue, although the dnsmasq behavior is weird. A quick check of upstream dnsmasq code doesn't reveal any editing of the value, which makes this even more suspicious.
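The captures earlier in this report appear consistent with that reading. A hedged Python sketch: it decodes the escaped file name from the failing RRQ (tcpdump prints bytes of 0x80 and above as `M-x` and control bytes as `^X`) and parses the bytes after `snponly.efi` as DHCP options. The helper names are illustrative, and the decoder assumes the name itself contains no literal `^` or `M-`:

```python
import ipaddress


def decode_tcpdump_escapes(text: str) -> bytes:
    """Reverse tcpdump's M-/^ escaping of non-printable bytes.

    Ambiguous if the original data held a literal '^' or 'M-'."""
    out = bytearray()
    i = 0
    while i < len(text):
        high = 0
        if text.startswith("M-", i) and i + 2 < len(text):
            high = 0x80  # "M-" marks a byte with the top bit set
            i += 2
        if text[i] == "^" and i + 1 < len(text):
            out.append((ord(text[i + 1]) ^ 0x40) | high)  # control byte
            i += 2
        else:
            out.append(ord(text[i]) | high)
            i += 1
    return bytes(out)


def parse_dhcp_options(data: bytes):
    """Split bytes into (code, declared_length, value) triples.

    The last value may be shorter than declared if the data is truncated."""
    options = []
    i = 0
    while i + 2 <= len(data):
        code, length = data[i], data[i + 1]
        options.append((code, length, data[i + 2 : i + 2 + length]))
        i += 2 + length
    return options


# File name exactly as printed in the failing RRQ from the tcpdump above.
rrq_name = "snponly.efiM-}^DM-@M-(^ELy^N M-)M-~M-)M-~M-@M-(^Ed"
raw = decode_tcpdump_escapes(rrq_name)
tail = raw[len(b"snponly.efi"):]
options = parse_dhcp_options(tail)
for code, length, value in options:
    print(f"option {code}, declared length {length}, bytes seen: {value.hex()}")
server = ipaddress.IPv4Address(options[0][2])
print(server, int(server))
```

Decoded this way, the trailing bytes start with option 253 carrying 192.168.5.76 (3232236876 in the neutron offer), followed by the first nine bytes of option 121, the classless static route. In the offer's option order the next byte would be the default route's prefix length, a zero byte, which would fit firmware that scans for a NUL instead of honoring the length field.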

I'm going to ping Harald Jensas to take a look since he is aware of the inner workings of the dnsmasq code far better than I, and see if we can get a second opinion.

In the meantime, we've asked the reporter to try updating the firmware. We did manage to find the links on the Supermicro website for them, so hopefully that resolves all of this.

Harald Jensås (harald-jensas) wrote:

Sorry, it took me some time before I had a chance to look at this.

So with inspector's DHCP server we get:
  BF (67), length 12: "snponly.efi^@"

And with Neutron DHCP server we get:
  BF (67), length 11: "snponly.efi"

The option number, followed by the length (number of chars) to read for the value, is what is supposed to tell the client (firmware) the filename to download. I'm not sure where the null character is coming from in the inspector case. But the length value is correct in both cases: 12 (inspector case, with the trailing null char) vs. 11 chars in the neutron case.

   Code   Len   Bootfile name
  +-----+-----+-----+-----+-----+---
  | 67  |  n  |  c1 |  c2 |  c3 | ...
  +-----+-----+-----+-----+-----+---

With the correct length in both cases, the firmware should really be able to do the right thing.
I.e., the firmware is certainly the primary suspect here.
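To make the difference concrete, here is a small Python sketch of the code/length/value layout. It encodes option 67 both ways and compares a reader that honors the length byte with one that scans for a NUL, which is only a guess at what the affected firmware does. The follow-on option bytes are taken from the neutron offer earlier in this report (using them for the inspector case too is an assumption), and all function names are hypothetical:

```python
def encode_option(code: int, value: bytes) -> bytes:
    """Encode one DHCP option as code, length, value (RFC 2132 layout)."""
    if len(value) > 255:
        raise ValueError("option value too long for a one-byte length")
    return bytes([code, len(value)]) + value


def read_by_length(buf: bytes, offset: int = 0) -> bytes:
    """Read the option value using the declared length byte."""
    length = buf[offset + 1]
    return buf[offset + 2 : offset + 2 + length]


def read_until_nul(buf: bytes, offset: int = 0) -> bytes:
    """Ignore the length byte and read up to the first NUL, like a C string."""
    start = offset + 2
    end = buf.find(b"\x00", start)
    return buf[start:] if end == -1 else buf[start:end]


# Options that follow the boot file in the neutron offer: option 253
# (192.168.5.76) and option 121 (169.254.169.254/32 via 192.168.5.100,
# default via 192.168.5.1; the default route's prefix length is a zero byte).
following = encode_option(253, bytes([192, 168, 5, 76])) + encode_option(
    121, bytes([32, 169, 254, 169, 254, 192, 168, 5, 100, 0, 192, 168, 5, 1])
)

neutron = encode_option(67, b"snponly.efi") + following
inspector = encode_option(67, b"snponly.efi\x00") + following

print(read_by_length(neutron))    # length-aware reader, 11-byte value
print(read_until_nul(inspector))  # NUL scan stops at the trailing NUL
print(read_until_nul(neutron))    # NUL scan runs into the next options
```

With the NUL-terminated value both readers return a usable name (the length-aware one includes the trailing NUL); without it, only the length-aware reader does.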

Now, the trailing null char in inspector seems strange. Looking at tripleo-heat-templates and puppet-ironic repos I don't see where that is added.

Can you cat the file with --show-ends (-E) --show-tabs (-T) --show-nonprinting (-v) options enabled?
For example:
  cat -vET /var/lib/config-data/puppet-generated/ironic_inspector/etc/ironic-inspector/dnsmasq.conf

Is the null character in the config file?
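The same check can be scripted. A minimal Python sketch, assuming the path from the `cat` example above; `find_suspicious_bytes` is a hypothetical helper that reports NUL and other non-printable bytes (tabs and newlines excepted) with their line and column:

```python
def find_suspicious_bytes(data: bytes):
    """Return (line, column, byte) for every byte that is not printable
    ASCII, a tab, or a newline. Lines and columns are 1-based."""
    findings = []
    line, col = 1, 0
    for byte in data:
        if byte == 0x0A:  # newline: move to the next line
            line, col = line + 1, 0
            continue
        col += 1
        if byte != 0x09 and not 0x20 <= byte <= 0x7E:
            findings.append((line, col, byte))
    return findings


# In-memory samples; the second has a NUL appended to the boot file name.
clean = b"dhcp-boot=tag:efi,tag:!ipxe,snponly.efi\n"
dirty = b"port=0\ndhcp-boot=tag:efi,tag:!ipxe,snponly.efi\x00\n"

print(find_suspicious_bytes(clean))
print(find_suspicious_bytes(dirty))

# Against the real file (path taken from the cat example above):
# with open("/var/lib/config-data/puppet-generated/ironic_inspector"
#           "/etc/ironic-inspector/dnsmasq.conf", "rb") as fh:
#     print(find_suspicious_bytes(fh.read()))
```

An empty list means the file holds only printable ASCII, tabs, and newlines; carriage returns are reported too, since they would also be unexpected here.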

Cristian Le (lecris) wrote (last edit):

I did check that a while ago, and no, there were no special characters in the dnsmasq.conf. I will confirm later on once the stack is back up.

Confirmed: there are no special characters or anything like that.

Cristian Le (lecris) wrote:

Super weird update.

I have redeployed the stack, and now ironic (not inspector) is correctly serving:
 BF (67), length 12: "snponly.efi^@"

I cannot see any meaningful changes, but here are some differences:
- Now the stack has 4 ControllerNFS instead of 1 Controller
- One controller node is now a virtual machine

Jay Faulkner (jason-oldos) wrote:

Marking this as invalid. It looks like you fixed the environmental issue that caused your problem. If you believe there's still a systematic problem in Ironic, please update the bug. Thanks!

Changed in ironic:
status: New → Invalid