Comment 0 for bug 1900668

Michał Ajduk (majduk) wrote :

# ENVIRONMENT
MAAS version (SNAP):
  maas 2.8.2-8577-g.a3e674063 8980 2.8/stable canonical✓ -

  MAAS was cleanly installed. KVM POD setup works.

  MAAS status:
  bind9 RUNNING pid 9258, uptime 15:13:02
  dhcpd RUNNING pid 26173, uptime 15:09:30
  dhcpd6 STOPPED Not started
  http RUNNING pid 19526, uptime 15:10:49
  ntp RUNNING pid 27147, uptime 14:02:18
  proxy RUNNING pid 25909, uptime 15:09:33
  rackd RUNNING pid 7219, uptime 15:13:20
  regiond RUNNING pid 7221, uptime 15:13:20
  syslog RUNNING pid 19634, uptime 15:10:48

Servers:
HPE DL380 Gen10 configured to UEFI boot via PXE (PXE legacy mode), Secure boot disabled. All servers (18) experience the described problem.

The UEFI boot menu contains 2 entries allowing one to select the PXE mode:
- HPE Ethernet 1Gb 4-port 366FLR Adapter - NIC (HTTP(S) IPv4)
- HPE Ethernet 1Gb 4-port 366FLR Adapter - NIC (PXE IPv4)

# PROBLEM DESCRIPTION
Similar to https://bugs.launchpad.net/maas/+bug/1899840

PXE boot stalls after downloading grubx64.efi but before downloading grub.cfg:
2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.216.240.69
2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.216.240.69
2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.216.240.69
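
To spot affected nodes across many boots, the rackd TFTP log can be scanned for clients that requested grubx64.efi but never went on to request a grub configuration file. The following is a minimal sketch, not part of MAAS: the function name find_stalled_clients is hypothetical, the regular expression is derived only from the log lines shown above, and matching any requested file name that contains "grub.cfg" is an assumption about how the configuration request appears in the log.

```python
import re
from collections import defaultdict

# Matches rackd TFTP log lines of the form shown above, e.g.
# "... provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.216.240.69"
TFTP_LINE = re.compile(r"tftp: \[info\] (?P<file>\S+) requested by (?P<ip>\S+)")


def find_stalled_clients(log_lines):
    """Return sorted client IPs that fetched grubx64.efi but never asked for grub.cfg.

    Simplification: the whole input is treated as one window, so a node that
    stalled once and later booted successfully is not reported.
    """
    requested = defaultdict(set)
    for line in log_lines:
        match = TFTP_LINE.search(line)
        if match:
            requested[match.group("ip")].add(match.group("file"))

    stalled = []
    for ip, files in requested.items():
        got_grub = any(name.endswith("grubx64.efi") for name in files)
        # Assumption: the config request contains "grub.cfg" somewhere in its name.
        got_cfg = any("grub.cfg" in name for name in files)
        if got_grub and not got_cfg:
            stalled.append(ip)
    return sorted(stalled)


if __name__ == "__main__":
    sample = [
        "2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.216.240.69",
        "2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.216.240.69",
        "2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.216.240.69",
    ]
    print(find_stalled_clients(sample))  # prints ['10.216.240.69']
```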

Grub drops to the grub prompt.
Within the grub prompt:
- net_ls_addr shows correct IP address
- net_ls_routes shows correct routing
- net_bootps (which should initiate a DHCP request from grub) fails with the message: failed to send packet

We've also noticed that in a working scenario, grub sends an ARP request for the MAAS IP just after it starts, before downloading its configuration:
13517 2020-10-19 13:53:38.864937 HewlettP_02:3d:e8 Broadcast ARP 60 Who has 10.216.240.1? Tell 10.216.240.51
and MAAS replies.

When the boot stalls, one of the symptoms is that grub does not send the ARP request for the MAAS IP. It also does not reply to ARP requests from MAAS. It looks as if the EFI_NET stack is failing.

# WORKAROUNDS
1) During the PXE boot, send ARP requests from MAAS to query the node IP. This seems to prevent the node from losing connectivity.

Tested 4 times on independent nodes.
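
One way to script workaround 1 is to call arping in a loop from the MAAS host while the node is PXE booting. This is only a sketch under stated assumptions, not the exact procedure used in the tests above: it assumes the iputils arping binary is installed and run with sufficient privileges, and the interface name, duration and interval are hypothetical values to adjust for the environment. The function names arping_command and keep_arp_alive are illustrative.

```python
import shutil
import subprocess
import time


def arping_command(node_ip, interface, count=1):
    """Build an iputils-style arping command line (one ARP request by default)."""
    return ["arping", "-c", str(count), "-I", interface, node_ip]


def keep_arp_alive(node_ip, interface, duration_s=120.0, interval_s=1.0):
    """Send ARP requests for node_ip at a fixed interval for duration_s seconds.

    Returns the number of probes for which arping exited with status 0,
    which iputils arping uses to signal that a reply was received.
    """
    if shutil.which("arping") is None:
        raise RuntimeError("arping not found on PATH")

    replies = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            arping_command(node_ip, interface),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            replies += 1
        time.sleep(interval_s)
    return replies
```

For example, keep_arp_alive("10.216.240.69", "eth0") would probe the node from the log excerpt above for two minutes, where "eth0" stands in for whichever MAAS interface faces the PXE network.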

2) Custom built grub:
grub-mkimage -c grub.conf -o grubx64.efi -O x86_64-efi -p /grub normal configfile tftp memdisk boot diskfilter efifwsetup efi_gop efinet ls net normal part_gpt tar ext2 linuxefi http echo chain search search_fs_uuid search_label search_fs_file test tr true minicmd

Grub version: 2.02-2ubuntu8.18

The grub PXE image built in the way described above works on all nodes (18) all the time (4 times tested).

When I included the grub module linux.mod in the image, I was able to reproduce the described problem.

The issue may be related to https://savannah.gnu.org/bugs/?func=detailitem&item_id=50715