undercloud reboot can have an ironic dnsmasq failure

Bug #1615996 reported by Michele Baldessari
This bug affects 3 people
Affects   Status    Importance   Assigned to   Milestone
tripleo   Expired   Undecided    Unassigned

Bug Description

To fix certain service start failures after an undercloud reboot (https://bugzilla.redhat.com/show_bug.cgi?id=1348700), we enabled the nonlocal_bind sysctl, which works around services calling bind() on an IP address that does not exist yet. This works correctly for everything except ironic-dnsmasq. We saw this issue with nonlocal_bind enabled:

Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Starting PXE boot dnsmasq service for Ironic Inspector...
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com dnsmasq[12297]: dnsmasq: unknown interface br-ctlplane
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: openstack-ironic-inspector-dnsmasq.service: control process exited, code=exited status=2
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Failed to start PXE boot dnsmasq service for Ironic Inspector.
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Unit openstack-ironic-inspector-dnsmasq.service entered failed state.
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: openstack-ironic-inspector-dnsmasq.service failed.
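
As background for the nonlocal_bind workaround mentioned above: on Linux, bind() on an address that is not configured on any interface normally fails with EADDRNOTAVAIL, and the net.ipv4.ip_nonlocal_bind sysctl lifts that restriction. The following is a minimal Python sketch of how one might observe this, assuming a Linux host. The helper names are illustrative and not part of any TripleO tooling.

```python
import errno
import socket
from pathlib import Path

# Standard procfs location of the IPv4 sysctl on Linux.
NONLOCAL_BIND = Path("/proc/sys/net/ipv4/ip_nonlocal_bind")


def nonlocal_bind_enabled(path=NONLOCAL_BIND):
    """Return True or False for the sysctl, or None if it cannot be read."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        return None


def try_bind(address, port=0):
    """Try to bind a UDP socket; return None on success, else the errno name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((address, port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

On a host where 192.0.2.1 is not yet configured, try_bind("192.0.2.1") would be expected to return "EADDRNOTAVAIL" while the sysctl is 0 and None once it is 1. The log above shows that dnsmasq fails for a different reason, which is why the sysctl alone does not help.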

A common configuration for /etc/ironic-inspector/dnsmasq.conf is the following:
"""
port=0
interface=br-ctlplane
bind-interfaces
dhcp-range=192.0.2.100,192.0.2.120,29
enable-tftp
tftp-root=/tftpboot
dhcp-sequential-ip
dhcp-match=ipxe,175
# Client is running iPXE; move to next stage of chainloading
dhcp-boot=tag:ipxe,http://192.0.2.1:8088/inspector.ipxe
dhcp-boot=undionly.kpxe,localhost.localdomain,192.0.2.1
"""

The failure occurs because dnsmasq does not just bind() to the address; it also checks that the configured interface exists. See src/dnsmasq.c (OPT_NOWILD is set when bind-interfaces is given):
"""
...
if (option_bool(OPT_NOWILD) || option_bool(OPT_CLEVERBIND))
  {
    create_bound_listeners(1);

    if (!option_bool(OPT_CLEVERBIND))
      for (if_tmp = daemon->if_names; if_tmp; if_tmp = if_tmp->next)
        if (if_tmp->name && !if_tmp->used)
          die(_("unknown interface %s"), if_tmp->name, EC_BADNET);
"""

So a correct fix here is to remove the bind-interfaces option and add the bind-dynamic option instead.
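
To illustrate why nonlocal_bind does not help here, the following Python sketch models the startup check quoted from src/dnsmasq.c. It is a simplified model and not dnsmasq's code: the function name and arguments are hypothetical, and "the interface is present" stands in for dnsmasq's internal `used` flag.

```python
def startup_check(if_names, present, bind_interfaces=False, bind_dynamic=False):
    """Model of the quoted check: return an error string, or None if startup proceeds.

    if_names: interface names taken from interface= lines
    present:  names of interfaces that exist at startup (stands in for `used`)
    """
    if bind_interfaces or bind_dynamic:      # OPT_NOWILD || OPT_CLEVERBIND
        if not bind_dynamic:                 # !OPT_CLEVERBIND
            for name in if_names:
                if name and name not in present:
                    return f"unknown interface {name}"
    return None


# br-ctlplane does not exist yet right after boot.
early_boot = {"lo", "eth0"}
print(startup_check(["br-ctlplane"], early_boot, bind_interfaces=True))
# prints: unknown interface br-ctlplane
print(startup_check(["br-ctlplane"], early_boot, bind_dynamic=True))
# prints: None
```

In this model the interface-name check is skipped whenever bind-dynamic is set, which matches the `!option_bool(OPT_CLEVERBIND)` guard in the excerpt.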

Michele Baldessari (michele) wrote :

Before we switch puppet-ironic to bind-dynamic, we should verify that dnsmasq won't automatically listen on any interface that is added after it has started. The man page seems to imply that it would:

--bind-dynamic
Enable a network mode which is a hybrid between --bind-interfaces and the
default. Dnsmasq binds the address of individual interfaces, allowing multiple
dnsmasq instances, but if new interfaces or addresses appear, it automatically
listens on those (subject to any access-control configuration). This makes
dynamically created interfaces work in the same way as the default.
Implementing this option requires non-standard networking APIs and it is only
available under Linux. On other platforms it falls-back to --bind-interfaces
mode.

If that is confirmed, I am not sure what the best fix could be.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → ongoing
Emilien Macchi (emilienm) wrote :

This bug was last updated over 180 days ago. As tripleo is a fast-moving project and we'd like to get the tracker down to currently actionable bugs, this is getting marked as Invalid. If the issue still exists, please feel free to reopen it.

Changed in tripleo:
status: Triaged → Invalid
DanCreed (dan-creed) wrote :

I can confirm that with the latest delorean repos, this prevents the ironic_inspector_dnsmasq container from starting. bind-dynamic does need to be added to the dnsmasq.conf for this container, as below:

port=0
except-interface=lo
bind-dynamic
interface=br-ctlplane

log-dhcp
log-queries

dhcp-range=set:ctlplane-subnet,192.168.24.100,192.168.24.120,255.255.255.0,10m
dhcp-option=tag:ctlplane-subnet,option:router,192.168.24.1
dhcp-sequential-ip
dhcp-match=ipxe,175
dhcp-match=set:efi,option:client-arch,7
dhcp-match=set:efi,option:client-arch,9
dhcp-match=set:efi,option:client-arch,11
# Client is already running iPXE; move to next stage of chainloading
dhcp-boot=tag:ipxe,http://192.168.24.1:8088/inspector.ipxe
# Client is PXE booting over EFI without iPXE ROM; send EFI version of iPXE chainloader
dhcp-boot=tag:efi,tag:!ipxe,ipxe.efi
# Client is running PXE over BIOS; send BIOS version of iPXE chainloader
dhcp-boot=undionly.kpxe,localhost.localdomain,192.168.24.1

dhcp-hostsdir=/var/lib/ironic-inspector/dhcp-hostsdir
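
The switch from bind-interfaces to bind-dynamic discussed in this report could be applied to an existing dnsmasq.conf with a small script. The following Python sketch shows one way to do it, under the assumption that each option sits on its own line as in the examples here. The function name is hypothetical.

```python
def use_bind_dynamic(conf_text):
    """Return conf_text with bind-interfaces replaced by a single bind-dynamic.

    If neither option is present, bind-dynamic is appended at the end.
    Comments and all other lines are left unchanged.
    """
    out = []
    has_dynamic = False
    for line in conf_text.splitlines():
        option = line.strip()
        if option in ("bind-interfaces", "bind-dynamic"):
            if has_dynamic:
                continue  # keep only one bind-dynamic line
            out.append("bind-dynamic")
            has_dynamic = True
        else:
            out.append(line)
    if not has_dynamic:
        out.append("bind-dynamic")
    return "\n".join(out) + "\n"


old = "port=0\ninterface=br-ctlplane\nbind-interfaces\n"
print(use_bind_dynamic(old), end="")
```

The option is replaced in place rather than appended so that the rest of the file keeps its original order, which makes the resulting diff easy to review.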

Changed in tripleo:
status: Invalid → Confirmed
Changed in tripleo:
milestone: none → stein-2
status: Confirmed → Triaged
Bob Fournier (bfournie) wrote :

Dan - can you provide logs showing the failure before adding bind-dynamic? Which version are you testing?

Emilien Macchi (emilienm) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (FUTURE, PIKE, QUEENS, ROCKY, STEIN).
  Valid example: CONFIRMED FOR: FUTURE

Changed in tripleo:
importance: High → Undecided
status: Triaged → Expired