Five minute delay DHCP'ing isolated nics

Bug #1653812 reported by Ben Nemec
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Ben Nemec

Bug Description

A new variation on https://bugs.launchpad.net/tripleo/+bug/1626673

This time it looks like it's the network systemd target that is taking 5 minutes. Because we set ONBOOT=yes in the nic config files, the network target (which runs after the dhcp-all-interfaces services) tries to DHCP the isolated nics again and fails just like dhcp-all-interfaces did. However, because we set the timeout only in the dhcp-all-interfaces service file, the network target still waits for the full 5 minutes before allowing boot to continue.

Note that as before, this is costing us a non-trivial amount of time in both CI and real net-iso deployments. I've definitely seen this when booting the overcloud-full image, but if it's like before then we're losing 5 minutes every time the IPA ramdisk boots too, which means this would cost us 15 minutes per non-HA deployment (5 for introspection, 5 for deploy, 5 for image boot).

I was hoping that since we add explicit service files for all the interfaces, maybe we wouldn't need ONBOOT, but it appears that the udev rule doesn't get applied on reboot, so that is breaking networking in my tests. I need to investigate further with an image that I can log in to if networking breaks.

Tags: networking
Changed in tripleo:
milestone: none → ocata-3
tags: added: networking
Dan Prince (dan-prince) wrote :

At configuration time (in t-h-t), could we update the configured nics accordingly then? Perhaps by changing the ONBOOT setting at configuration time so that reboot is happy.

Ben Nemec (bnemec) wrote :

That might be an option, but it occurred to me that the only reason I had trouble on reboot was that I'm using a net-iso nic setup on my VMs, but not actually using net-iso. If net-iso were in use then all of the nics would be statically configured on initial deploy and the reboot problem would go away.

The approach I'm currently investigating is to reorder the service dependency so that dhcp-all-interfaces runs after network but before network-online. That way network will run, won't do anything with the interfaces since they aren't configured yet, but they'll get brought up by dhcp-all-interfaces before we hit network-online and boot continues to network-requiring services. The only major drawback I can see is that we'd still have a 5 minute delay on reboot in the "many nics, but not using net-iso" case, but that's not a production-ready configuration anyway so it should only be happening in dev and test where we don't tend to reboot anyway.
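The ordering described above can be modeled as a small dependency graph. This is only an illustrative Python sketch, not the actual systemd unit files: the unit names `dhcp-interface@eth1.service` and `needs-network.service` are assumed for the example, and each entry lists the units that must be reached before it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the set of units that must come before it.
# Unit names are hypothetical stand-ins for the real ones.
ordering = {
    # The DHCP unit runs after network.target ...
    "dhcp-interface@eth1.service": {"network.target"},
    # ... but must finish before network-online.target is reached.
    "network-online.target": {"network.target", "dhcp-interface@eth1.service"},
    # Services that need a working network wait for network-online.target.
    "needs-network.service": {"network-online.target"},
}

boot_order = list(TopologicalSorter(ordering).static_order())
for unit in boot_order:
    print(unit)
```

Because the constraints form a single chain, the only valid order is network.target, then the DHCP unit, then network-online.target, then the network-requiring service, which is the sequence the proposal relies on.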

Ben Nemec (bnemec) wrote :

# network.target starts before interfaces are configured and basically does nothing
Jan 4 16:06:57 localhost systemd: Reached target Network.
Jan 4 16:06:57 localhost systemd: Starting Network.
# dhcp-all-interfaces brings up the nics
Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth3...
Jan 4 16:06:57 localhost systemd: Starting System Logging Service...
Jan 4 16:06:57 localhost systemd: Starting Dynamic System Tuning Daemon...
Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth1...
Jan 4 16:06:57 localhost systemd: Starting Logout off all iSCSI sessions on shutdown...
Jan 4 16:06:57 localhost systemd: Starting OpenSSH server daemon...
Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth2...
Jan 4 16:06:57 localhost systemd: Starting Xinetd A Powerful Replacement For Inetd...
Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth4...
Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth0...
Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth5...
Jan 4 16:06:57 localhost systemd: Starting Postfix Mail Transport Agent...
Jan 4 16:06:57 localhost systemd: Starting Open vSwitch...
Jan 4 16:06:57 localhost systemd: Starting Notify NFS peers of a restart...
Jan 4 16:06:57 localhost systemd: Starting Dynamic Login...
Jan 4 16:06:57 localhost dhcp-all-interfaces.sh: Inspecting interface: eth2...Configured eth2
Jan 4 16:06:57 localhost systemd: Started System Logging Service.
Jan 4 16:06:57 localhost sm-notify[851]: Version 1.3.0 starting
Jan 4 16:06:57 localhost dhcp-all-interfaces.sh: Inspecting interface: eth4...Configured eth4
Jan 4 16:06:57 localhost dhcp-all-interfaces.sh: Inspecting interface: eth3...Configured eth3
Jan 4 16:06:57 localhost systemd: Started Logout off all iSCSI sessions on shutdown.
Jan 4 16:06:57 localhost dhcp-all-interfaces.sh: Inspecting interface: eth0...Configured eth0
Jan 4 16:06:57 localhost xinetd[871]: xinetd Version 2.3.15 started with libwrap loadavg labeled-networking options compiled in.
Jan 4 16:06:57 localhost xinetd[871]: Started working: 0 available services
Jan 4 16:06:57 localhost dhcp-all-interfaces.sh: Inspecting interface: eth1...Configured eth1
Jan 4 16:06:57 localhost dhcp-all-interfaces.sh: Inspecting interface: eth5...Configured eth5

...

# The nics that don't get DHCP time out as expected
Jan 4 16:07:27 localhost systemd: <email address hidden> start operation timed out. Terminating.
Jan 4 16:07:27 localhost systemd: <email address hidden> start operation timed out. Terminating.
Jan 4 16:07:27 localhost systemd: <email address hidden> start operation timed out. Terminating.
Jan 4 16:07:27 localhost systemd: <email address hidden> start operation timed out. Terminating.
Jan 4 16:07:27 localhost systemd: <email address hidden> start operation timed out. Terminating.
Jan 4 16:07:27 localhost ifup: Determining IP information for eth2...
Jan 4 16:07:27 localhost ifup: Determining IP information for eth4...
Jan 4 16:07:27 localhost ifup: Determining IP information for eth3...
Jan 4 16:07:27 localhost ifup: Determining IP information for eth5...
Jan 4 16:07:27 localhost ifup: Determining ...
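As a rough check on the excerpt above, the gap between the DHCP units starting and their start operations timing out can be computed from the timestamps. The sketch below is a minimal Python example; the helper `clock_time` is a name introduced here, and it compares only the time of day, which assumes both lines were logged on the same day (as they are in this excerpt).

```python
from datetime import datetime, timedelta


def clock_time(line: str) -> datetime:
    """Parse the HH:MM:SS field (third token) of a syslog-style line."""
    return datetime.strptime(line.split()[2], "%H:%M:%S")


# Two lines copied from the log excerpt.
started = "Jan 4 16:06:57 localhost systemd: Starting DHCP interface eth3..."
timed_out = (
    "Jan 4 16:07:27 localhost systemd: <email address hidden> "
    "start operation timed out. Terminating."
)

delay: timedelta = clock_time(timed_out) - clock_time(started)
print(f"{delay.total_seconds():.0f} seconds between start and timeout")
```

For these two lines the difference is 30 seconds, so in this boot the interfaces that get no DHCP response give up after about half a minute rather than five minutes. The configured timeout value itself is not shown in the excerpt.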


Ben Nemec (bnemec) wrote :
Changed in tripleo:
status: Triaged → In Progress
Steven Hardy (shardy) wrote :

https://review.openstack.org/#/c/416664/ landed, can we close this now?

Ben Nemec (bnemec) wrote :

Yep, I've confirmed that the 5 minute delay is gone.

Changed in tripleo:
status: In Progress → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released