DHCP discover retries are too few and too seldom

Bug #1273159 reported by Darragh O'Reilly
This bug affects 4 people
Affects: CirrOS
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

The 0.3.1 image runs the DHCP client with the following parameters by default:

$ ps -ef | grep dhcp
  227 root udhcpc -R -n --timeout=60 -p /var/run/udhcpc.eth0.pid -i eth0

It only sends up to 3 DHCP discover packets, with a 60-second pause between them. So when there are no DHCP replies from a server, for whatever reason, it goes like this:

t=0 send and wait 60
t=60 send and wait 60
t=120 send and give up; if this packet is not answered, the instance will never work without manual intervention.

A more reasonable default would be to retry 100 times with just 5-second pauses between attempts.
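To make the comparison concrete, here is a minimal sketch of the two send schedules. It assumes packets go out at exactly fixed intervals starting at t=0, which simplifies what udhcpc really does; `discover_schedule` is a hypothetical helper, not part of any tool.

```python
def discover_schedule(retries, timeout):
    """Times (seconds since start) at which discover packets are sent.

    Assumption: one packet every `timeout` seconds, starting at t=0.
    """
    return [i * timeout for i in range(retries)]


default = discover_schedule(3, 60)     # behaviour described in this report
proposed = discover_schedule(100, 5)   # proposed: -t 100 -T 5

print(default)                             # [0, 60, 120]
print(proposed[:4], "...", proposed[-1])   # [0, 5, 10, 15] ... 495
```

Under these assumptions the proposed settings keep trying for roughly eight minutes, and any 60-second outage is covered by about a dozen attempts rather than one.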

This can be worked around by adding this line to the eth0 stanza of /etc/network/interfaces:

udhcpc_opts -t 100 -T 5

and making a snapshot and booting from the snapshot instead.
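If the edit has to be applied to many images, it could be scripted. The following is only a sketch: `add_udhcpc_opts` is a hypothetical helper that works on the file's text, assumes a stanza header of the form `iface eth0 inet dhcp`, and does not check whether a `udhcpc_opts` line is already present.

```python
def add_udhcpc_opts(interfaces_text, iface="eth0", opts="-t 100 -T 5"):
    """Return interfaces_text with a udhcpc_opts line added to iface's stanza."""
    out = []
    for line in interfaces_text.splitlines():
        out.append(line)
        fields = line.split()
        # Match the DHCP stanza header, e.g. "iface eth0 inet dhcp".
        if fields[:2] == ["iface", iface] and fields[-1:] == ["dhcp"]:
            out.append(f"udhcpc_opts {opts}")
    return "\n".join(out) + "\n"


sample = "auto lo\niface lo inet loopback\n\nauto eth0\niface eth0 inet dhcp\n"
print(add_udhcpc_opts(sample), end="")
```

Working on a string rather than the file itself makes it easy to inspect the result before writing it back into the image.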

Darragh O'Reilly (darragh-oreilly) wrote :

$ udhcpc -?
udhcpc: invalid option -- ?
BusyBox v1.20.1 (2013-02-08 03:29:16 UTC) multi-call binary.

Usage: udhcpc [-fbnqoCRB] [-i IFACE] [-r IP] [-s PROG] [-p PIDFILE]
 [-V VENDOR] [-x OPT:VAL]... [-O OPT]...

 -i,--interface IFACE Interface to use (default eth0)
 -p,--pidfile FILE Create pidfile
 -s,--script PROG Run PROG at DHCP events (default /usr/share/udhcpc/default.script)
 -B,--broadcast Request broadcast replies
 -t,--retries N Send up to N discover packets
 -T,--timeout N Pause between packets (default 3 seconds)
 -A,--tryagain N Wait N seconds after failure (default 20)
 -f,--foreground Run in foreground
 -b,--background Background if lease is not obtained
 -n,--now Exit if lease is not obtained
 -q,--quit Exit after obtaining lease
 -R,--release Release IP on exit
 -S,--syslog Log to syslog too
 -a,--arping Use arping to validate offered address
 -O,--request-option OPT Request option OPT from server (cumulative)
 -o,--no-default-options Don't request any options (unless -O is given)
 -r,--request IP Request this IP address
 -x OPT:VAL Include option OPT in sent packets (cumulative)
    Examples of string, numeric, and hex byte opts:
    -x hostname:bbox - option 12
    -x lease:3600 - option 51 (lease time)
    -x 0x3d:0100BEEFC0FFEE - option 61 (client id)
 -F,--fqdn NAME Ask server to update DNS mapping for NAME
 -V,--vendorclass VENDOR Vendor identifier (default 'udhcp VERSION')
 -C,--clientid-none Don't send MAC as client identifier
Signals:
 USR1 Renew lease
 USR2 Release lease

Harm Weites (harmw) wrote :

If my DHCP server actually needed that much time to hand out an IP, I'd rather fix the server end than wait ages for cirros, which would probably not receive an IP at all anyway.

I'd say give cirros a reasonable default (10 seconds?) in which to set up DHCP. If it does not succeed within that timeframe, it should keep trying for another x seconds in the background and then just give up, since something is almost certainly broken :)

In addition to this, cirros also waits quite long for metadata responses. That's perhaps something to change as well. Having to wait 20 iterations is pointless. I'd say a max of 5 tries would suffice, unless there are clouds that really only respond on the 20th try :)

Darragh O'Reilly (darragh-oreilly) wrote :

The issue was seen in OpenStack CI testing when there were a lot of concurrent requests to be processed. It took the various parts of OpenStack networking about 20-30 seconds to get the wiring (VLAN tags and flows on vswitches) set up between the DHCP server and the Cirros VM. This is much too long and is being addressed. But this delay meant that connectivity was not in place when Cirros sent the first DHCP request. The test itself timed out after 60 seconds, not long enough for the second request, and the test was marked as a failure. This test timeout was also too short and has been changed to 120 seconds. Anyway, I think the Cirros image should have a shorter retry interval, something more in line with full-blown servers.
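The arithmetic behind this failure can be sketched as follows. The model is an assumption: packets are sent at fixed intervals from t=0, and any packet sent once the wiring is ready is answered immediately. `first_answered_send` is a hypothetical helper.

```python
def first_answered_send(ready_at, retries, interval):
    """Time of the first discover sent at or after `ready_at`, else None.

    Assumption: sends happen at 0, interval, 2*interval, ... and a packet
    is answered as soon as connectivity exists.
    """
    for i in range(retries):
        t = i * interval
        if t >= ready_at:
            return t
    return None


# Wiring ready after 30 seconds, as in the CI runs described above.
print(first_answered_send(30, 3, 60))    # 60
print(first_answered_send(30, 100, 5))   # 30
```

With the default settings the first useful packet goes out at t=60, which is already at the edge of the original 60-second test timeout; with a 5-second interval it goes out as soon as the wiring is in place.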

Also, the metadata request is retried even if no IP address was received.

Scott Moser (smoser) wrote :

hm... your comment about 'udhcpc_opts', have you actually seen that work?
i'd love it if it did, but:

$ cat /etc/network/interfaces
# Configure Loopback
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp
udhcpc_opts -t 100 -T 5
$ ps axw | grep udhcp
   79 root udhcpc -R -n --timeout=60 -O staticroutes -p /var/run/udhcpc.eth0.pid -i eth0 -t 100 -T 5
  145 cirros grep udhcp

so it doesn't seem to.

Changed in cirros:
status: New → Triaged
importance: Undecided → Medium

Darragh O'Reilly (darragh-oreilly) wrote :

hi, I couldn't remember, so I tried it again and it works fine.

I killed dnsmasq and started tcpdump on the Linux bridge the VM would be connected to. Then I booted off a snapshot with the modified /etc/network/interfaces and could see the retries every 5 seconds, and it gave up after 99 tries. I have attached the dump.

Scott Moser (smoser) wrote :

With the feature added in 0.3.3 to customize options given to udhcpc we can much more easily control things like this now.

rick jones (perfgeek) wrote :

Horses and barn doors since this is an old bug, but I thought I'd weigh in with an opinion that a fixed 5-second timeout isn't really such a good idea to configure. Imagine a scenario akin to OpenStack where there may be hundreds if not thousands of instances attempting to get their DHCP information. With the time it can take to set up the plumbing, and the possibility of all those queries really bogging down the dnsmasq or spiking its own plumbing, it would be better to have a backoff on the timeouts akin to what TCP does to avoid congestive collapse. So start at some reasonable (default) minimum and double it on each retry. Or at least increase the timeout by some quantity on each retry if you cannot stomach doubling it.
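A doubling backoff of this kind could be sketched as below. This is purely illustrative: the udhcpc usage text quoted earlier in this report shows no backoff option, and `backoff_timeouts`, its `factor` and its `cap` are hypothetical names. The starting value of 3 seconds is borrowed from the documented `-T` default.

```python
def backoff_timeouts(initial, retries, factor=2, cap=None):
    """Pause (seconds) to wait after each discover packet.

    Starts at `initial` and multiplies by `factor` on every retry,
    optionally limited to `cap` seconds.
    """
    timeouts = []
    t = initial
    for _ in range(retries):
        timeouts.append(t if cap is None else min(t, cap))
        t *= factor
    return timeouts


print(backoff_timeouts(3, 6))           # [3, 6, 12, 24, 48, 96]
print(backoff_timeouts(3, 6, cap=30))   # [3, 6, 12, 24, 30, 30]
```

A cap keeps the later pauses from growing so long that an instance sits idle well after the network has recovered.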

Nadav Goldin (ngoldin) wrote :

I see the same issue. At first I thought this was due to libvirt:
https://bugzilla.redhat.com/show_bug.cgi?id=1411025
But as implied there, simply increasing the number of attempts or running the DHCP client again resolves the issue.
Can anyone confirm whether this was fixed in the latest images, or whether it requires updating the udhcpc configuration in the image?

kirandevraaj (kirandevraaj) wrote :

I was hit by this issue with the cirros 0.3.5 image. I ran tests with OpenStack Rally: the first 10 iterations of server creation were successful, and the remaining 20 failed because they did not get an IP lease from the DHCP server. The instances wait 60 seconds between attempts, with 3 retries.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

The following bug is somewhat related: https://bugs.launchpad.net/cirros/+bug/1768955. The reference to libvirt in the comments here is spot on, because apparently different interface model types have different link initialization characteristics that may affect DHCP negotiation.
