dhclient: thread concurrency race leads to DHCPOFFER packets not being received
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
bind9-libs (Ubuntu) |
Won't Fix
|
Undecided
|
Unassigned | ||
isc-dhcp (Ubuntu) |
Invalid
|
Undecided
|
Unassigned | ||
Focal |
Fix Released
|
High
|
Matthew Ruffell | ||
Jammy |
Fix Released
|
High
|
Mauricio Faria de Oliveira |
Bug Description
[Impact]
* Occasionally, during instance boot or machine start-up,
dhclient will attempt to acquire a dhcp lease and fail,
leaving the instance with no IP address and making it
unreachable.
* This happens about once every 100 reboots on bare metal,
or affecting between ~0.3% to 2% of deployments on Azure
(comment #2).
* Azure uses dhclient called from cloud-init instead of
systemd-
deployments.
* The logs of an affected dhclient produce the following:
Listening on LPF/enp1s0/
Sending on LPF/enp1s0/
Sending on Socket/fallback
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
...
(omitting 20 similar lines)
...
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
No DHCPOFFERS received.
No working leases in persistent database - sleeping.
* This only impacts Focal and Jammy, where bind9-libs
are multi-threaded (Bionic/earlier and Kinetic/later
are single-threaded).
* The actual problem is dhclient containing a thread
concurrency race condition, and when the race occurs,
the read socket is incorrectly/
because required structures are not yet consistent,
thus dhclient does not read any DHCPOFFER replies.
* Detailed analysis of the issue is in comment #17.
[Fix]
* Prevent the race condition by starting to watch the
read socket after required structures are consistent.
* The fix has been tested in Azure w/ 13500 instances,
and no errors have been observed (previously: 0.4%).
* Anyway, in case regressions are observed, the patch
introduces 2 switches to revert to previous behavior,
which can be applied per-process or system-wide:
- DHCP_FD_
- dhcp.fd_
* (Previous approaches/
bind9-libs to single-threaded, but we concluded it
would have more regression risk than the expected
[some bits in comment #8, and some internal chat],
and remove exported symbols (apparently unused, but).
We also considered a mutex/spinlock approach, but
later found a simpler way w/ isc lib; comment #13.)
[Test Plan]
* Synthetic reproducer with GDB to force the race
condition, and DHCP server/client/noise injection
is described in comment #9.
* Test with the original package (problem occurs).
* Test with the modified package (problem fixed).
- Set DHCP_FD_
- Set dhcp.fd_
[Regression Potential]
* 1) dhclient failing to acquire DHCP leases.
* 2) dhcpd is also affected by code changes,
thus failures to handle DHCP lease requests
also have potential for regressions.
* 3) the functional change added by the fix,
if a regression were to occur, would likely
be an issue only under some (unknown) race
condition as well, thus expected to be rare.
* Note: this potentially affects Focal/Jammy
on Azure as a whole, per usage of dhclient
in cloud-init instead of systemd-networkd.
Azure provided extensive testing for all 3
approaches (mostly internal communications,
and some bug comments), with ~13k instances.
No issues were observed (previously: 0.4%).
* Such testing scale seems to indicate that
there are no regressions for dhclient to
acquire DHCP leases (1), nor another race
condition that hit the fix/new behavior (3).
With that, apparently (2) should be OK too.
* Also, so to mitigate the regression risk
as much as possible, there's very detailed
analysis provided here (comments #17, #18)
and more information about the fix in its
patch file's comment.
description: | updated |
summary: |
- dhclient doesn't receive dhcp offer from kernel + dhclient: thread concurrency race leads to DHCPOFFER packets not being + received |
tags: | added: sts-sponsor |
tags: | removed: sts-sponsor |
description: | updated |
no longer affects: | bind9-libs (Ubuntu Focal) |
no longer affects: | bind9-libs (Ubuntu Jammy) |
Changed in bind9-libs (Ubuntu): | |
status: | Fix Released → Won't Fix |
no longer affects: | bind9-libs (Ubuntu Focal) |
no longer affects: | bind9-libs (Ubuntu Jammy) |
Changed in isc-dhcp (Ubuntu Focal): | |
status: | New → In Progress |
Changed in isc-dhcp (Ubuntu Jammy): | |
status: | New → In Progress |
Changed in isc-dhcp (Ubuntu Focal): | |
importance: | Undecided → High |
Changed in isc-dhcp (Ubuntu Jammy): | |
importance: | Undecided → High |
Changed in isc-dhcp (Ubuntu Focal): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in isc-dhcp (Ubuntu Jammy): | |
assignee: | nobody → Mauricio Faria de Oliveira (mfo) |
tags: | added: se-sru-sponsor-mfo |
Changed in isc-dhcp (Ubuntu Focal): | |
status: | Fix Committed → Fix Released |
status: | Fix Released → Fix Committed |
We've been investigating a similar issue in Ubuntu 20.04 (and now 22.04) on Azure where Running PPS re-use fails to perform DHCP for 5 minutes when dhclient is invoked by cloud-init. dhclient is run by cloud-init, but sees no DHCPOFFER. It varies due to unknown reasons but it has affected a ~0.3-2% of deployments in this scenario over time.
We instrumented our images to capture network traffic and see what is happening and sure enough DHCP offers are coming through to the guest by dhclient doesn't see them. We instrumented dhclient and the "got_one()" callback is never invoked in these failures.
18.04 does not have this issue.
This behavior can be reproduced multiple ways: .sh). It may take a while, but even this simple test will reproduce this behavior in ~50k runs for me in an LXD VM.
- Reproduce similar test environment to above scenario using cloud-init (switch hyperv nic to a different vnet while waiting the link status to reset, then perform dhcp). This test case will reproduce in ~1,500 runs, though it varies and requires more complex setup.
- Repeatedly run dhclient in a loop until it fails (see test-sequential
- Simply launch instances of dhclient in parallel (see test-parallel.sh). There is an excellent chance at least one of those dhclients will fail this way.
I noticed the uprev of bind9 libs in focal: 16+dfsg- 3~build1 16+dfsg- 3~ubuntu1 19+dfsg- 2.1ubuntu1 19+dfsg- 2.1ubuntu3 19+dfsg- 2.1ubuntu3
focal (net): 1:9.11.
focal-updates (net): 1:9.11.
impish (net): 1:9.11.
jammy (net): 1:9.11.
kinetic (net): 1:9.11.
I couldn't find any related issue on the isc-dhcp tracker, etc. I did build dhclient from the Debian master branch (https:/ /salsa. debian. org/debian/ isc-dhcp/ -/commits/ master/ debian) which uses the in-tree bind libs and that seems to have addressed the issue for all scenarios. Not that it helps much to bisect this just yet.