dhclient: thread concurrency race leads to DHCPOFFER packets not being received

Bug #1926139 reported by Martijn van Duren
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
bind9-libs (Ubuntu)
Won't Fix
Undecided
Unassigned
isc-dhcp (Ubuntu)
Invalid
Undecided
Unassigned
Focal
Fix Released
High
Matthew Ruffell
Jammy
Fix Released
High
Mauricio Faria de Oliveira

Bug Description

[Impact]

 * Occasionally, during instance boot or machine start-up,
   dhclient will attempt to acquire a dhcp lease and fail,
   leaving the instance with no IP address and making it
   unreachable.

 * This happens about once every 100 reboots on bare metal,
   or affecting between ~0.3% to 2% of deployments on Azure
   (comment #2).

 * Azure uses dhclient called from cloud-init instead of
   systemd-networkd, and this is causing issues with larger
   deployments.

 * The logs of an affected dhclient produce the following:

   Listening on LPF/enp1s0/52:54:00:1c:d7:00
   Sending on LPF/enp1s0/52:54:00:1c:d7:00
   Sending on Socket/fallback
   DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
   DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
   ...
   (omitting 20 similar lines)
   ...
   DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
   DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
   DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
   No DHCPOFFERS received.
   No working leases in persistent database - sleeping.

 * This only impacts Focal and Jammy, where bind9-libs
   are multi-threaded (Bionic/earlier and Kinetic/later
   are single-threaded).

 * The actual problem is dhclient containing a thread
   concurrency race condition, and when the race occurs,
   the read socket is incorrectly/prematurely unwatched
   because required structures are not yet consistent,
   thus dhclient does not read any DHCPOFFER replies.

 * Detailed analysis of the issue is in comment #17.

[Fix]

 * Prevent the race condition by starting to watch the
   read socket after required structures are consistent.

 * The fix has been tested in Azure w/ 13500 instances,
   and no errors have been observed (previously: 0.4%).

 * Anyway, in case regressions are observed, the patch
   introduces 2 switches to revert to previous behavior,
   which can be applied per-process or system-wide:
   - DHCP_FD_FLAGS_POKE=0 environment variable
   - dhcp.fd_flags_poke=0 kernel cmdline option

 * (Previous approaches/discussions included reverting
    bind9-libs to single-threaded, but we concluded it
    would have more regression risk than the expected
    [some bits in comment #8, and some internal chat],
    and remove exported symbols (apparently unused, but).
    We also considered a mutex/spinlock approach, but
    later found a simpler way w/ isc lib; comment #13.)

[Test Plan]

 * Synthetic reproducer with GDB to force the race
   condition, and DHCP server/client/noise injection
   is described in comment #9.

 * Test with the original package (problem occurs).

 * Test with the modified package (problem fixed).
   - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
   - Set dhcp.fd_flags_poke=0 (problem occurs).

[Regression Potential]

 * 1) dhclient failing to acquire DHCP leases.

 * 2) dhcpd is also affected by code changes,
   thus failures to handle DHCP lease requests
   also have potential for regressions.

 * 3) the functional change added by the fix,
   if a regression were to occur, would likely
   be an issue only under some (unknown) race
   condition as well, thus expected to be rare.

 * Note: this potentially affects Focal/Jammy
   on Azure as a whole, per usage of dhclient
   in cloud-init instead of systemd-networkd.

   Azure provided extensive testing for all 3
   approaches (mostly internal communications,
   and some bug comments), with ~13k instances.

   No issues were observed (previously: 0.4%).

 * Such testing scale seems to indicate that
   there are no regressions for dhclient to
   acquire DHCP leases (1), nor another race
   condition that hit the fix/new behavior (3).

   With that, apparently (2) should be OK too.

 * Also, so to mitigate the regression risk
   as much as possible, there's very detailed
   analysis provided here (comments #17, #18)
   and more information about the fix in its
   patch file's comment.

Revision history for this message
Martijn van Duren (cloudvps-martijn) wrote :
description: updated
Revision history for this message
Chris Patterson (cjp256) wrote :

We've been investigating a similar issue in Ubuntu 20.04 (and now 22.04) on Azure where Running PPS re-use fails to perform DHCP for 5 minutes when dhclient is invoked by cloud-init. dhclient is run by cloud-init, but sees no DHCPOFFER. It varies due to unknown reasons but it has affected a ~0.3-2% of deployments in this scenario over time.

We instrumented our images to capture network traffic and see what is happening and sure enough DHCP offers are coming through to the guest by dhclient doesn't see them. We instrumented dhclient and the "got_one()" callback is never invoked in these failures.

18.04 does not have this issue.

This behavior can be reproduced multiple ways:
- Reproduce similar test environment to above scenario using cloud-init (switch hyperv nic to a different vnet while waiting the link status to reset, then perform dhcp). This test case will reproduce in ~1,500 runs, though it varies and requires more complex setup.
- Repeatedly run dhclient in a loop until it fails (see test-sequential.sh). It may take a while, but even this simple test will reproduce this behavior in ~50k runs for me in an LXD VM.
- Simply launch instances of dhclient in parallel (see test-parallel.sh). There is an excellent chance at least one of those dhclients will fail this way.

I noticed the uprev of bind9 libs in focal:
focal (net): 1:9.11.16+dfsg-3~build1
focal-updates (net): 1:9.11.16+dfsg-3~ubuntu1
impish (net): 1:9.11.19+dfsg-2.1ubuntu1
jammy (net): 1:9.11.19+dfsg-2.1ubuntu3
kinetic (net): 1:9.11.19+dfsg-2.1ubuntu3

I couldn't find any related issue on the isc-dhcp tracker, etc. I did build dhclient from the Debian master branch (https://salsa.debian.org/debian/isc-dhcp/-/commits/master/debian) which uses the in-tree bind libs and that seems to have addressed the issue for all scenarios. Not that it helps much to bisect this just yet.

Revision history for this message
Chris Patterson (cjp256) wrote :
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for Focal which fixes this bug.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for Jammy which fixes this bug.

tags: added: focal jammy sts
no longer affects: isc-dhcp (Ubuntu Focal)
no longer affects: isc-dhcp (Ubuntu Jammy)
Changed in isc-dhcp (Ubuntu):
status: New → Invalid
Changed in bind9-libs (Ubuntu Focal):
status: New → In Progress
Changed in bind9-libs (Ubuntu Jammy):
status: New → In Progress
Changed in bind9-libs (Ubuntu):
status: New → Fix Released
Changed in bind9-libs (Ubuntu Focal):
importance: Undecided → High
Changed in bind9-libs (Ubuntu Jammy):
importance: Undecided → High
Changed in bind9-libs (Ubuntu Focal):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in bind9-libs (Ubuntu Jammy):
assignee: nobody → Matthew Ruffell (mruffell)
summary: - dhclient doesn't receive dhcp offer from kernel
+ dhclient: thread concurrency race leads to DHCPOFFER packets not being
+ received
Revision history for this message
Matthew Ruffell (mruffell) wrote :

packet capture from a reproduction run

description: updated
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Screenshot of wireshark.

description: updated
tags: added: sts-sponsor
tags: removed: sts-sponsor
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Matthew,

Thanks for the excellent analysis and considerate fix proposal, as always!

I looked at this for the last couple of days, for potential sponsorship.

I have attentively gone through the SRU template and Other Info section,
and considered the proposal to switch bind9-libs into --disable-threads,
with the goal of not only address this issue, but also prevent others:

> So, we have two options for a fix for Focal and Jammy:
>
> 1) We disable threading for dhclient.
> 2) We add in a mutex to resolve this particular concurrency issue.
> [...]
> I think if we fix the problem, another issue will crop up in six months
> time, and it will be another concurrency issue.

...

I'm aware you realize such change is concerning :) thus explained it well.

Changing this is Focal (around for almost 3 years) brings regression risk
to an amount I have the _impression_ the SRU team would not be okay with.

And even though I agree with your analysis, proposal and risk assessment,
I'm a bit concerned too, specially as this touches DHCP / IP addressing.

(I'm also very aware this is ultimately their call, not mine at all. :)

...

However, considering how much work and time have likely gone into this
(and internal status) I can't just say 'no' without trying to help out.

I'd like to bring a different opinion.

The reason it's concerning is the very same reason 2) is reasonable:

This concurrency issue (and potential for other concurrency issues)
has been around with Focal since 2020/04 (~3 years), and until now,
its impact does not seem to statistically significant:

> This happens about once every 100 reboots on bare metal, or [...]
> affecting between ~0.3% to 2% of deployments on Microsoft Azure.

So, if there's a way to fix this particular concurrency issue with
less regression risk, that might be worth it, as it would build on
top of dhclient's life on Focal, instead of starting it over again.

...

So, while reviewing the source code for your analysis, I had ideas.

First, a synthetic reproducer with GDB that works every time.

Second, a patch that addressed the issue with the test above.
(It's not final form, I'd like to add a way to turn it off.)

...

Could you please review and verify both, and share your
thoughts on possibly going with that proposal instead?

Of course, if you disagree with the argument or approach,
or if turns out not to work on your end/tests, that's OK!

We would defer this to the Foundations team and SRU team.

- Test steps in the next comment.
- Test packages in ppa:mfo/lp1926139 [1].
- Debdiff attached for reference (code has details).

(Right now only Focal patches/packages are available.
I can go look at Jammy depending on your feedback.)

Hope this helps, after all.
Thanks again!

[1] https://launchpad.net/~mfo/+archive/ubuntu/lp1926139

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (9.1 KiB)

Reproducer based on GDB and DHCP noise injection.

It uses 3 veth pairs (DHCP server/client/injector,
the latter two under namespaces) on a linux bridge.

LXD VM:

 $ lxc launch ubuntu:focal lp1926139-focal --vm
 $ lxc shell lp1926139-focal

Network Setup:

 # ip link add br0 type bridge
 # ip link set br0 up

 # ip link add veth0 type veth peer name veth0br
 # ip link set veth0 up
 # ip link set veth0br up master br0

 # ip netns add ns1
 # ip link add veth1 netns ns1 type veth peer name veth1br
 # ip -n ns1 link set veth1 up
 # ip link set veth1br up master br0

 # ip netns add ns2
 # ip link add veth2 netns ns2 type veth peer name veth2br
 # ip -n ns2 link set veth2 up
 # ip link set veth2br up master br0

Network Check:

 # ip link show type veth | grep veth
 5: veth0br@veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DEFAULT group default qlen 1000
 6: veth0@veth0br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
 7: veth1br@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DEFAULT group default qlen 1000
 8: veth2br@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DEFAULT group default qlen 1000

 # ip -n ns1 link show type veth | grep veth
 2: veth1@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

 # ip -n ns2 link show type veth | grep veth
 2: veth2@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

DHCP Server Setup:

 # apt install -y isc-dhcp-server

 # ip addr add 192.168.42.1/24 dev veth0

 # echo 'INTERFACESv4="veth0"' >>/etc/default/isc-dhcp-server

 # cat <<EOF >>/etc/dhcp/dhcpd.conf
 subnet 192.168.42.0 netmask 255.255.255.0 {
   range 192.168.42.100 192.168.42.200;
 }
 EOF

 # systemctl restart isc-dhcp-server.service
 # systemctl status isc-dhcp-server.service | grep Active:
      Active: active (running) since Thu 2023-01-19 02:06:18 UTC; 19s ago

 # ss -nlp | grep 0.0.0.0:67
 udp UNCONN 0 0 0.0.0.0:67 0.0.0.0:* users:(("dhcpd",pid=3279,fd=9))

DHCP Server Check:

 # ip netns exec ns1 \
   dhclient -v veth1
 ...
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xd147ab17)
 DHCPOFFER of 192.168.42.100 from 192.168.42.1
 DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 (xid=0x17ab47d1)
 DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0xd147ab17)
 bound to 192.168.42.100 -- renewal in 245 seconds.

 # ip netns exec ns1 \
   dhclient -v veth1 -r
 ...
 DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 (xid=0x1cd4aacf)

DHCP Noise Setup:

 # ip -n ns2 addr add 192.168.42.2/24 dev veth2

 # ip netns exec ns2 \
   /bin/sh -c 'while sleep 0.1; do echo; done | nc -u -v -b -s 192.168.42.2 -p 67 255.255.255.255 68' &
 Connection to 255.255.255.255 68 port [udp/bootpc] succeeded!

 i.e., every 0.1 seconds, broadcast a message as DHCP (port 67) to DHCP client receive (port 68).

DH...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Revision history for this message
Chris Patterson (cjp256) wrote :

Great work Maurico, I think you make several excellent points and I appreciate your efforts on a better reproducer and alternative patch. FWIW I began testing the Matthew's initial build (which disabled threads) against a large number of VMs and that appeared to address the issues we're seeing. I'm cutting those tests short and am updating the tests now to use your patch as provided by Matthew and we'll see how that goes!

description: updated
no longer affects: bind9-libs (Ubuntu Focal)
no longer affects: bind9-libs (Ubuntu Jammy)
Changed in bind9-libs (Ubuntu):
status: Fix Released → Won't Fix
no longer affects: bind9-libs (Ubuntu Focal)
no longer affects: bind9-libs (Ubuntu Jammy)
Changed in isc-dhcp (Ubuntu Focal):
status: New → In Progress
Changed in isc-dhcp (Ubuntu Jammy):
status: New → In Progress
Changed in isc-dhcp (Ubuntu Focal):
importance: Undecided → High
Changed in isc-dhcp (Ubuntu Jammy):
importance: Undecided → High
Changed in isc-dhcp (Ubuntu Focal):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in isc-dhcp (Ubuntu Jammy):
assignee: nobody → Mauricio Faria de Oliveira (mfo)
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Matthew, Chris,

Apparently there's a simpler, less intrusive, and more specific way to do this.

Apologies that I missed this earlier, but I found more about the possibilities in bind9-libs functions while checking the previous fix approach for regressions.

Could you please provide your thoughts, Matthew?

If it looks good for you, please feel free to discuss additional testing with Chris, if at all possible.

P.S.: the workaround disable switches are in, via environment variable and kernel cmdline option.

Thanks!
Mauricio

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test packages:

https://launchpad.net/~mfo/+archive/ubuntu/lp1926139
isc-dhcp 4.4.1-2.1ubuntu5.20.04.4+lp1926139.2

Default behavior: issue fixed.
---

 (gdb) break omapip/dispatch.c:333
 (gdb) commands
 shell sleep 0.2
 continue
 end

 (gdb) run -d -v veth1
 ...
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x9679b264)
 DHCPOFFER of 192.168.42.100 from 192.168.42.1
 DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 (xid=0x64b27996)
 DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x9679b264)
 ...
 ^C
 (gdb) kill

 Release address.

 (gdb) run -d -v veth1 -r
 ...

Original behavior with environment variable: issue observed.
---

 (gdb) set environment DHCP_FD_FLAGS_POKE 0
 (gdb) run -d -v veth1
 ...
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xc2db3363)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 (xid=0xc2db3363)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 (xid=0xc2db3363)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 6 (xid=0xc2db3363)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 16 (xid=0xc2db3363)
 ^C
 ...
 (gdb) kill

 (gdb) unset environment DHCP_FD_FLAGS_POKE

Original behavior with kernel cmdline option: issue observed.
---

 (gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" >/tmp/cmdline
 (gdb) shell mount --bind /tmp/cmdline /proc/cmdline
 (gdb) shell cat /proc/cmdline
 BOOT_IMAGE=/boot/vmlinuz-5.4.0-1084-kvm root=PARTUUID=a1286399-334e-4597-b30f-da227b6c076b ro console=tty1 console=ttyS0 panic=-1 dhcp.fd_flags_poke=0

 (gdb) run -d -v veth1
 ...
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x938a6b0b)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 4 (xid=0x938a6b0b)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 8 (xid=0x938a6b0b)
 DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 20 (xid=0x938a6b0b)
 ^C
 ...
 (gdb) kill

 (gdb) shell umount /proc/cmdline

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Jammy/22.04:
- test packages in ppa:mfo/lp1926139
- reproduction steps delta (based on comment #9)

...

Reproducer based on GDB and DHCP noise injection.

It uses 3 veth pairs (DHCP server/client/injector,
the latter two under namespaces) on a linux bridge.

...

LXD VM:

 lxc launch ubuntu:jammy lp1926139-jammy --vm
 lxc shell lp1926139-jammy

GDB Reproducer (original package):
==============

Debug symbols:

 # wget https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3_amd64.ddeb
 # apt install -y ./isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3_amd64.ddeb

Source code line numbers (for breakpoint):

 198 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
 259 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
 279 for (p = omapi_io_states.next;

Attempt to reproduce the issue
with a delay introduced via breakpoint on line 279:

 # ip netns exec ns1 \
   gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' -ex 'set confirm off' -q dhclient

 (gdb) break omapip/dispatch.c:279
 (gdb) commands
 shell sleep 0.2
 continue
 end
 (gdb) run -v -d veth1

GDB Reproducer (patched package):
==============

Client & Debug symbols:

 # wget \
   https://launchpad.net/~mfo/+archive/ubuntu/lp1926139/+files/isc-dhcp-client_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.deb \
   https://launchpad.net/~mfo/+archive/ubuntu/lp1926139/+files/isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.ddeb

  # sudo apt install \
     ./isc-dhcp-client_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.deb \
     ./isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.ddeb

Source code line numbers (for breakpoint):

  233 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
  312 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
  333 for (p = omapi_io_states.next;

Attempt to reproduce the issue again, the same way,
with a delay introduced via breakpoint on line 333:

 # ip netns exec ns1 \
   gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' -ex 'set confirm off' -q dhclient

 (gdb) break omapip/dispatch.c:333
 (gdb) commands
 shell sleep 0.2
 continue
 end
 (gdb) run -v -d veth1

...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Pressing Page Down 17 times to go over the bug description sounds like a new record! ;-)

Just in case future reviewers don't find that as exciting while going through reviews, I'll move some text into comments and reference them from the description.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (18.1 KiB)

From the 'Other Info' section.
Detailed analysis of the issue and more.

[Other Info]

Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

When you tcpdump dhclient, we see all DHCPDISCOVER packets being replied to with DHCPOFFER packets, but the got_one() callback is never called, dhclient does not read these DHCPOFFER packets, and continues sending DHCPDISCOVER packets. Once it reaches 25 DHCPDISCOVER packets sent, it gives up.

This behaviour led several bug reporters to believe it was a kernel issue, with the kernel not pushing DHCPOFFER packets to dhclient. This is not the case, the actual problem is dhclient containing a thread concurrency race condition, and when the race occurs, the read socket is closed prematurely, and dhclient does not read any of the DHCPOFFER replies.

tcpdump: 'Screenshot of Wireshark' attached.

...

I was reading around the upstream issue trackers, and found the following two bug reports:

https://gitlab.isc.org/isc-projects/dhcp/-/issues/264
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996356

The ISC upstream report was actually quite detailed, and it has the same symptoms of what we are experiencing.

Let's have a look at the root cause. The code I am using is isc-dhcp
4.4.1-2.1ubuntu5.20.04.4 from Focal.

common/discover.c

 567 void
 568 discover_interfaces(int state) {
...
1002 case AF_INET:
1003 default:
1004 status = omapi_register_io_object((omapi_object_t *)tmp,
1005 if_readsocket,
1006 0, got_one, 0, 0);
1007 break;
1008 }
...

In discover.c, we call discover_interfaces() to iterate over the interfaces, and attempt to register a raw socket against it. We do this by calling omapi_register_io_object() which is used for reading data, and calls the elusive got_one() callback that you instrumented your code to see if it was being called or not.

omapip/dispatch.c

196 /* Register an I/O handle so that we can do asynchronous I/O on it. */
197
198 isc_result_t omapi_register_io_object (omapi_object_t *h,
199 int (*readfd) (omapi_object_t *),
200 int (*writefd) (omapi_object_t *),
201 isc_result_t (*reader)
202 (omapi_object_t *),
203 isc_result_t (*writer)
204 (omapi_object_t *),
205 isc_result_t (*reaper)
206 (omapi_object_t *))
207 {
...
241 /*
242 * Attach the I/O object to the isc socket library via the
243 * fdwatch function. This allows the socket library to watch
244 * over a socket that we built. If there are both a read and
245 * a write socket we asssume they are the same socket.
246 */
247
248 if (readfd) {
249 fd_flags |= ISC_SOCKFDWATCH_READ;
250 fd = readfd(h);
251 }
...
257
258 if (fd_flags != 0) {
259 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
260 fd, fd_flags,
261 omapi_iscsock_cb,
262 ...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

From the 'Where problems could occur' section.
Considerations on regressions and approaches.

[Where problems could occur]

isc-dhcp is a core package, and any change comes with the risk that users would not be able to receive dhcp leases with dhclient, leaving their systems with no IP address and unreachable, and could potentially cripple images that depend on it, e.g. Microsoft Azure uses dhclient called from cloud-init, instead of systemd-networkd, so a regression could potentially affect all Ubuntu users on Azure.

Additionally, the code is called whenever sockets are constructed, and isc-dhcp-server could also be affected.

We have mitigated the risks of regression as best as possible by adding as much detail as possible to this launchpad bug, so it is clear how the race operates and how the patch fixes the issue.

Mauricio has additionally added a environment variable and a kernel command line parameter, that when present, disables the fix from operating. If a regression were to occur, users can add these parameters to their deployments to work around any issues.

Mauricio and Matthew have decided that the individual fix route is best in terms of lessening regression risk, as the alternate solution would be to disable threading on bind9-libs.

Disabling threading on bind9-libs, while complete as a solution, and removes the risk of a future regression caused by thread concurrency issues that are currently undetected, comes with the fact that it removes publicly exported symbols from bind9-libs, and adds others, and changes the entire library from multithreaded to single threaded. If any users happen to use bind9-libs outside of isc-dhcp, they would see their applications either fail to work due to missing symbols, or performance would change.

Disabling threading on bind9-libs is shelved, and can be looked at in the future if necessary.

Back to the individual fix solution, Chris Patterson, has been testing this solution at scale on Azure, and in 13k instances, has not had a failure. With the gdb reproducer, we are confident that adding the mutex will not prevent other parts of the software from functioning correctly.

description: updated
tags: added: se-sru-sponsor-mfo
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Martijn, or anyone else affected,

Accepted isc-dhcp into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/isc-dhcp/4.4.1-2.3ubuntu2.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in isc-dhcp (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Martijn, or anyone else affected,

Accepted isc-dhcp into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/isc-dhcp/4.4.1-2.1ubuntu5.20.04.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in isc-dhcp (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed-focal
FinnRM (finn-rm)
Changed in isc-dhcp (Ubuntu Focal):
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (3.8 KiB)

verification done on jammy
(full steps in comment #9)

The issue is not reproducible with the package in -proposed,
and is reproducible with the 2 switches for the old behavior
(DHCP_FD_FLAGS_POKE=0 or dhcp.fd_flags_poke=0), as expected.

...

# add-apt-repository -yp proposed

# apt policy isc-dhcp-client
isc-dhcp-client:
  Installed: 4.4.1-2.3ubuntu2.3
  Candidate: 4.4.1-2.3ubuntu2.4
  Version table:
     4.4.1-2.3ubuntu2.4 500
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
...

# wget https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.4_amd64.ddeb

# apt install -y isc-dhcp-client ./isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.4_amd64.ddeb gdb

Source code line numbers (for breakpoint):

 233 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
 312 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
 333 for (p = omapi_io_states.next;

# ip netns exec ns1 \
  gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' -ex 'set confirm off' -q dhclient

(gdb) break omapip/dispatch.c:333
(gdb) commands
shell sleep 0.2
continue
end
(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x558f3b9b1180, readfd=0x558f3b00f150 <if_readsocket>, writefd=0x0, reader=0x558f3b0114b0 <got_one>, writer=0x0, reaper=0x0) at ../omapip/dispatch.c:343
343 ../omapip/dispatch.c: No such file or directory.
Sending on Socket/fallback
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x6909975b)
DHCPOFFER of 192.168.42.100 fbreak omapip/dispatch.c:333rom 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 (xid=0x5b970969)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x6909975b)
[Detaching after fork from child process 34351]
bound to 192.168.42.100 -- renewal in 301 seconds.
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...

(gdb) run -d -v veth1 -r
...
DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 (xid=0x270d7f02)
...

<<< WORKS 10/10 >>>

...

(gdb) set environment DHCP_FD_FLAGS_POKE 0

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x55e3f8364180, readfd=0x55e3f7828150 <if_readsocket>, writefd=0x0, reader=0x55e3f782a4b0 <got_one>, writer=0x0, reaper=0x0) at ../omapip/dispatch.c:343
343 ../omapip/dispatch.c: No such file or directory.
Sending on Socket/fallback
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255break omapip/dispatch.c:333.255.255.255 port 67 interval 6 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0xd7155345)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) unset environment DHCP_FD_FLAGS_POKE

...

(gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" >/tmp/cmdline
(gdb) shell mount --bind /tmp/cmdline /proc/cmdline
(gdb) shell cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-10...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (3.9 KiB)

verification done on focal
(full steps in comment #9)

The issue is not reproducible with the package in -proposed,
and is reproducible with the 2 switches for the old behavior
(DHCP_FD_FLAGS_POKE=0 or dhcp.fd_flags_poke=0), as expected.

...

# add-apt-repository -y 'deb http://archive.ubuntu.com/ubuntu focal-proposed main'

# apt policy isc-dhcp-client
isc-dhcp-client:
  Installed: 4.4.1-2.1ubuntu5.20.04.4
  Candidate: 4.4.1-2.1ubuntu5.20.04.5
  Version table:
     4.4.1-2.1ubuntu5.20.04.5 500
        500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
...

# wget https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-client-dbgsym_4.4.1-2.1ubuntu5.20.04.5_amd64.ddeb

# apt install -y isc-dhcp-client ./isc-dhcp-client-dbgsym_4.4.1-2.1ubuntu5.20.04.5_amd64.ddeb gdb

Source code line numbers (for breakpoint):

 233 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
 312 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
 333 for (p = omapi_io_states.next;

# ip netns exec ns1 \
  gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' -ex 'set confirm off' -q dhclient

(gdb) break omapip/dispatch.c:333
(gdb) commands
shell sleep 0.2
continue
end
(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x561afb034940, readfd=0x561afad13630 <if_readsocket>, writefd=writefd@entry=0x0, reader=0x561afad30fb0 <fallback_discard>, writer=writer@entry=0x0, reaper=reaper@entry=0x0) at dispatch.c:337
337 in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x72ac8a14)
DHCPOFFER of 192.168.42.100 from 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 (xid=0x148aac72)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x72ac8a14)
[Detaching after fork from child process 1037683]
bound to 192.168.42.100 -- renewal in 290 seconds.
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...

(gdb) run -d -v veth1 -r
...
DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 (xid=0x20449570)
...

<<< WORKS 10/10 >>>

...

(gdb) set environment DHCP_FD_FLAGS_POKE 0

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x557c0d5d1350, readfd=0x557c0beb8630 <if_readsocket>, writefd=0x0, reader=0x557c0bed5fb0 <fallback_discard>, writer=0x0, reaper=0x0) at dispatch.c:337
337 in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 6 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 8 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 10 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 21 (xid=0xa5a2783d)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) unset environment DHCP_FD_FLAGS_POKE

...

(gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" >/tmp/cmdline
(gdb) shell mount --bind /tmp/cmdline /proc/cmdline
(gdb) shell cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-1084-kvm root=PARTUUID=7a9ea63e-b971-413c-9238-d59509520a9e ro co...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The -proposed packages have also been verified by the Microsoft Azure team.
14k+ VM deployments with the fix didn't observe the issue over the weekend.

Marking jammy/focal as verified based on that plus 2 synthetic tests above.

tags: added: verification-done verification-done-focal verification-done-jammy
removed: verification-needed verification-needed-focal verification-needed-jammy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package isc-dhcp - 4.4.1-2.3ubuntu2.4

---------------
isc-dhcp (4.4.1-2.3ubuntu2.4) jammy; urgency=medium

  [ Mauricio Faria de Oliveira ]
  * Prevent race condition that might ignore DHCP OFFERs/ACKs
    when dhclient receives DHCP traffic noise. (LP: #1926139)
    The previous/racy behavior can be switched back on with
    the 'DHCP_FD_FLAGS_POKE=0' environment variable or
    the 'dhcp.fd_flags_poke=0' kernel cmdline option.
    - d/p/lp1926139-watch-socket-fd-later.patch: fix, switches.
    - d/apparmor/sbin.dhclient,usr.sbin.dhcpd: /proc/cmdline r.

  [ Steve Langasek ]
  * Include /etc/dhcp/dhclient-exit-hooks.d/rfc3442-classless-routes
    in the initramfs. (LP: #1937110)
    - d/initramfs-tools/share/hooks/zz-dhclient: copy_exec it.

 -- Mauricio Faria de Oliveira <email address hidden> Tue, 31 Jan 2023 18:54:40 -0300

Changed in isc-dhcp (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for isc-dhcp has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package isc-dhcp - 4.4.1-2.1ubuntu5.20.04.5

---------------
isc-dhcp (4.4.1-2.1ubuntu5.20.04.5) focal; urgency=medium

  [ Mauricio Faria de Oliveira ]
  * Prevent race condition that might ignore DHCP OFFERs/ACKs
    when dhclient receives DHCP traffic noise. (LP: #1926139)
    The previous/racy behavior can be switched back on with
    the 'DHCP_FD_FLAGS_POKE=0' environment variable or
    the 'dhcp.fd_flags_poke=0' kernel cmdline option.
    - d/p/lp1926139-watch-socket-fd-later.patch: fix, switches.
    - d/apparmor/sbin.dhclient,usr.sbin.dhcpd: /proc/cmdline r.

  [ Steve Langasek ]
  * Include /etc/dhcp/dhclient-exit-hooks.d/rfc3442-classless-routes
    in the initramfs. (LP: #1937110)
    - d/initramfs-tools/share/hooks/zz-dhclient: copy_exec it.

 -- Mauricio Faria de Oliveira <email address hidden> Tue, 31 Jan 2023 19:10:35 -0300

Changed in isc-dhcp (Ubuntu Focal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.