network-manager fails to renew ipv6 address

Bug #1969901 reported by Håkan Kvist
This bug affects 2 people
Affects                    Status         Importance   Assigned to   Milestone
network-manager (Ubuntu)   Fix Released   Low          Unassigned
Bionic                     Won't Fix      Low          Unassigned

Bug Description

[Impact]

 * This affects Ubuntu 18.04 where Network Manager version 1.10.6 is used.

 * Network Manager might kill dhclient(6) and fail to start it again,
   causing the IPv6 address to be lost on a network that uses mixed
   IPv4/IPv6.
   The network status will still be shown as online in GNOME since IPv4
   is still active.
   The user then has to manually remove the DHCPv6 lease files and
   restart the IPv6 connection or restart Network Manager to regain IPv6
   connectivity.

 * This is a cherry-pick from Network manager 1.10.8 (Ubuntu's version
   is based on 1.10.6):
   https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/7fbbe7ebee99785e38d39c37e515a64a28edef0f

 * Upstream bug:
   https://bugzilla.gnome.org/show_bug.cgi?id=783391

[Test Plan]

The test requires three computers:
* one computer running the ISC dhcpd server (with the network configured statically)
* one computer running the patched network manager
* one computer running vanilla Ubuntu

The idea is to run the test on an isolated network and trigger the error by
changing the IP range handed out by the DHCP server, forcing a NAK response
back to the clients.

Expected result
* patched network manager keeps dhclient6 alive
* vanilla network manager fails to keep dhclient6 alive:
  in the network manager logs, dhcp6 expires and is not restarted

ON THE SERVER
# Disable AppArmor, as it has rules restricting dhcpd
aa-teardown

# Install the ISC DHCP server
sudo apt install isc-dhcp-server

# configure network static
sudo nmcli connection modify "${CONNECTION_NAME}" \
     ipv4.method "manual" \
     ipv4.addresses "192.168.1.1/24" \
     ipv4.gateway "192.168.1.254" \
     ipv4.dns "192.168.1.1" \
     ipv6.method "manual" \
     ipv6.addresses "2001:db8:0:1::1/64" \
     ipv6.gateway "2001:db8:0:1::ffbb" \
     ipv6.dns "2001:db8:0:1::1"

mkdir -p tmp
touch tmp/dhcpd4_a.leases
touch tmp/dhcpd4_b.leases
touch tmp/dhcpd6_a.leases
touch tmp/dhcpd6_b.leases

Then run dhcpd with the following options:
-f - run in the foreground
-d - log to stderr instead of syslog

# Start in separate terminals
dhcpd -f -d -4 -cf dhcp_v4_a.conf -lf tmp/dhcpd4_a.leases enp0s31f6
dhcpd -f -d -6 -cf dhcp_v6_a.conf -lf tmp/dhcpd6_a.leases enp0s31f6

Press Ctrl-C to kill the servers, then restart them with the b configurations:

dhcpd -f -d -4 -cf dhcp_v4_b.conf -lf tmp/dhcpd4_b.leases enp0s31f6
dhcpd -f -d -6 -cf dhcp_v6_b.conf -lf tmp/dhcpd6_b.leases enp0s31f6

Then wait for the leases to expire (check for clients that kill dhclient).
Press Ctrl-C to kill the servers, then restart them with the a configurations.
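
The dhcp_v4_*.conf and dhcp_v6_*.conf files used above are attached to the bug report and are not reproduced in this text. As a rough sketch only, files along the following lines might serve the same purpose. The address ranges, the lease times and the helper names (write_v4_conf, write_v6_conf, write_all_confs) are assumptions for illustration, not the attached files. The a and b variants differ only in the range handed out, so a client that tries to renew an address from the other range should be refused.

```shell
#!/bin/sh
# Hypothetical dhcpd configurations for the a/b test cycle.
# Ranges and lease times are illustrative assumptions, not the
# configuration files attached to the bug report.

# $1 = output file, $2 = first address, $3 = last address
write_v4_conf() {
    cat > "$1" <<EOF
authoritative;
default-lease-time 120;
max-lease-time 120;
subnet 192.168.1.0 netmask 255.255.255.0 {
    range $2 $3;
    option routers 192.168.1.254;
    option domain-name-servers 192.168.1.1;
}
EOF
}

# $1 = output file, $2 = first address, $3 = last address
write_v6_conf() {
    cat > "$1" <<EOF
default-lease-time 120;
max-lease-time 120;
subnet6 2001:db8:0:1::/64 {
    range6 $2 $3;
    option dhcp6.name-servers 2001:db8:0:1::1;
}
EOF
}

# Write all four files into the directory given as $1 (default: current).
write_all_confs() {
    dir="${1:-.}"
    write_v4_conf "$dir/dhcp_v4_a.conf" 192.168.1.100 192.168.1.149
    write_v4_conf "$dir/dhcp_v4_b.conf" 192.168.1.150 192.168.1.199
    write_v6_conf "$dir/dhcp_v6_a.conf" 2001:db8:0:1::1000 2001:db8:0:1::1fff
    write_v6_conf "$dir/dhcp_v6_b.conf" 2001:db8:0:1::2000 2001:db8:0:1::2fff
}
```

After calling write_all_confs, each file can be syntax-checked before starting a server with "dhcpd -t -4 -cf dhcp_v4_a.conf" or "dhcpd -t -6 -cf dhcp_v6_a.conf". Short lease times are chosen here only so that expiry happens within minutes.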

ON THE CLIENTS
Setup:
Configure ipv6 network in settings to use dhcp (using the gui)

Test:
check that dhcp clients are still running:
ps aux|grep dhclient

Expected in output
one client for dhcpv4
one client for dhcpv6
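
The check above can be reduced to a count per protocol. The sketch below assumes that NetworkManager starts the DHCPv6 client with a "-6" argument and the DHCPv4 client without one; classify_dhclients is a hypothetical helper name, not part of any package.

```shell
#!/bin/sh
# Count dhclient command lines read from stdin, one per line.
# Assumption: the DHCPv6 client is started with a " -6 " argument.
classify_dhclients() {
    v4=0
    v6=0
    while IFS= read -r line; do
        case "$line" in
            *dhclient*" -6 "*) v6=$((v6 + 1)) ;;
            *dhclient*)        v4=$((v4 + 1)) ;;
        esac
    done
    echo "dhcpv4=$v4 dhcpv6=$v6"
}
```

It could be fed with "ps -eo args | grep '[d]hclient' | classify_dhclients"; the bracket in the pattern keeps the grep process itself out of the count. A healthy client would then report one process for each protocol.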

Also check network manager status :
journalctl -u NetworkManager.service
journalctl -u NetworkManager.service|grep dhcp6 # to only view dhclient6
journalctl -u NetworkManager.service|grep dhcp4 # to only view dhclient4
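
To narrow the log down to the failure described under "Expected result", the dhcp6 lines that mention an expiry can be filtered out. This is a minimal sketch: the exact wording of the NetworkManager messages is not quoted in this report, so matching on "dhcp6" together with "expire" is an assumption, and dhcp6_expiry_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Print only the log lines from stdin that mention both dhcp6 and an expiry.
# Assumption: the relevant messages contain the strings "dhcp6" and "expire".
dhcp6_expiry_lines() {
    grep -i 'dhcp6' | grep -i 'expire'
}
```

A possible invocation is "journalctl -u NetworkManager.service --since '1 hour ago' | dhcp6_expiry_lines". On an unpatched client the matching lines would be expected not to be followed by a new dhcp6 transaction.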

If a dhclient is not running:
stop the network in the GUI
remove the lease file (/var/lib/NetworkManager/dhclient*.lease), but only the one for the client that is not running
start the network in the GUI

If both dhclients are running:
wait an additional ten minutes, then repeat from the beginning of the test
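
The recovery steps for a missing DHCPv6 client can also be done from a terminal. The sketch below assumes that the DHCPv6 lease files match dhclient6-*.lease inside /var/lib/NetworkManager (the report only gives the broader pattern dhclient*.lease); remove_dhcp6_leases is a hypothetical helper name.

```shell
#!/bin/sh
# Remove DHCPv6 lease files from a lease directory, leaving DHCPv4 leases.
# Assumption: DHCPv6 leases are named dhclient6-*.lease.
# $1 = lease directory (default: /var/lib/NetworkManager)
remove_dhcp6_leases() {
    lease_dir="${1:-/var/lib/NetworkManager}"
    for f in "$lease_dir"/dhclient6-*.lease; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        rm -f -- "$f"
        echo "removed $f"
    done
}
```

Run as root, the sequence would be: nmcli connection down "${CONNECTION_NAME}", then remove_dhcp6_leases, then nmcli connection up "${CONNECTION_NAME}". This mirrors the GUI steps above rather than replacing them.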

[Where problems could occur]

 * The change is in the DHCP lease expiration handling, so verify that there is no regression in DHCP renewals on different types of configuration, including IPv6.

[Other Info]
 * We have tested this patch on a couple of clients where we have seen
   this problem. If this patch is feasible to include in Ubuntu 18.04, we
   could ask more users to test.

Tags: patch
description: updated
description: updated
Håkan Kvist (hakankvist) wrote :
description: updated
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
ArchPhoenix team (archphoenix) wrote :

Was able to reproduce, my workaround was a systemd timer to clear the dhcp lease files and restart networkmanager... which is rather violent.

Confirmed as a client-side problem: Win7, Win10, macOS Monterey and Android 11 have no problems and do not lose their IPv6 addresses on the same network.

Sebastien Bacher (seb128) wrote :

Thanks, the issue is fixed in newer series. I've sponsored your Bionic SRU now and updated the description to be a bit more specific about the impact and what to verify.

Changed in network-manager (Ubuntu):
importance: Undecided → Low
status: Confirmed → Fix Released
description: updated
Robie Basak (racb) wrote :

> The exact conditions for reproducing this bug on mixed IPv4/IPv6 networks are not known

Looking at the upstream commit description, isn't it just that a DHCPv6 lease expires and the server NAKs a request for the same IP again? Or is that not sufficient to trigger the problem?

In any case, I appreciate there might be difficulty in testing the fix, but what's the actual criteria you propose to use to decide when the bug is to be marked verification-done-bionic?

Changed in network-manager (Ubuntu Bionic):
status: New → Incomplete
Håkan Kvist (hakankvist) wrote :

>> The exact conditions for reproducing this bug on mixed IPv4/IPv6 networks are not known
>
>Looking at the upstream commit description, isn't it just that a DHCPv6 lease expires and the server NAKs a request for the same IP again? Or is that not sufficient to trigger the problem.
>
Yes, when having a look at the previously collected logs, that seems to be the case (journalctl -u NetworkManager.service).

Some of our computers get this problem more often than others. Some people claim that they have never seen the problem. That is the part that is unclear.

>In any case, I appreciate there might be difficulty in testing the fix, but what's the actual criteria you propose to use to decide when the bug is to be marked verification-done-bionic?

In the best of worlds I would like a simulated environment where DHCP packets could be controlled, but that is not feasible.

I have been running this patch on two computers since 2022-04-13 and haven't seen the problem since. One laptop (restarted every day) and one desktop that is always on.
The desktop has been running for 29 days continuously according to syslog, without me having to manually remove dhcp lease files and restart network manager.

Ideas for getting confidence of this change:
We could ask more users who have experienced this error to install this change and confirm if they experience lost ipv6 addresses after installing patched version.

Another idea is to shut the computer down into single-user mode (without network) and edit the dhcp6 lease file so that the DHCP server responds with a NAK when booting up in multi-user mode.
What to change in the lease file I do not know.

Robie Basak (racb) wrote :

Thank you for the comments. These seem like good ideas, but need details before they are actionable.

I think this update is still blocked on having a specific, step-by-step test plan please.

Håkan Kvist (hakankvist) wrote :

We are investigating proposals for a test plan, but we need to come up with something that works.

The former proposal written in this comment did not work.

Mathew Hodson (mhodson)
Changed in network-manager (Ubuntu Bionic):
importance: Undecided → Low
Håkan Kvist (hakankvist) wrote :

Proposal for a test plan. We have tested this for 5 iterations, and it is looking good so far.

The test requires three computers:
* one computer running the ISC dhcpd server (with the network configured statically)
* one computer running the patched network manager
* one computer running vanilla Ubuntu

The idea is to run the test on an isolated network and trigger the error by
changing the IP range handed out by the DHCP server, forcing a NAK response
back to the clients.

Expected result
* patched network manager keeps dhclient6 alive
* vanilla network manager fails to keep dhclient6 alive:
  in the network manager logs, dhcp6 expires and is not restarted

ON THE SERVER
# Disable AppArmor, as it has rules restricting dhcpd
aa-teardown

# Install the ISC DHCP server
sudo apt install isc-dhcp-server

# configure network static
sudo nmcli connection modify "${CONNECTION_NAME}" \
     ipv4.method "manual" \
     ipv4.addresses "192.168.1.1/24" \
     ipv4.gateway "192.168.1.254" \
     ipv4.dns "192.168.1.1" \
     ipv6.method "manual" \
     ipv6.addresses "2001:db8:0:1::1/64" \
     ipv6.gateway "2001:db8:0:1::ffbb" \
     ipv6.dns "2001:db8:0:1::1"

mkdir -p tmp
touch tmp/dhcpd4_a.leases
touch tmp/dhcpd4_b.leases
touch tmp/dhcpd6_a.leases
touch tmp/dhcpd6_b.leases

Then it is time to execute dhcpd
-f - run in foreground
-d - print errors to stderr instead of syslog

# Start in separate terminals
dhcpd -f -d -4 -cf dhcp_v4_a.conf -lf tmp/dhcpd4_a.leases enp0s31f6
dhcpd -f -d -6 -cf dhcp_v6_a.conf -lf tmp/dhcpd6_a.leases enp0s31f6

Press Ctrl-C to kill the servers, then restart them with the b configurations:

dhcpd -f -d -4 -cf dhcp_v4_b.conf -lf tmp/dhcpd4_b.leases enp0s31f6
dhcpd -f -d -6 -cf dhcp_v6_b.conf -lf tmp/dhcpd6_b.leases enp0s31f6

Then wait for the leases to expire (check for clients that kill dhclient).
Press Ctrl-C to kill the servers, then restart them with the a configurations.

ON THE CLIENTS
Setup:
Configure ipv6 network in settings to use dhcp (using the gui)

Test:
check that dhcp clients are still running:
ps aux|grep dhclient

Expected in output
one client for dhcpv4
one client for dhcpv6

Also check network manager status :
journalctl -u NetworkManager.service
journalctl -u NetworkManager.service|grep dhcp6 # to only view dhclient6
journalctl -u NetworkManager.service|grep dhcp4 # to only view dhclient4

If a dhclient is not running:
stop the network in the GUI
remove the lease file (/var/lib/NetworkManager/dhclient*.lease), but only the one for the client that is not running
start the network in the GUI

If both dhclients are running:
wait an additional ten minutes, then repeat from the beginning of the test

Håkan Kvist (hakankvist) wrote :

Configuration files for dhcpd for the proposed test steps in previous comment.

Håkan Kvist (hakankvist) wrote :

Any comments on the proposed test plan?

It worked fine during the described steps.

ArchPhoenix team (archphoenix) wrote :

The status says "Fix Released". Was it released for 20.04 and later but not for 18.04?

Håkan Kvist (hakankvist) wrote :

Ubuntu 20.04 includes a later version of network manager that already contains the fix.

The suggested fix for 18.04 is a backport/cherry-pick of the fix to the older version of network manager included in 18.04.

description: updated
Håkan Kvist (hakankvist) wrote :
Jeremy Bícha (jbicha)
Changed in network-manager (Ubuntu Bionic):
status: Incomplete → In Progress
Robie Basak (racb) wrote :

The test plan looks good - thanks!

On reviewing the patch itself, it looks quite complex. We took a look upstream to see if there were any further fixes on the commit being cherry-picked, and found at least one. See https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commits/nm-1-10/src/devices/nm-device.c and commit d017022. This doesn't appear to be incorporated into the fix in Bionic being proposed here.

Please could you fully analyze what has occurred upstream to ensure that this patch doesn't have any known issues that have since been fixed upstream? If these are false positives, I'd appreciate an explanation. Thanks!

Changed in network-manager (Ubuntu Bionic):
status: In Progress → Incomplete
Håkan Kvist (hakankvist) wrote :

Agreed, commit d017022 seems to be missing.
I will cherry-pick it locally and retest.

I did some more comparing of the file nm-device.c, where the changes were made.

I think d017022 is the only relevant follow-up change.

I compared the original fix commit on the 1.10 track with the latest commit on the 1.10 track by doing:

git blame 7fbbe7ebee99785e38d39c37e515a64a28edef0f -- src/devices/nm-device.c > commit_7fbbe7ebee99785e38d39c37e515a64a28edef0f.txt
git blame remotes/origin/nm-1-10 -- src/devices/nm-device.c > commit_1_10.txt
<diff-tool-of-choice> commit_7fbbe7ebee99785e38d39c37e515a64a28edef0f.txt commit_1_10.txt
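
A complementary check, offered as a sketch rather than as the method used above, is to list every commit on the branch after the fix that touches the same file. followup_commits is a hypothetical helper name; the revisions and path in the usage note are the ones from the commands above.

```shell
#!/bin/sh
# List commits reachable from $2 but not from $1 that touch the path $3.
# $1 = base commit (the original fix), $2 = branch tip, $3 = file path
followup_commits() {
    git log --oneline "$1..$2" -- "$3"
}
```

For example, inside a NetworkManager checkout: followup_commits 7fbbe7ebee99785e38d39c37e515a64a28edef0f remotes/origin/nm-1-10 src/devices/nm-device.c. Each listed commit still has to be read to decide whether it relates to the DHCP lease handling.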

I could not find any other connected changes on rows (or close to rows) modified by 7fbbe7ebee99785e38d39c37e515a64a28edef0f except for the changes by d017022.

The commit d7ebbd69a05c8bee636c5eeba2206176ba29bdc3 (core: implement setting MDNS setting for systemd) touched the same methods, but it added completely new functionality, so it is not related.

Håkan Kvist (hakankvist) wrote :

Updated patch, now also including d017022dfc7e531c23caddeac7b3a8b03b1aa5d0.

We will test this further.

Håkan Kvist (hakankvist) wrote :

I have tested the new version of the debdiff including the fix and the fix of the fix.

Using it daily on my laptop and also tested it in the test setup described in the description.
No issues seen so far.

Mathew Hodson (mhodson)
tags: added: patch
Changed in network-manager (Ubuntu Bionic):
status: Incomplete → Triaged
Julian Andres Klode (juliank) wrote :

Ubuntu 18.04 has reached its end of standard support, hence marking as Won't Fix.

Changed in network-manager (Ubuntu Bionic):
status: Triaged → Won't Fix