NetworkManager does not update IPv4 address lifetime even though DHCP lease was successfully renewed

Bug #1696415 reported by Sjors Gielen on 2017-06-07
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
NetworkManager
Confirmed
Medium
network-manager (Ubuntu)
Low
Unassigned
Xenial
Low
Unassigned

Bug Description

SRU REQUEST:

Debdiff (nm-dhcp-helper.debdiff) attached.

Fixed in current Ubuntu zesty and newer: Bionic uses NM 1.8.x. This bug was fixed upstream in 1.4.

[Impact]

 * nm-dhcp-helper sometimes fails to notify NetworkManager of a DHCP
   lease renewal due to a DBus race condition.

 * Upstream NetworkManager 1.4 fixes the race condition by changing
   nm-dhcp-helper's DBus notification from signal "Event" to method
   "Notify".

 * Original bug submitter backported NM 1.4's nm-dhcp-helper notification fix to NM 1.2. This SRU request applies that backported patch to Xenial's
   NM 1.2.x.

[Test Case]

 * Not reliably reproducible. Out of hundreds of machines, only a
   dozen or so fail to notify NetworkManager of a DHCP lease
   renewal about 30-50% of the time. (i.e. It's always the same
   handful of machines that fail.)

 * All such machines with the patched packages have been fine for weeks,
   over many dozens of lease renewals.

[Regression Potential]

* The patch changes both nm-dhcp-helper and NetworkManager itself. As soon as the new packages are unpacked, the new nm-dhcp-helper will be used on DHCP lease renewals, with the new Notify mechanism. Since the running, old NetworkManager is still expecting Event notifications, the patched nm-dhcp-helper has fallback capability to Event.

* Once NetworkManager is restarted and is running the patched version, it will have the new Notify support.

[Other Info]

 * Upstream bug w/ patch: https://bugzilla.gnome.org/show_bug.cgi?id=784636

 * RHEL bug with links to the 1.4 commits from which the patch was
   derived: https://bugzilla.redhat.com/show_bug.cgi?id=1373276

 * NOTE: The final comment on the upstream GNOME bug claims that the fix is incomplete. However, it is possible that the running NetworkManager was not restarted (see Regression Potential notes above), which is why nm-dhcp-helper is falling back to Event. The remainder of the log messages in that final comment are from a custom wrapper the submitter was running around nm-dhcp-helper. I have deployed the exact same patch (without said wrapper) to real-world systems and tested extensively, and see nothing but successful DHCP lease renewal notifications using D-Bus Notify, not D-Bus Event.

----
I've found an issue on some of our Xenial office machines, causing NetworkManager to drop its IP address lease in some cases when it shouldn't. I'm not sure if the actual bug is in NetworkManager or perhaps dbus or dhclient, but I'll do my best to help to figure out where it is.

What appears to happen:
* NetworkManager is informed of a new IPv4 lease.
* During the lease, dhclient keeps it fresh by renewing it using DHCPREQUESTs regularly.
* In spite of this, NetworkManager drops the IP address from the interface when the last reported lease time expires.

This happens on various machines, once every few days. We are using a failover DHCP configuration using two machines (192.168.0.3 'bonaire' and 192.168.0.4 'curacao').

The machine where I've done the debugging is called 'pampus' (192.168.0.166). As you can see in the logs, at 01:21:06 NetworkManager reports a new lease with lease time 7200.

jun 07 01:21:06 pampus dhclient[1532]: DHCPREQUEST of 192.168.0.166 on eth0 to 192.168.0.4 port 67 (xid=0x3295b440)
jun 07 01:21:06 pampus dhclient[1532]: DHCPACK of 192.168.0.166 from 192.168.0.4
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] address 192.168.0.166
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] plen 24 (255.255.255.0)
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] gateway 192.168.0.5
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] server identifier 192.168.0.4
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] lease time 7200
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] nameserver '192.168.0.3'
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] nameserver '192.168.0.4'
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9530] domain name 'office.screenpointmed.com'
jun 07 01:21:06 pampus NetworkManager[1161]: <info> [1496791266.9531] dhcp4 (eth0): state changed bound -> bound

After this, dhclient is supposed to keep the lease fresh, which it does. E.g. at 03:13:19 you can see a DHCPREQUEST and DHCPACK; I've seen this DHCPACK in a tcpdump and it contains a new lease time of 7200 seconds.

jun 07 03:13:19 pampus dhclient[1532]: DHCPREQUEST of 192.168.0.166 on eth0 to 192.168.0.4 port 67 (xid=0x3295b440)
jun 07 03:13:19 pampus dhclient[1532]: DHCPACK of 192.168.0.166 from 192.168.0.4
jun 07 03:13:19 pampus dhclient[1532]: bound to 192.168.0.166 -- renewal in 2708 seconds.

However, at 03:21:07 (exactly 2 hours and 1 second after the last lease reported by NetworkManager) Avahi and NTP report that the IP address is gone:

jun 07 03:21:07 pampus avahi-daemon[1167]: Withdrawing address record for 192.168.0.166 on eth0.
jun 07 03:21:07 pampus avahi-daemon[1167]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.0.166.
jun 07 03:21:07 pampus avahi-daemon[1167]: Interface eth0.IPv4 no longer relevant for mDNS.
jun 07 03:21:08 pampus ntpd[18832]: Deleting interface #3 eth0, 192.168.0.166#123, interface stats: received=2512, sent=2549, dropped=0, active_time=111819 secs

So I suspect NetworkManager dropped the IP address from the interface, because it wasn't informed by dhclient that the lease was renewed. The logs don't explicitly say this, so I may have to turn on more verbose debugging logs in NetworkManager or dhclient to verify this.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: network-manager 1.2.6-0ubuntu0.16.04.1
ProcVersionSignature: Ubuntu 4.4.0-66.87-generic 4.4.44
Uname: Linux 4.4.0-66-generic x86_64
NonfreeKernelModules: nvidia_uvm nvidia_drm nvidia_modeset nvidia
ApportVersion: 2.20.1-0ubuntu2.6
Architecture: amd64
Date: Wed Jun 7 14:48:59 2017
IfupdownConfig:
 # interfaces(5) file used by ifup(8) and ifdown(8)
 auto lo
 iface lo inet loopback
InstallationDate: Installed on 2016-11-04 (214 days ago)
InstallationMedia: Ubuntu 14.04.5 LTS "Trusty Tahr" - Release amd64 (20160803)
IpRoute:
 default via 192.168.0.5 dev eth0 proto static metric 100
 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.166
 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.166 metric 100
IwConfig:
 lo no wireless extensions.

 eth1 no wireless extensions.

 eth0 no wireless extensions.
NetworkManager.state:
 [main]
 NetworkingEnabled=true
 WirelessEnabled=true
 WWANEnabled=true
 WimaxEnabled=true
RfKill:

SourcePackage: network-manager
UpgradeStatus: No upgrade log present (probably fresh install)
nmcli-con:
 NAME UUID TYPE TIMESTAMP TIMESTAMP-REAL AUTOCONNECT AUTOCONNECT-PRIORITY READONLY DBUS-PATH ACTIVE DEVICE STATE ACTIVE-PATH
 Wired connection 1 37da1802-e1ce-3326-a6d0-f855cc32806d 802-3-ethernet 1496839466 wo 07 jun 2017 14:44:26 CEST yes 4294966297 no /org/freedesktop/NetworkManager/Settings/0 yes eth0 activated /org/freedesktop/NetworkManager/ActiveConnection/0
 Wired connection 2 a040d7fe-3c52-39ba-82b8-50fad0b602c1 802-3-ethernet 1496399665 vr 02 jun 2017 12:34:25 CEST yes 4294966297 no /org/freedesktop/NetworkManager/Settings/1 no -- -- --
nmcli-dev:
 DEVICE TYPE STATE DBUS-PATH CONNECTION CON-UUID CON-PATH
 eth0 ethernet connected /org/freedesktop/NetworkManager/Devices/0 Wired connection 1 37da1802-e1ce-3326-a6d0-f855cc32806d /org/freedesktop/NetworkManager/ActiveConnection/0
 eth1 ethernet unavailable /org/freedesktop/NetworkManager/Devices/2 -- -- --
 lo loopback unmanaged /org/freedesktop/NetworkManager/Devices/1 -- -- --
nmcli-nm:
 RUNNING VERSION STATE STARTUP CONNECTIVITY NETWORKING WIFI-HW WIFI WWAN-HW WWAN
 running 1.2.6 connected started full enabled enabled enabled enabled enabled

Sjors Gielen (sgielen) wrote :
Sjors Gielen (sgielen) wrote :

By setting log_level to DEBUG, I could confirm from the logs that there is a miscommunication between dhclient and NetworkManager causing this issue.

It looks like it is not NetworkManager that removes the IPv4 address from the interface; the address is removed from the interface automatically by the kernel because its lifetime expired:

jun 13 19:26:21 cuba NetworkManager[28642]: <debug> [1497374781.0763] platform: address: adding or updating IPv4 address: 192.168.0.55/24 lft 7200sec pref 7200sec lifetime 99735-0[7200,7200] dev 3 src unknown
jun 13 19:26:21 cuba NetworkManager[28642]: <debug> [1497374781.0763] platform: signal: address 4 changed: 192.168.0.55/24 lft 7200sec pref 7200sec lifetime 99735-99735[7200,7200] dev 3 src kernel
jun 13 21:26:21 cuba NetworkManager[28642]: <debug> [1497381981.3191] platform: signal: address 4 removed: 192.168.0.55/24 lft 0sec pref 0sec lifetime 106935-99735[7200,7200] dev 3 src kernel

The "address: adding or updating IPv4 address" message comes right after receiving a message from dhclient, because it sent a DHCPREQUEST and received a DHCPACK:

jun 13 19:26:21 cuba dhclient[13154]: DHCPREQUEST of 192.168.0.55 on eth1 to 192.168.0.3 port 67 (xid=0xfd7483b)
jun 13 19:26:21 cuba dhclient[13154]: DHCPACK of 192.168.0.55 from 192.168.0.3
jun 13 19:26:21 cuba NetworkManager[28642]: <debug> [1497374781.0748] bus-manager: (dhcp) accepted connection 0x7f9ae000fc60 on private socket
jun 13 19:26:21 cuba NetworkManager[28642]: <debug> [1497374781.0759] dhcp4 (eth1): DHCP reason 'RENEW' -> state 'bound'

Within the 2 hours lifetime of the IP address, another DHCPREQUEST & DHCPACK occurs, and some communication is attempted between dhclient and NetworkManager, but this doesn't result in the state change as seen above:

jun 13 20:20:11 cuba dhclient[13154]: DHCPREQUEST of 192.168.0.55 on eth1 to 192.168.0.3 port 67 (xid=0xfd7483b)
jun 13 20:20:11 cuba dhclient[13154]: DHCPACK of 192.168.0.55 from 192.168.0.3
jun 13 20:20:11 cuba NetworkManager[28642]: <debug> [1497378011.6527] bus-manager: (dhcp) accepted connection 0x7f9ae0019060 on private socket
jun 13 20:20:11 cuba NetworkManager[28642]: <debug> [1497378011.6527] bus-manager: (dhcp) closed connection 0x7f9ae0019060 on private socket
jun 13 20:20:11 cuba dhclient[13154]: bound to 192.168.0.55 -- renewal in 2731 seconds.

So, crucially, there is an attempted communication between dhclient and NetworkManager, but this doesn't result in an update to the lifetime of the IPv4 address. I'll focus my investigation on this. I would appreciate a reply from maintainers or upstream that they are aware of this issue.

Sjors Gielen (sgielen) on 2017-06-14
summary: - NetworkManager seems to drop IPv4 DHCP lease even though it was
- successfully renewed
+ NetworkManager does not update IPv4 address lifetime even though DHCP
+ lease was successfully renewed
Sjors Gielen (sgielen) wrote :

I've closed down on the root cause being the /usr/lib/NetworkManager/nm-dhcp-helper tool. Occasionally, this binary runs, but fails to correctly send the update to NetworkManager. No errors occur when this happens; NetworkManager in debug mode just says "accepted connection on private socket" then "closed connection on private socket" without any updates happening.

I've managed to work around the issue by wrapping /usr/lib/NetworkManager/nm-dhcp-helper in a shell script that simply performs the same lease update until the logs indicate that NetworkManager received it. This doesn't fix the communication problem, but adds a safety net that prevents the resulting issues. It's been tested in an office network of some 12 PCs.

If anyone runs into this issue as well, run the following script to work around it:

-----8<-----
#!/bin/bash

HELPERSCRIPT="/usr/lib/NetworkManager/nm-dhcp-helper"
HELPERBIN="/usr/lib/NetworkManager/nm-dhcp-helper.bin"

function is_elf() {
 readelf -h "$1" >/dev/null 2>&1
 if [ "$?" = "1" ]; then
  echo "0"
 else
  echo "1"
 fi
}

if [ "$(is_elf $HELPERSCRIPT)" = "1" ]; then
  mv "$HELPERSCRIPT" "$HELPERBIN"
fi

cat <<EOF >"$HELPERSCRIPT"
#!/usr/bin/perl
use strict;
use warnings;

if(\$< != 0) {
 die "Must run as root\n";
}

my \$reason = \$ENV{reason} || "";
if(\$reason eq "PREINIT") {
 # not lease information, so waiting for the journal will make
 # nm-dhcp-helper wait for too long, just send it once and exit so
 # dhclient will start to get a lease
 system("${HELPERBIN}");
 exit(0);
}

my \$attempts = 0;
my \$success = 0;
while(\$attempts < 10) {
 \$attempts++;
 my \$time = time();
 sleep(1);
 system("${HELPERBIN}");
 sleep(1);
 my \$leasetime = \`/bin/journalctl --since='\\@\$time' | grep NetworkManager | grep ' lease time ' | wc -l\`;
 if(\$leasetime == 1) {
  \$success = 1;
  last;
 }
 # Try again in 5 seconds
 sleep(5);
}

if(\$attempts > 1) {
 open my \$fh, ">>", "/tmp/nm-helper-retries.log" or die \$!;
 my \$date = \`/bin/date\`;
 1 while chomp \$date;
 if(\$success) {
  print \$fh "\$date: needed \$attempts attempts to update NetworkManager (\$reason).\n";
 } else {
  print \$fh "\$date: gave up after \$attempts attempts (\$reason).\n";
 }
 close \$fh;
}

exit(0);
EOF
chmod +x $HELPERSCRIPT
/usr/sbin/aa-complain /etc/apparmor.d/sbin.dhclient

Sebastien Bacher (seb128) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. The issue you are reporting is an upstream one and it would be nice if somebody having it could send the bug to the developers of the software by following the instructions at https://wiki.ubuntu.com/Bugs/Upstream/GNOME. If you have done so, please tell us the number of the upstream bug (or the link), so we can add a bugwatch that will inform us about its status. Thanks in advance.

Changed in network-manager (Ubuntu):
importance: Undecided → Low
Sjors Gielen (sgielen) wrote :

Hi Sebastien, thanks for your comment!

I've reported the bug upstream as GNOME #784636, see https://bugzilla.gnome.org/show_bug.cgi?id=784636. I already added a bugwatch.

Changed in network-manager:
importance: Unknown → Medium
status: Unknown → Confirmed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
Ray Link (rlink) wrote :

Upstream GNOME bug fixed. I've applied the fix to Xenial's NM and tested. SRU incoming.

Ray Link (rlink) wrote :

[Impact]

 * nm-dhcp-helper sometimes fails to notify NetworkManager of a DHCP
   lease renewal due to a DBus race condition.

 * Upstream NetworkManager 1.4 fixes the race condition by changing
   nm-dhcp-helper's DBus notification from signal "Event" to method
   "Notify".

 * Upstream backported NM 1.4's nm-dhcp-helper notification fix to
   NM 1.2. This SRU request applies that backported patch to Xenial's
   NM 1.2.x.

[Test Case]

 * Not reliably reproducible. Out of hudreds of machines, only a
   dozen or so fail to notify NetworkManager of a DHCP lease
   renewal about 30-50% of the time. (i.e. It's always the same
   handful of machines that fail.)

 * All such machines with the patched packages have been fine for weeks,
   over many dozens of lease renewals.

[Regression Potential]

 * Minimal. Upstream's patch falls back to the former behavior if the
   new behavior is not supported or times out.

[Other Info]

 * Upstream bug w/ patch: https://bugzilla.gnome.org/show_bug.cgi?id=784636

 * RHEL bug with links to the 1.4 commits from which the patch was
   derived: https://bugzilla.redhat.com/show_bug.cgi?id=1373276

Ray Link (rlink) wrote :

Correction/Addendum to the SRU justification above:

Fixed in current Ubuntu. Bionic uses NM 1.8.x. This bug was fixed in 1.4.

[Impact]

* This LP bug's original submitter backported the upstream 1.4 fix to 1.2, not Upstream.

[Regression Potential]

* The patch changes both nm-dhcp-helper and NetworkManager itself. As soon as the new packages are unpacked, the new nm-dhcp-helper will be used on DHCP lease renewals, with the new Notify mechanism. Since the running, old NetworkManager is still expecting Event notifications, the patched nm-dhcp-helper has fallback capability to Event.

* Once NetworkManager is restarted and is running the patched version, it will have the new Notify support.

Changed in network-manager (Ubuntu):
status: Confirmed → Fix Released
Changed in network-manager (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Low
Ray Link (rlink) on 2017-11-14
description: updated
Julian Andres Klode (juliank) wrote :

Fix uploaded.

Changed in network-manager (Ubuntu Xenial):
status: Triaged → In Progress
description: updated
Luke Faraone (lfaraone) on 2017-11-14
description: updated
Changed in network-manager (Ubuntu Xenial):
assignee: nobody → Luke Faraone (lfaraone)
assignee: Luke Faraone (lfaraone) → nobody
Sjors Gielen (sgielen) wrote :

Hi Ray/Julian,

> * NOTE: The final comment on the upstream GNOME bug claims that the fix
> is incomplete. However, it is possible that the running NetworkManager was
> not restarted (see Regression Potential notes above), which is why
> nm-dhcp-helper is falling back to Event.

This is not the case. Even today, one of the machines is showing the message from the wrapper that indicates the DHCP lease was not correctly applied according to the journal:

➜ sjors@cuba ~ cat /tmp/nm-helper-retries.log
Tue Nov 14 07:23:07 CET 2017: needed 5 attempts to update NetworkManager (RENEW).
Tue Nov 14 09:17:45 CET 2017: needed 5 attempts to update NetworkManager (RENEW).
Tue Nov 14 10:06:58 CET 2017: needed 4 attempts to update NetworkManager (RENEW).

This is even though the machine was rebooted yesterday, so the daemon was restarted:

➜ sjors@cuba ~ uptime
 23:12:16 up 1 day, 14:04, 3 users, load average: 0,59, 0,49, 0,47

And the machine is using the patched version of the network-manager:

➜ sjors@cuba ~ apt-cache policy network-manager
network-manager:
  Installed: 1.2.6-0ubuntu0.16.04.1screenpoint1
  Candidate: 1.2.6-0ubuntu0.16.04.1screenpoint1
  Version table:
 *** 1.2.6-0ubuntu0.16.04.1screenpoint1 100
        100 /var/lib/dpkg/status
     1.2.6-0ubuntu0.16.04.1 500
        500 http://nl.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages

However, I have not investigated why this happens, as the wrapper script is an acceptable work-around. I'll report back, however, whether we still have this problem with the updated Xenial packages -- there is always a chance I made an error somewhere.

Ray Link (rlink) wrote :

Sjors:

I'm interested to know how it goes for you with this update (either from the included debdiff, or from the packages when they hit -proposed.) We were seeing the exact same symptoms you were, and deep-dove into your backported fix from the GNOME bug and it all seems like it should fix the issue. We've tested extensively with your fix and the symptoms completely disappeared, even from the machines most frequently affected.

Hello Sjors, or anyone else affected,

Accepted network-manager into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/network-manager/1.2.6-0ubuntu0.16.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in network-manager (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Ray Link (rlink) wrote :

Sorry for the delay - I was away for a week. I have deployed the packages from -proposed and will confirm after keeping an eye on DHCP lease renewals for a while.

Ray Link (rlink) wrote :

Version 1.2.6-0ubuntu0.16.04.2 has been deployed here to several machines that were experiencing the bug multiple times per day. All DHCP renewal requests since deployment have resulted in renewal notifications being successfully delivered to NetworkManager.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Brian Murray (brian-murray) wrote :

There seems to be an autopkgtest failure on s390x which requires investigation.

http://autopkgtest.ubuntu.com/packages/n/network-manager/xenial/s390x

Ray Link (rlink) wrote :

The autopkgtest failure on s390x is in a test that checks rfkill integration. s390 kernels do not have CONFIG_RFKILL set. Behavior change might have something to do with the recent change to s390x autopkgtest from LXC containers to VMs?

https://lists.ubuntu.com/archives/ubuntu-devel/2017-November/040052.html

Successful s390 autpkgtests of previous versions of network-manager skipped this test (among others):

> urfkill-integration SKIP Test requires machine-level isolation but testbed does not provide that

Ray Link (rlink) wrote :

Also note that the previous (current) version of network-manager (1.2.6-0ubuntu0.16.04.1) is also failing s390x autopkgtest.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package network-manager - 1.2.6-0ubuntu0.16.04.2

---------------
network-manager (1.2.6-0ubuntu0.16.04.2) xenial; urgency=medium

  * Fix nm-dhcp-helper DBus Event/Notify race condition (LP: #1696415)

 -- Ray Link <email address hidden> Thu, 19 Oct 2017 13:53:52 -0400

Changed in network-manager (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for network-manager has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.