network-manager ADT tests fail with on ppc64el with artful/linux 4.13.0.17.18

Bug #1733321 reported by Kleber Sacilotto de Souza on 2017-11-20
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
network-manager (Ubuntu)
Medium
Unassigned
Artful
Medium
Unassigned

Bug Description

[Impact]

The killswitches-no-urfkill autopkgtest fails sometimes because nmcli reports the old state when it's called right after rfkill block/unblock. Adding a sleep before calling nmcli fixes the issue.

ppc64el ADT log from failed testcase: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-artful/artful/ppc64el/n/network-manager/20171120_100719_28642@/log.gz

Testcase output:
-----
autopkgtest [10:04:48]: test killswitches-no-urfkill: [-----------------------
make -C /lib/modules/4.13.0-17-generic/build KBUILD_SRC=/lib/modules/4.13.0-17-generic/build M=/tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests
make[1]: Entering directory '/usr/src/linux-headers-4.13.0-17-generic'
  AR /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/built-in.o
  CC [M] /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/fake-rfkill.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/fake-rfkill.mod.o
  LD [M] /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/fake-rfkill.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.13.0-17-generic'
ERROR: NM could not track device state.
autopkgtest [10:05:20]: test killswitches-no-urfkill: -----------------------]
autopkgtest [10:05:20]: test killswitches-no-urfkill: - - - - - - - - - - results - - - - - - - - - -
killswitches-no-urfkill FAIL non-zero exit status 1
-----

Package versions [artful/ppc64el]:
network-manager 1.8.4-1ubuntu3
linux-meta 4.13.0.17.18

[Test Case]

2.1. Download network-manager package source code
$ apt-get source network-manager

2.2. Run killswitches-no-urfkill testcase
$ cd network-manager-1.8.4
$ sudo ./debian/tests/killswitches-no-urfkill

[Regression Potential]

No regression potential. The fix touches only the testcase shipped with the source package and it doesn't change the binary package. The sleep time added is very small, so no possibility of causing testcase timeout.

[Other Info]

On ppc64el architecture, it was observed that a fix for systemd is also needed (see bug 1734908) for the testcase to be successful.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1733321

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: artful
Changed in linux (Ubuntu Artful):
status: New → Triaged
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Artful):
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-key

The issue seems to be related somehow to systemd's systemd-rfkill.service. When fake-rfkill is loaded, the following message are logged on syslog:

Nov 21 17:00:32 autopkgtest NetworkManager[1044]: <info> [1511283632.4948] rfkill13: found WiFi radio killswitch (at /sys/devices/virtual/rfkill/rfkill13) (driver (unknown))
Nov 21 17:00:32 autopkgtest kernel: [ 2970.858563] query test rfkill
Nov 21 17:00:32 autopkgtest kernel: [ 2970.858566] set test rfkill (active)
Nov 21 17:00:32 autopkgtest systemd[1]: Starting Load/Save RF Kill Switch Status...
Nov 21 17:00:38 autopkgtest kernel: [ 2976.544106] poll test rfkill
Nov 21 17:00:43 autopkgtest kernel: [ 2981.407971] poll test rfkill
Nov 21 17:00:48 autopkgtest kernel: [ 2986.527817] poll test rfkill
Nov 21 17:00:53 autopkgtest kernel: [ 2991.391678] poll test rfkill
Nov 21 17:00:58 autopkgtest kernel: [ 2996.511538] poll test rfkill
Nov 21 17:01:02 autopkgtest systemd[1]: systemd-rfkill.service: Start operation timed out. Terminating.
Nov 21 17:01:02 autopkgtest systemd[1]: Failed to start Load/Save RF Kill Switch Status.
Nov 21 17:01:02 autopkgtest systemd[1]: systemd-rfkill.service: Unit entered failed state.
Nov 21 17:01:02 autopkgtest systemd[1]: systemd-rfkill.service: Failed with result 'timeout'.
Nov 21 17:01:03 autopkgtest kernel: [ 3001.375390] poll test rfkill
Nov 21 17:01:08 autopkgtest kernel: [ 3006.495239] poll test rfkill
Nov 21 17:01:12 autopkgtest kernel: [ 3011.136564] query test rfkill
Nov 21 17:01:12 autopkgtest kernel: [ 3011.136566] set test rfkill (blocked)
Nov 21 17:01:12 autopkgtest systemd[1]: Starting Load/Save RF Kill Switch Status...
Nov 21 17:01:12 autopkgtest NetworkManager[1044]: <info> [1511283672.8246] manager: rfkill: WiFi now disabled by radio killswitch
Nov 21 17:01:12 autopkgtest systemd[1]: Started Load/Save RF Kill Switch Status.

And the call to 'mcli radio wifi' intermittently fails to report that the device is disabled after the call to 'rfkill block $fake_id'.

Stopping systemd-rfkill.socket and systemd-rfkill.service units seems to work around the issue since I can't reproduce the issue when they are stopped.

As an additional note, I cannot reproduce the issue on the ppc64el host, only on the qemu/kvm guest.

The issue seems to be reproducible only when the VM has a single virtual CPU. I cannot reproduce it when I assign 2 or more CPUs to the VM. That would explain why we could not reproduce the issue on the host.

This is the stack trace of the systemd-rfkill process before it gets killed by systemd after the timeout:

---------
(gdb) bt
#0 0x00007b3666783524 in __GI_ppoll (fds=0x7ffff9e8da40, nfds=1, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x00007b36664cf140 in ppoll () at /usr/include/powerpc64le-linux-gnu/bits/poll2.h:77
#2 fd_wait_for_event (fd=<optimized out>, event=<optimized out>, t=<optimized out>) at ../src/basic/io-util.c:199
#3 0x000000aaa7712b50 in wait_for_initialized (ret=<synthetic pointer>, device=0x100241d0bc0, udev=<optimized out>) at ../src/rfkill/rfkill.c:141
#4 determine_state_file (udev=<optimized out>, event=0x7ffff9e8dbb8, d=0x100241d0bc0, ret=0x7ffff9e8dba8) at ../src/rfkill/rfkill.c:173
#5 0x000000aaa77120a8 in load_state (event=0x7ffff9e8dbb8, udev=<optimized out>, rfkill_fd=<optimized out>) at ../src/rfkill/rfkill.c:220
#6 main (argc=<optimized out>, argv=<optimized out>) at ../src/rfkill/rfkill.c:404
---------

The function wait_for_initialized() uses the udev monitor to listen for events from the rfkill subsystem and stay forever on the call to fd_wait_for_event() when the fake-rfkill module is loaded. I used udevadm to monitor the udev events and the add event is generated:

---------
$ udevadm monitor -s rfkill
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

KERNEL[1741.847404] add /devices/virtual/rfkill/rfkill3 (rfkill)
UDEV [1741.855315] add /devices/virtual/rfkill/rfkill3 (rfkill)
---------

If the fake-rfkill module is removed and re-inserted before the 30s timeout, the processes receives and processes the remove and the subsequent add events.

So I suspected that, at least with fake-rfkill, systemd-rfkill is not activated fast enough to process the udev event when the rfkill device is added.

The systemd rfkill timeout issue is not specific to ppc64el, I was able to reproduce it in a VM with my local x86_64 laptop.

The following systemd upstream commit fixes the service timeout issue:

==================================================
commit 8ec1a07998758f6a85f3ea5bf2ed14d87609398f
Author: S. Fan <email address hidden>
Date: Mon Jul 31 05:10:10 2017 -0500

    rfkill: fix erroneous behavior when polling the udev monitor (#6489)

    Comparing udev_device_get_sysname(device) and sysname will always return
    true. We need to check the device received from udev monitor instead.

    Also, fd_wait_for_event() sometimes never exits. Better set a timeout
    here.
==================================================

I will check if it also fixes the network-manager testcase.

It turns out that the commit mentioned on the previous comment is necessary but not sufficient to fix the issue. The testcase also needs to be fixed.

I have opened bug 1734908 for the systemd issue.

This is a debdiff for Artful applicable to network-manager 1.8.4-1ubuntu3. This was built with pbuilder and the testcase shipped with the source packaged works as expected (on the environment described on this bug report only if the fix for bug 1734908 is also applied).

description: updated

Although we haven't seen the failure in Bionic yet, it might be a good idea to apply the fix on Bionic as well, which as of now has the same package version as Artful.

tags: removed: kernel-key
Changed in network-manager (Ubuntu):
importance: Undecided → Medium
Changed in network-manager (Ubuntu Artful):
importance: Undecided → Medium
no longer affects: linux (Ubuntu Artful)
no longer affects: linux (Ubuntu)
Simon Quigley (tsimonq2) wrote :

Artful is End Of Life, and thus, there is nothing else to sponsor; sorry.

Please resubscribe ~ubuntu-sponsors if there's something else to sponsor.

Changed in network-manager (Ubuntu Artful):
status: New → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers