network-manager ADT tests fail with on ppc64el with artful/linux 4.13.0.17.18

Bug #1733321 reported by Kleber Sacilotto de Souza
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
network-manager (Ubuntu)
Medium
Seyeong Kim
Artful
Medium
Unassigned
Bionic
Medium
Seyeong Kim
Disco
Medium
Unassigned
Eoan
Medium
Seyeong Kim
Focal
Medium
Seyeong Kim

Bug Description

[Impact]

The killswitches-no-urfkill autopkgtest fails sometimes because nmcli reports the old state when it's called right after rfkill block/unblock.

ppc64el ADT log from failed testcase: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-artful/artful/ppc64el/n/network-manager/20171120_100719_28642@/log.gz

Testcase output:
-----
autopkgtest [10:04:48]: test killswitches-no-urfkill: [-----------------------
make -C /lib/modules/4.13.0-17-generic/build KBUILD_SRC=/lib/modules/4.13.0-17-generic/build M=/tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests
make[1]: Entering directory '/usr/src/linux-headers-4.13.0-17-generic'
  AR /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/built-in.o
  CC [M] /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/fake-rfkill.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/fake-rfkill.mod.o
  LD [M] /tmp/autopkgtest.yE1hsA/build.62e/src/debian/tests/fake-rfkill.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.13.0-17-generic'
ERROR: NM could not track device state.
autopkgtest [10:05:20]: test killswitches-no-urfkill: -----------------------]
autopkgtest [10:05:20]: test killswitches-no-urfkill: - - - - - - - - - - results - - - - - - - - - -
killswitches-no-urfkill FAIL non-zero exit status 1
-----

Package versions [artful/ppc64el]:
network-manager 1.8.4-1ubuntu3
linux-meta 4.13.0.17.18

[Test Case]

Assume that test vm and host are ppc64

1. deploy ppc64 vm instance ( with 1 cpu )
2. modprobe mac80211_hwsim ( may need to install linux-modules-extra- pkg )
3. apt install network-manager rfkill
4. modify /etc/netplan/[conf], renderer as NetworkManager
5. netplan apply
6. run below command
- nmcli radio wifi ; rfkill list 0 ; rfkill block 0 ; rfkill list 0 ; nmcli radio wifi ; rfkill list 0 ; rfkill unblock 0 ; rfkill list 0 ; nmcli radio wifi

enabled
0: fake: Wireless LAN
Soft blocked: no
Hard blocked: no
0: fake: Wireless LAN
Soft blocked: yes
Hard blocked: no
enabled
0: fake: Wireless LAN
Soft blocked: yes
Hard blocked: no
0: fake: Wireless LAN
Soft blocked: no
Hard blocked: no
enabled

second 'enabled' should be 'disabled' but not updated properly.

Adding "udevadm settle" in test file ( killswitches-no-urkfill ) between
rfkill block/unblock and nmcli radio wifi will help updating status changes after rfkill block/unblock.

[Regression Potential]

This fixes testcase only, so any regression would cause incorrect test pass/fail, and might cause other missed bugs.

[scope]

this is needed for all releases. Debian does not include this autopkgtest, and so does not need this fix.

[Other Info]

this is caused by the 'systemd-rfkill.socket' listening to rfkill, and starting up 'systemd-rfkill.service' for any change. On a 1-cpu system (which all autopkgtest instances for network-manager are), this service startup sometimes blocks the uevent from reaching network-manager before the autopkgtest proceeds to call nmcli to check the rfkill status. This causes the test case to fail.

There are (at least) 2 options to fix this:
1) stop/disable the 'systemd-rfkill.socket' at the start of the autopkgtest
2) call udevadm settle between the rfkill block/unblock and the nmcli call to check status

This does not need to be fixed outside the test case, as normal nmcli use should be not done by users in a script immediately after changing rfkill state, nor is there anything that either rfkill or nmcli could even do to address this (technically, nmcli could internally issue a 'udevadm settle' every time it's called, but that seems paranoid and extreme).

[original description]

On ppc64el architecture, it was observed that a fix for systemd is also needed (see bug 1734908) for the testcase to be successful.

# original test case

2.1. Download network-manager package source code
$ apt-get source network-manager

2.2. Run killswitches-no-urfkill testcase
$ cd network-manager-1.8.4
$ sudo ./debian/tests/killswitches-no-urfkill

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1733321

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: artful
Changed in linux (Ubuntu Artful):
status: New → Triaged
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Artful):
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-key
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

The issue seems to be related somehow to systemd's systemd-rfkill.service. When fake-rfkill is loaded, the following message are logged on syslog:

Nov 21 17:00:32 autopkgtest NetworkManager[1044]: <info> [1511283632.4948] rfkill13: found WiFi radio killswitch (at /sys/devices/virtual/rfkill/rfkill13) (driver (unknown))
Nov 21 17:00:32 autopkgtest kernel: [ 2970.858563] query test rfkill
Nov 21 17:00:32 autopkgtest kernel: [ 2970.858566] set test rfkill (active)
Nov 21 17:00:32 autopkgtest systemd[1]: Starting Load/Save RF Kill Switch Status...
Nov 21 17:00:38 autopkgtest kernel: [ 2976.544106] poll test rfkill
Nov 21 17:00:43 autopkgtest kernel: [ 2981.407971] poll test rfkill
Nov 21 17:00:48 autopkgtest kernel: [ 2986.527817] poll test rfkill
Nov 21 17:00:53 autopkgtest kernel: [ 2991.391678] poll test rfkill
Nov 21 17:00:58 autopkgtest kernel: [ 2996.511538] poll test rfkill
Nov 21 17:01:02 autopkgtest systemd[1]: systemd-rfkill.service: Start operation timed out. Terminating.
Nov 21 17:01:02 autopkgtest systemd[1]: Failed to start Load/Save RF Kill Switch Status.
Nov 21 17:01:02 autopkgtest systemd[1]: systemd-rfkill.service: Unit entered failed state.
Nov 21 17:01:02 autopkgtest systemd[1]: systemd-rfkill.service: Failed with result 'timeout'.
Nov 21 17:01:03 autopkgtest kernel: [ 3001.375390] poll test rfkill
Nov 21 17:01:08 autopkgtest kernel: [ 3006.495239] poll test rfkill
Nov 21 17:01:12 autopkgtest kernel: [ 3011.136564] query test rfkill
Nov 21 17:01:12 autopkgtest kernel: [ 3011.136566] set test rfkill (blocked)
Nov 21 17:01:12 autopkgtest systemd[1]: Starting Load/Save RF Kill Switch Status...
Nov 21 17:01:12 autopkgtest NetworkManager[1044]: <info> [1511283672.8246] manager: rfkill: WiFi now disabled by radio killswitch
Nov 21 17:01:12 autopkgtest systemd[1]: Started Load/Save RF Kill Switch Status.

And the call to 'mcli radio wifi' intermittently fails to report that the device is disabled after the call to 'rfkill block $fake_id'.

Stopping systemd-rfkill.socket and systemd-rfkill.service units seems to work around the issue since I can't reproduce the issue when they are stopped.

As an additional note, I cannot reproduce the issue on the ppc64el host, only on the qemu/kvm guest.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

The issue seems to be reproducible only when the VM has a single virtual CPU. I cannot reproduce it when I assign 2 or more CPUs to the VM. That would explain why we could not reproduce the issue on the host.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This is the stack trace of the systemd-rfkill process before it gets killed by systemd after the timeout:

---------
(gdb) bt
#0 0x00007b3666783524 in __GI_ppoll (fds=0x7ffff9e8da40, nfds=1, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x00007b36664cf140 in ppoll () at /usr/include/powerpc64le-linux-gnu/bits/poll2.h:77
#2 fd_wait_for_event (fd=<optimized out>, event=<optimized out>, t=<optimized out>) at ../src/basic/io-util.c:199
#3 0x000000aaa7712b50 in wait_for_initialized (ret=<synthetic pointer>, device=0x100241d0bc0, udev=<optimized out>) at ../src/rfkill/rfkill.c:141
#4 determine_state_file (udev=<optimized out>, event=0x7ffff9e8dbb8, d=0x100241d0bc0, ret=0x7ffff9e8dba8) at ../src/rfkill/rfkill.c:173
#5 0x000000aaa77120a8 in load_state (event=0x7ffff9e8dbb8, udev=<optimized out>, rfkill_fd=<optimized out>) at ../src/rfkill/rfkill.c:220
#6 main (argc=<optimized out>, argv=<optimized out>) at ../src/rfkill/rfkill.c:404
---------

The function wait_for_initialized() uses the udev monitor to listen for events from the rfkill subsystem and stay forever on the call to fd_wait_for_event() when the fake-rfkill module is loaded. I used udevadm to monitor the udev events and the add event is generated:

---------
$ udevadm monitor -s rfkill
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

KERNEL[1741.847404] add /devices/virtual/rfkill/rfkill3 (rfkill)
UDEV [1741.855315] add /devices/virtual/rfkill/rfkill3 (rfkill)
---------

If the fake-rfkill module is removed and re-inserted before the 30s timeout, the processes receives and processes the remove and the subsequent add events.

So I suspected that, at least with fake-rfkill, systemd-rfkill is not activated fast enough to process the udev event when the rfkill device is added.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

The systemd rfkill timeout issue is not specific to ppc64el, I was able to reproduce it in a VM with my local x86_64 laptop.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

The following systemd upstream commit fixes the service timeout issue:

==================================================
commit 8ec1a07998758f6a85f3ea5bf2ed14d87609398f
Author: S. Fan <email address hidden>
Date: Mon Jul 31 05:10:10 2017 -0500

    rfkill: fix erroneous behavior when polling the udev monitor (#6489)

    Comparing udev_device_get_sysname(device) and sysname will always return
    true. We need to check the device received from udev monitor instead.

    Also, fd_wait_for_event() sometimes never exits. Better set a timeout
    here.
==================================================

I will check if it also fixes the network-manager testcase.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

It turns out that the commit mentioned on the previous comment is necessary but not sufficient to fix the issue. The testcase also needs to be fixed.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

I have opened bug 1734908 for the systemd issue.

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This is a debdiff for Artful applicable to network-manager 1.8.4-1ubuntu3. This was built with pbuilder and the testcase shipped with the source packaged works as expected (on the environment described on this bug report only if the fix for bug 1734908 is also applied).

description: updated
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Although we haven't seen the failure in Bionic yet, it might be a good idea to apply the fix on Bionic as well, which as of now has the same package version as Artful.

tags: removed: kernel-key
Mathew Hodson (mhodson)
Changed in network-manager (Ubuntu):
importance: Undecided → Medium
Changed in network-manager (Ubuntu Artful):
importance: Undecided → Medium
no longer affects: linux (Ubuntu Artful)
no longer affects: linux (Ubuntu)
Revision history for this message
Simon Quigley (tsimonq2) wrote :

Artful is End Of Life, and thus, there is nothing else to sponsor; sorry.

Please resubscribe ~ubuntu-sponsors if there's something else to sponsor.

Changed in network-manager (Ubuntu Artful):
status: New → Won't Fix
Revision history for this message
Dan Streetman (ddstreet) wrote :
Changed in network-manager (Ubuntu Eoan):
importance: Undecided → Medium
Changed in network-manager (Ubuntu Disco):
importance: Undecided → Medium
Mathew Hodson (mhodson)
tags: added: testcase
Changed in network-manager (Ubuntu Disco):
status: New → Won't Fix
Revision history for this message
Dan Streetman (ddstreet) wrote :

@dgadomski if you have time, could you check this and upload a fixed package?

Since this is autopkgtest test failure only, it will probably need block-proposed-eoan tag. I'm not sure if this is needed for bionic.

Dan Streetman (ddstreet)
tags: added: sts-sponsor-volunteer
Changed in network-manager (Ubuntu Eoan):
assignee: nobody → Dariusz Gadomski (dgadomski)
Changed in network-manager (Ubuntu Focal):
assignee: nobody → Dariusz Gadomski (dgadomski)
Changed in network-manager (Ubuntu Eoan):
assignee: Dariusz Gadomski (dgadomski) → nobody
Changed in network-manager (Ubuntu Focal):
assignee: Dariusz Gadomski (dgadomski) → nobody
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Hello, I uploaded pkg to ppa https://launchpad.net/~xtrusia/+archive/ubuntu/lp1733321

and checked ppc64el build and testing.

It worked fine without any issue.

Could you please adjust Kleber's debdiff as well?

Thanks a lot.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Thanks! Could you check the rfkill code, also? I think instead of patching the testcase, it would be better fix rfkill so the change is effective as soon as it returns; that way users won't experience this and have to add their own sleep after calling rfkill.

Changed in network-manager (Ubuntu Eoan):
status: New → Won't Fix
tags: removed: artful
Revision history for this message
Sebastien Bacher (seb128) wrote :

Dan, Seyeong, is one of you commited to work on a proper fix there? Can you take assignement of the bug? It would be nice to see the problem resolved and without a slap hacking if possible :-)

Revision history for this message
Dan Streetman (ddstreet) wrote :

Yep, we will get this fixed - I'll assign Seyeong work with him on it.

Changed in network-manager (Ubuntu Focal):
assignee: nobody → Seyeong Kim (xtrusia)
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
description: updated
Dan Streetman (ddstreet)
description: updated
Dan Streetman (ddstreet)
tags: added: block-proposed-bionic block-proposed-eoan block-proposed-focal
Changed in network-manager (Ubuntu Eoan):
status: Won't Fix → In Progress
Changed in network-manager (Ubuntu Focal):
status: New → In Progress
Changed in network-manager (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Seyeong Kim (xtrusia)
Changed in network-manager (Ubuntu Eoan):
assignee: nobody → Seyeong Kim (xtrusia)
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Kleber, or anyone else affected,

Accepted network-manager into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/network-manager/1.22.10-1ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in network-manager (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Changed in network-manager (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed-eoan
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Kleber, or anyone else affected,

Accepted network-manager into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/network-manager/1.20.4-2ubuntu2.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Kleber, or anyone else affected,

Accepted network-manager into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/network-manager/1.10.6-2ubuntu1.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in network-manager (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Mathew Hodson (mhodson)
Changed in network-manager (Ubuntu Bionic):
importance: Undecided → Medium
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (network-manager/1.10.6-2ubuntu1.5)

All autopkgtests for the newly accepted network-manager (1.10.6-2ubuntu1.5) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/237-3ubuntu10.39 (i386)
netplan.io/0.98-0ubuntu1~18.04.1 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#network-manager

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (network-manager/1.22.10-1ubuntu2)

All autopkgtests for the newly accepted network-manager (1.22.10-1ubuntu2) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/245.4-4ubuntu3 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#network-manager

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (network-manager/1.20.4-2ubuntu2.3)

All autopkgtests for the newly accepted network-manager (1.20.4-2ubuntu2.3) for eoan have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/242-7ubuntu3.7 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/eoan/update_excuses.html#network-manager

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package network-manager - 1.22.10-1ubuntu3

---------------
network-manager (1.22.10-1ubuntu3) groovy; urgency=medium

  * No-change rebuild to bump Groovy version number.

network-manager (1.22.10-1ubuntu2) focal; urgency=medium

  * d/t/killswitches-no-urfkill
    - Call udevadm settle before nmcli radio wifi
      To make sure that change is updated. (LP: #1733321)

 -- Dan Streetman <email address hidden> Thu, 30 Apr 2020 07:47:49 -0400

Changed in network-manager (Ubuntu):
status: New → Fix Released
Revision history for this message
Dan Streetman (ddstreet) wrote :
tags: added: verification-done verification-done-bionic verification-done-eoan verification-done-focal
removed: sts-sponsor-volunteer verification-needed verification-needed-bionic verification-needed-eoan verification-needed-focal
tags: removed: block-proposed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package network-manager - 1.20.4-2ubuntu2.3

---------------
network-manager (1.20.4-2ubuntu2.3) eoan; urgency=medium

  * d/t/killswitches-no-urfkill
    - Call udevadm settle before nmcli radio wifi
      To make sure that change is updated. (LP: #1733321)

 -- Seyeong Kim <email address hidden> Mon, 20 Apr 2020 22:40:44 -0700

Changed in network-manager (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for network-manager has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Mathew Hodson (mhodson)
tags: removed: block-proposed-eoan
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers