boot-and-services tests fails in impish on armhf (248.3)

Bug #1931088 reported by Christian Ehrhardt 
28
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Auto Package Testing
Invalid
Undecided
Unassigned
dnsmasq (Ubuntu)
Invalid
Undecided
Unassigned
Impish
Invalid
Undecided
Unassigned
gdm3 (Ubuntu)
Invalid
Undecided
Unassigned
Impish
Invalid
Undecided
Unassigned
glibc (Ubuntu)
Invalid
Undecided
Unassigned
Impish
Invalid
Undecided
Unassigned
isc-dhcp (Ubuntu)
Invalid
Undecided
Unassigned
Impish
Invalid
Undecided
Unassigned
lxd (Ubuntu)
Fix Released
Undecided
Lukas Märdian
Impish
Fix Released
Undecided
Lukas Märdian
systemd (Ubuntu)
Invalid
Undecided
Unassigned
Impish
Invalid
Undecided
Unassigned

Bug Description

Systemd 248.3-1ubuntu1 is rather new, but had 5 successful tests on armhf
before now slipping into a bad mode.

Now it seems all tests failed in boot-and-services by hanging until killed by
VirtSubproc.Timeout of autokgtest.

The last [1] test log has a bit more, it shows a python stack overflow
```
   VirtSubproc.Timeout
   Fatal Python error: Cannot recover from stack overflow.
   Python runtime state: initialized

   Current thread 0x00007f108e840740 (most recent call first):
     File "/home/ubuntu/autopkgtest/lib/adtlog.py", line 36 in log
     File "/home/ubuntu/autopkgtest/lib/adtlog.py", line 86 in debug
     File "/home/ubuntu/autopkgtest/lib/adt_testbed.py", line 472 in send
     File "/home/ubuntu/autopkgtest/lib/adt_testbed.py", line 521 in command
```

I have seen a bunch of packages including even gdm3 and glibc being blocked by
that so I wanted to at least track down the issue until we can put it on
someones task list to resolve.
No one replied to my pings yet, but maybe that means someone is already
debugging this and had enabled some debugging?

By running the same in armhf container on arm64 VM on canonistack I've seen
no issues. The test worked fine and had no hang/issues.

root@systemd-test-fail:~/systemd-248.3# ./debian/tests/boot-and-services
lxc
1
test_profile (__main__.AppArmorTest)
AppArmor confined unit ... skipped 'fails on armhf testbeds, see LP: #1842352'
test_help (__main__.CLITest)
--help works and succeeds ... ok
test_invalid_option (__main__.CLITest)
Calling with invalid option fails ... ok
test_version (__main__.CLITest)
--version works and succeeds ... ok
test_cpushares (__main__.CgroupsTest)
service with CPUShares ... ok
test_simple (__main__.CgroupsTest)
simple service ... ok
test_bash_crash (__main__.CoredumpTest) ... skipped 'systemd-coredump does not work in containers'
test_log_for_service (__main__.JournalTest) ... ok
test_no_options (__main__.JournalTest) ... ok
test_boot (__main__.NspawnTest) ... skipped 'nspawn does not work in most containers'
test_service (__main__.NspawnTest) ... skipped 'nspawn does not work in most containers'
test_failing (__main__.SeccompTest) ... ok
test_0_init (__main__.ServicesTest)
Verify that init is systemd ... ok
test_cron (__main__.ServicesTest) ... ok
test_dbus (__main__.ServicesTest) ... ok
test_gdm3 (__main__.ServicesTest) ... skipped 'gdm3 not found'
test_logind (__main__.ServicesTest) ... ok
test_network_manager (__main__.ServicesTest) ... ok
test_no_failed (__main__.ServicesTest)
No failed units ... ok
test_rsyslog (__main__.ServicesTest) ... ok
test_tmp_cleanup (__main__.ServicesTest) ... ok
test_tmp_mount (__main__.ServicesTest) ... ok
test_udev (__main__.ServicesTest) ... skipped 'udev does not work in containers'
----------------------------------------------------------------------
Ran 23 tests in 5.589s
OK (skipped=6)

So I can not reproduce this on canonistack, but it blocks autopkgtests of
various packages pretty reproducibly :-/

[1]: https://autopkgtest.ubuntu.com/results/autopkgtest-impish/impish/armhf/s/systemd/20210604_081326_d4319@/log.gz

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I was adding a few of the blocked packages as incomplete tasks to have this bug update-excuse show up in excuses. Right now until we know better I'd consider this a systemd issue.

tags: added: update-excuse
Changed in dnsmasq (Ubuntu):
status: New → Incomplete
Changed in gdm3 (Ubuntu):
status: New → Incomplete
Changed in glibc (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As I said I tried to recreate this, but it worked.
It was fine under Focal/5.4.0-53-generic Host with the Impish-armhf container.
Upgrading the host to impish it Impish/5.11.0-16-generic still works fine.

It seems it only fails in autopkgtest infrastructure, not sure why yet :-/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also blocking isc-dhcp now, added a task

Changed in isc-dhcp (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Kelsey Steele (kelsey-steele) wrote :
Revision history for this message
Balint Reczey (rbalint) wrote :

Marking auto package testing as affected based on @paelzer's observation that it occurs only on Ubuntu's CI infra. I've also triggered the test with hello to see current status and for confirmation.

Balint Reczey (rbalint)
Changed in systemd (Ubuntu):
status: Confirmed → Incomplete
tags: added: fr-1435
Revision history for this message
Iain Lane (laney) wrote :

Just to put the information on the bug, I reproduced this failure on a Raspberry Pi, not a piece of the Scalingstack infrastructure in sight ;-)

As far as I can tell it looked as if it's timing out waiting for network-online.target to become active. It seems as if installing network-manager makes the target take way longer to activate, so long that it is over autopkgtest's timeout. I suppose this isn't expected, so maybe the next step is to look into why that happens?

Revision history for this message
Iain Lane (laney) wrote (last edit ):

I have added a badtest hint for this version, so that the many packages blocked by this can become unblocked. But could somebody please take a look? The hint is only for this version so the issue will come back with the next systemd upload unless fixed.

(I'll reset systemd to "New" so that it appears on the rls-incoming page)

tags: added: rls-ii-incoming
Changed in auto-package-testing:
status: New → Incomplete
Changed in systemd (Ubuntu):
status: Incomplete → New
tags: removed: rls-ii-incoming
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Kelsey Steele (kelsey-steele) wrote :

Seeing this failure on Focal now too.

Also seeing a similar problem on Bionic armhf which I'm wondering if it's related to this failure. Bug 1932368

Difference is the test failing on Bionic is boot-smoke

Revision history for this message
Lukas Märdian (slyon) wrote :

Looks like I can at least reproduce the issue on an arm64 Canonistack machine, utilizing an armhf container:

ubuntu@focal-arm64:~$ time autopkgtest systemd --test-name=boot-and-services -U -- lxd autopkgtest/ubuntu/impish/armhf
[...]
autopkgtest [15:59:21]: test boot-and-services: [-----------------------
lxc
Created symlink /etc/systemd/system/default.target → /lib/systemd/system/graphical.target.
bash: line 1: 3532 Killed /tmp/autopkgtest.OAYOpR/build.yrL/src/debian/tests/boot-and-services 2> >(tee -a /tmp/autopkgtest.OAYOpR/boot-and-services-stderr >&2) > >(tee -a /tmp/autopkgtest.OAYOpR/boot-and-services-stdout)
autopkgtest [15:59:24]: test process requested reboot with marker boot1
Unexpected error:
Traceback (most recent call last):
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 738, in mainloop
    command()
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 667, in command
    r = f(c, ce)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 363, in cmd_reboot
    caller.hook_wait_reboot()
  File "/usr/bin/autopkgtest-virt-lxd", line 231, in hook_wait_reboot
    wait_booted()
  File "/usr/bin/autopkgtest-virt-lxd", line 104, in wait_booted
    VirtSubproc.check_exec(['lxc', 'exec', container_name, '--', 'sh', '-ec', '[ ! -d /run/systemd/system ] || systemctl start network-online.target'], timeout=60)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 182, in check_exec
    (status, out, err) = execute_timeout(None, timeout, real_argv,
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 144, in execute_timeout
    (out, err) = sp.communicate(instr)
  File "/usr/lib/python3.8/subprocess.py", line 1014, in communicate
    stderr = self.stderr.read()
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 64, in alarm_handler
    raise Timeout()
VirtSubproc.Timeout
autopkgtest [16:01:18]: ERROR: testbed failure: unexpected eof from the testbed

real 8m29.354s
user 0m8.866s
sys 0m8.141s

Let's see what we can do about it. It looks like a dnsmasq (security-)update was pushed to Bionic++ on 2021-06-02 – about the same time that this test started failing on Bionic++ ... I wonder if this might somehow be related? ... need to further investigate.

Revision history for this message
Lukas Märdian (slyon) wrote (last edit ):

So I can reproduce the problem even locally, using a simple:
`autopkgtest systemd --test-name=boot-and-services -- lxd autopkgtest/ubuntu/impish/amd64`

When logging into the (stalled) system, while it is waiting for its timeout I can see that eth0 gets configured, but then put in DOWN state. Therefore, systemd-networkd-wait-online.service (and thus network-wait-online.target) never get ready and the test times out waiting on "network-wait-online.target".

A simple "networkctl up eth0" (inside the LXD container) during that 60s "timeout-wait" makes the test runner connect an finish the test successfully.

The culprit for bringing down the interface seems to be "/run/systemd/system/network-device-down.service", but I have no idea where this files comes from or what it is used for. It's not part of any package.

That service has a "Before=NetworkManager.service" dependency, that's why it only breaks tests making use of network-manager (boot-and-services & boot-smoke).

The LXD/LXC stack seems to be somehow involved in creating that file: https://github.com/lxc/lxc-ci/pull/266

Revision history for this message
Lukas Märdian (slyon) wrote :
Lukas Märdian (slyon)
Changed in systemd (Ubuntu Impish):
status: Confirmed → Invalid
Changed in lxd (Ubuntu Impish):
status: New → Fix Committed
assignee: nobody → Lukas Märdian (slyon)
Changed in auto-package-testing:
status: Incomplete → Invalid
Changed in dnsmasq (Ubuntu Impish):
status: Incomplete → Invalid
Changed in gdm3 (Ubuntu Impish):
status: Incomplete → Invalid
Changed in glibc (Ubuntu Impish):
status: Incomplete → Invalid
Changed in isc-dhcp (Ubuntu Impish):
status: Incomplete → Invalid
Revision history for this message
Dan Streetman (ddstreet) wrote :

> This should fix the problem: https://github.com/lxc/distrobuilder/pull/476

wow...all those 'fixes' lxd is adding look like things will continue to break in bad ways, like this bug. ouch.

Lukas Märdian (slyon)
Changed in lxd (Ubuntu Impish):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.