autopkgtest times out (or fails before that) in hirsute on test infra, passes locally

Bug #1915126 reported by Christian Ehrhardt 
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Auto Package Testing
Incomplete
Undecided
Unassigned
systemd (Ubuntu)
Fix Released
High
Unassigned

Bug Description

Hi,
I've asked yesterday on IRC but so far got no answer. I think it is right to file a bug about the current state of systemd autopkgtest to unite the efforts in regard to it.

I was looking at the systemd tests for a no-change rebuild that really had no reason to now make it fail. While checking I found that as of this month (first bad test on 1st of February) most of the systemd test runs on amd64 will not pass/fail but instead in most cases time out.

I've seen various people retry the case as it shows the typical symptoms of a "not the fault of the package, let us retry this" case. But it seems that won't help as the test history is rather clear.

* Watch this in monospace to make more sense of it *

$ check-autopkgtest-stats.sh -c 50 -p systemd -r "hirsute" -a "amd64"
Check last 50 test results for src:systemd on releases 'hirsute' on architectures 'amd64'
Of the 50 last tests, we had these subtest failing per release/arch:

hirsute
  amd64
    tests-in-lxd (F 2% f 0% S 0% B 12% => P 52%/) .BTTTBTT.BTTTTTBTTTBBTTT..F......................
    hostnamed (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    build-login (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    unit-config (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    networkd-testpy (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    localed-locale (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    boot-and-services (F 4% f 0% S 0% B 12% => P 50%/) .BTTTBTT.BTTTTTBTTTBBTTT...............F.....F...
    timedated (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    boot-smoke (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    logind (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    storage (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    upstream (F 12% f 0% S 0% B 12% => P 42%/) .BTTTBTT.BTTTTTBTTTBBTTTF.....F.F..F........F..F.
    udev (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    systemd-fsckd (F 8% f 0% S 0% B 12% => P 46%/) FBTTTBTTFBTTTTTBTTTBBTTT.............F......F....
    root-unittests (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................
    localed-x11-keymap (F 0% f 0% S 0% B 12% => P 54%/) .BTTTBTT.BTTTTTBTTTBBTTT.........................

We see that formerly we had the known to be somewhat flaky, but otherwise working test (time goes from right to left). So we had the usual suspects of tests like upstream/systemd-fsck that failed a few times, but also working runs in between.

But since February 22/24 runs failed very bad.
6 (=B) of those cases are aborting mid execution
  autopkgtest [22:08:58]: ERROR: testbed failure: testbed auxverb failed with exit code 255
And 16 of them timed out
  <VirtSubproc>: failure: Timed out on waiting for ssh connection

Sadly the 2/24 that didn't hard fail by that where broken by the known flaky systemd-fsckd test.

Something drives this test crazy that we have to find and resolve, at the current rate nothing depending on it is likely to migrate.

Related branches

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

In our worker config we see https://git.launchpad.net/autopkgtest-cloud:

long_tests timeout => ~11h
big_pkg timeout => ~5.5h
default timeout => ~3h

Currently systemd is only marked as long_test on ppc64el, so ~3h is what applies to us here.

The systemd runs that failed with timeouts are all in that range of ~200 minutes which could be the timeout plus some overhead. One important question is if it just grew a bit to now hit this or if there is one (or a few) sub-tests that now totally hang/explode.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I've compared older good and newer runs that timeout.

  $ for i in good.hirsute-amd64.*; do printf "\n\n$i\n"; grep '^autopkgtest .* test .*: - - - - - - - - - - results' $i; done

We see that good cases need ~110-130 minutes to complete "test boot-smoke" and then another 4-12 minutes to complete "test systemd-fsckd".

But the bad cases often take ~210 +/- 10 minutes to reach "test boot-smoke" completion.

But OTOH there are also a few cases needing only the old ~2h to get there (bad.hirsute-amd64.23 / bad.hirsute-amd64.24) - those still hang and eventually fail.

With some ugly awk magic I derived those times for the steps (attached here)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

On average the upstream tests loose ~55 minutes compared to former runs (~90 -> ~145 minutes).
Also that test times out:
 autopkgtest [20:14:12]: test upstream: -----------------------]
 upstream FAIL timed out

Also the later test for systemd-fsckd hangs until timeout:
 You are in emergency mode. After logging in, type "journalctl -xb" to view
 system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
 to boot into default mode.
 Press Enter for maintenance
 (or press Control-D to continue):

While the former could be tackled with marking things it long-test (seems to take longer but does not hang) the latter issue appears to be a full stop and that will consume whatever timeout we set.

Since that triggers for all kind of things in -proposed now and blocks many packages I'll set it to high, but otherwise leave it for your triage as there might already have been efforts taken ...

Changed in systemd (Ubuntu):
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From the log:
402209 Checking in progress on 1 disk (92.7% complete)fsckd:1:92.8:Checking in progress on 1 disk (92.8% complete)
402210 Checking in progress on 1 disk (92.8% complete)fsckd:1:92.9:Checking in progress on 1 disk (92.9% complete)
402211 Checking in progress on 1 disk (92.9% complete)[[0;1;31mFAILED[0m] Failed to start [0;1;39mFile Syste���ck on /dev/disk/by-label/UEFI[0m.
402212 See 'systemctl status "systemd-fsck@dev���y\\x2dlabel-UEFI.service"' for details.
402213 [[0;1;38;5;185mDEPEND[0m] Dependency failed for [0;1;39m/boot/efi[0m.
402214 [[0;1;38;5;185mDEPEND[0m] Dependency failed for [0;1;39mLocal File Systems[0m.
402215 [[0;32m OK [0m] Stopped [0;1;39mForward Password R���uests to Wall Directory Watch[0m.

But on a try to locally recreate via
  $ sudo ~/work/autopkgtest/autopkgtest/runner/autopkgtest --no-built-binaries \
    --apt-upgrade --testname=systemd-fsckd --shell systemd_247.1-4ubuntu1.dsc \
    -- qemu --ram-size=1536 --cpus 1 ~/work/autopkgtest-hirsute-amd64.img
It works just fine, so this is one of the lovely "only fails on infrastructure" cases

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Running the same on a Canonistack instance creates "different" issues and won't help much debugging it. Locally it works fine even on multiple re-runs.

@rbalint / others that feel systemd'ish: What shall we do about this short term to resolve all blocking packages?

Balint Reczey (rbalint)
summary: - autopkgtest times out (or fails before that) in hirsute
+ autopkgtest times out (or fails before that) in hirsute on test infra,
+ passes locally
Revision history for this message
Balint Reczey (rbalint) wrote :

@paelzer I seems infra issues got worse, but there is a pending review to bump systemd to big which I linked. Even in the absence of the infra issues it would have made sense to make system run big test instances, but I expect the big instances would also improve the current situation.

Instead of hinting system I'd like to give a try to running it on big instances to see it it makes any difference.

Revision history for this message
Balint Reczey (rbalint) wrote :

There is not much to do about the failures on systemd's side. The tests are reliably passing in qemu on my i5-4670 testbed from 2013, also passing on @paelzer's presumably newer machine.

This is an infra issue. The Release Team can decide if they want to hint the test and lose gating coverage, but I'm not filing the hint because IMO the test infra should be fixed.

Changed in systemd (Ubuntu):
status: New → Invalid
Revision history for this message
Iain Lane (laney) wrote : Re: [Bug 1915126] Re: autopkgtest times out (or fails before that) in hirsute on test infra, passes locally

On Thu, Feb 11, 2021 at 01:49:45PM -0000, Balint Reczey wrote:
> There is not much to do about the failures on systemd's side. The tests
> are reliably passing in qemu on my i5-4670 testbed from 2013, also
> passing on @paelzer's presumably newer machine.
>
> This is an infra issue. The Release Team can decide if they want to hint
> the test and lose gating coverage, but I'm not filing the hint because
> IMO the test infra should be fixed.

Is there chance we can be a little bit more constructive? We're not
warring factions here.

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Balint Reczey (rbalint)
tags: added: update-excuse
Revision history for this message
Iain Lane (laney) wrote :

I'm running a test on Canonistack. I'm hoping that this will reproduce the failures we're seeing, and then they can be investigated.

That is closer to what is running in production than the QEMU runner.

I don't think we have evidence that this is an 'infra issue', so I'm marking the auto-package-testing task as incomplete.

Will follow back up probably tomorrow once that test is finished.

Changed in auto-package-testing:
status: New → Incomplete
Revision history for this message
Iain Lane (laney) wrote :

Ok, here you go. I reproduced the (or a) failure in Canonistack. This was the first run FWIW, so it wasn't too hard to get it to fail:

autopkgtest [14:05:12]: test systemd-fsckd: [-----------------------
test_fsck_failure (__main__.FsckdTest)
Ensure that a failing fsck doesn't prevent fsckd to stop ... bash: line 1: 1250 Killed /tmp/autopkgtest.5SlVJo/build.mmt/src/debian/tests/systemd-fsckd 2> >(tee -a /tmp/autopkgtest.5SlVJo/systemd-fsckd-stderr >&2) > >(tee -a /tmp/autopkgtest.5SlVJo/systemd-fsckd-stdout)
autopkgtest [14:05:16]: test process requested reboot with marker test_fsck_failure
autopkgtest-virt-ssh: WARNING: ssh connection failed. Retrying in 3 seconds...
test_fsck_failure (__main__.FsckdTest)
Ensure that a failing fsck doesn't prevent fsckd to stop ... FAIL

======================================================================
FAIL: test_fsck_failure (__main__.FsckdTest)
Ensure that a failing fsck doesn't prevent fsckd to stop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.5SlVJo/build.mmt/src/debian/tests/systemd-fsckd", line 190, in test_fsck_failure
    self.check_plymouth_start()
  File "/tmp/autopkgtest.5SlVJo/build.mmt/src/debian/tests/systemd-fsckd", line 158, in check_plymouth_start
    self.assertUnitStarted(unit)
  File "/tmp/autopkgtest.5SlVJo/build.mmt/src/debian/tests/systemd-fsckd", line 228, in assertUnitStarted
    self.assertTrue(has_unit_started(unit))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 11.289s

FAILED (failures=1)
autopkgtest [14:06:11]: test systemd-fsckd: -----------------------]
autopkgtest [14:06:12]: test systemd-fsckd: - - - - - - - - - - results - - - - - - - - - -
systemd-fsckd FAIL non-zero exit status 1
autopkgtest [14:06:13]: @@@@@@@@@@@@@@@@@@@@ summary
timedated PASS
hostnamed PASS
localed-locale PASS
localed-x11-keymap PASS
logind PASS
unit-config PASS
storage PASS
networkd-test.py PASS
build-login PASS
boot-and-services PASS
udev PASS
root-unittests PASS
tests-in-lxd PASS
upstream PASS
boot-smoke PASS
systemd-fsckd FAIL non-zero exit status 1

Full logs attached. The commandline is at the top of the log - is that enough for you to be able to have a go, Balint? I suggest adding --shell-fail of course so you can break in once it happens.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Laney for chiming in here (as one of the affected/blocked I appreciate every help).
I was last week already trying that myself on canonistack and for me it didn't reproduce.

This is the command I used:
autopkgtest --no-built-binaries --apt-upgrade --testname=systemd-fsckd --setup-commands setup-testbed --shell systemd_247.3-1ubuntu2.dsc -- ssh -s nova -- --flavor m1.small --image ubuntu/ubuntu-hirsute-daily-amd64-server-20201126-disk1.img --keyname paelzer_canonistack-bos01

I see in your log that you have run all tests, so I'm now running this:
autopkgtest --no-built-binaries --apt-upgrade --setup-commands setup-testbed --shell-fail systemd_247.3-1ubuntu2.dsc -- ssh -s nova -- --flavor m1.small --image ubuntu/ubuntu-hirsute-daily-amd64-server-20201126-disk1.img --keyname paelzer_canonistack-bos01

But I wanted to ask what else might be differing between our two repro-cases?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also @rbalint - it seems that the success rate with 247.3-1ubuntu2 got a bit better.
Were there changes in the upload or was it actually converted to be a big_test already ?

Revision history for this message
Balint Reczey (rbalint) wrote :

@paelzer I haven't made changes related to the failure rate. I guess the infra got lighter load for some time.

Revision history for this message
Balint Reczey (rbalint) wrote :
Download full text (4.1 KiB)

@laney Thank you for looking into the problem.

I will mark systemd-fsckd as flaky and look into the root cause later. It used to be flaky before as well.

I have checked the last 10 failed runs from http://autopkgtest.ubuntu.com/packages/systemd/hirsute/amd64 , with the last one being from 2021-02-24 09:15:57 UTC:

$ cat logs
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210224_091557_d5731@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210224_044536_8d27f@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210224_021229_4c845@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210224_013702_2056d@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210223_232058_6bbff@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210223_212809_c332e@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210223_145406_d6565@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210223_140557_825b0@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210223_130532_340ec@/log.gz
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/amd64/s/systemd/20210223_115952_4304b@/log.gz

$ wget -q -i logs
$ for i in log.gz*; do zcat $i | tail -n 40 | grep -A 16 "@@@ summary" | grep -v PASS ; done
autopkgtest [09:15:42]: @@@@@@@@@@@@@@@@@@@@ summary
upstream FAIL timed out
systemd-fsckd SKIP exit status 77 and marked as skippable
autopkgtest [04:45:21]: @@@@@@@@@@@@@@@@@@@@ summary
boot-and-services FAIL non-zero exit status 1
systemd-fsckd SKIP exit status 77 and marked as skippable
autopkgtest [02:00:12]: @@@@@@@@@@@@@@@@@@@@ summary
upstream FAIL timed out
systemd-fsckd SKIP exit status 77 and marked as skippable
autopkgtest [01:23:55]: @@@@@@@@@@@@@@@@@@@@ summary
boot-and-services FAIL non-zero exit status 1
upstream FAIL timed out
systemd-fsckd SKIP exit status 77 and marked as skippable
autopkgtest [23:20:34]: @@@@@@@@@@@@@@@@@@@@ summary
boot-and-services FAIL non-zero exit status 1
upstream FAIL non-zero exit status 1
systemd-fsckd SKIP exit status 77 and marked as skippable
autopkgtest [21:19:41]: @@@@@@@@@@@@@@@@@@@@ summary
boot-and-services FAIL non-zero exit status 1
upstream FAIL timed out
systemd-fsckd SKIP exit status 77 and marked as skippable
autopkgtest [14:53:35...

Read more...

Revision history for this message
Iain Lane (laney) wrote :

Sorry for not replying to comment #11, although we chatted on IRC.

The main difference is the passing of the test name I think, but the full autopkgtest command that I used is at the top of the log. I hope using this exact command and adding --shell-fail is enough for anyone to reproduce.

Revision history for this message
Mathew Hodson (mhodson) wrote :

This bug was fixed in the package systemd - 247.3-1ubuntu4

---------------
systemd (247.3-1ubuntu4) hirsute; urgency=medium

  [ Dimitri John Ledkov ]
  * d/p/debian/UBUNTU-resolved-Mitigate-DVE-2018-0001-by-retrying-NXDOMAIN-with.patch:
    Patch updated to reduce log level to debug
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=299002546ec2d62e7f0dd7d614ba958fc9df83c2

  [ Dan Streetman ]
  * d/p/lp1906331-sd-event-ref-event-loop-while-in-sd_event_prepare-ot.patch:
    Take event reference while processing (LP: #1906331)
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=1bc38abcd3b62d317fcb62b72e26d9cb2e35ccf9
  * d/p/lp1917458-udev-rules-add-rule-to-create-dev-ptp_hyperv.patch:
    Create symlink for hyperv-provided ptp device (LP: #1917458)
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=8f1ee790ad66395457ca64cb5f8a01fdd8aabe47

  [ Balint Reczey ]
  * Pick proposed patch for not returning early in udevadm (LP: #1914062)
    File: debian/patches/lp1914062-udevadm-don-t-return-early.patch
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=d8c80751a97b0c6c4df972f6f8325293aa1607c4
  * debian/tests/control: Mark systemd-fsckd flaky again.
    As promised in LP: 1915126, until further investigation.
    File: debian/tests/control
    https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/commit/?id=68fbaab272af81aab29497f7c6a3e4e6e9aa091b

 -- Balint Reczey <email address hidden> Thu, 04 Mar 2021 12:19:05 +0100

Changed in systemd (Ubuntu):
status: Invalid → Fix Released
Revision history for this message
Balint Reczey (rbalint) wrote :

I've split the upstream test in 247.3-5ubuntu1 and also forwarded the change:
https://salsa.debian.org/systemd-team/systemd/-/merge_requests/128

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers