trusty/armhf dkms tests are killing workers

Bug #1654025 reported by Iain Lane
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Auto Package Testing
New
Undecided
Iain Lane
autopkgtest (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

We noticed earlier today that almost all armhf (lxd) workers were dead.

After a lot of poking, it seems to me that they are being killed when running dkms tests for the kernel in trusty-proposed.

When trying to download the source package, we get:

> autopkgtest [15:14:36]: ERROR: erroneous package: got 0 lines of results from extract where 4 expected: []
> autopkgtest: DBG: testbed stop
> autopkgtest: DBG: testbed close, scratch=/tmp/autopkgtest.Mco1QJ
> autopkgtest: DBG: sending command to testbed: close
> blame: nvidia-graphics-drivers-340
> badpkg: got 0 lines of results from extract where 4 expected: []

Tags: patch
Revision history for this message
Iain Lane (laney) wrote :

Not sure I'm going to get to the bottom of this today.

Here's a way to reproduce what I *think* (not sure) is the same thing on amd64. I took it from one of the tests which killed a worker. Make sure lxd is working for you.

$ autopkgtest-build-lxd images:ubuntu/trusty/amd64
$ autopkgtest --output-dir /tmp/laney/out/ --apt-pocket=proposed=src:linux-meta,src:linux --apt-upgrade nvidia-graphics-drivers-340 --debug --setup-commands 'apt-get install -y linux-image linux-headers || apt-get install -y linux-image-generic linux-headers-generic' -- lxd autopkgtest/ubuntu/trusty/amd64
[…]
> + apt-get source -q --only-source nvidia-graphics-drivers-340=340.98-0ubuntu0.14.04.1
> autopkgtest: DBG: testbed command exited with code 0
> blame: nvidia-graphics-drivers-340
> badpkg: got 0 lines of results from extract where 4 expected: [] (out: "", err: "None")

I've tried various permutations here, but I didn't pin it down yet. Somehow apt-get is exiting and terminating the whole script early with exit status 0 - the subsequent commands aren't running.

https://git.launchpad.net/~ubuntu-release/autopkgtest/+git/autopkgtest/tree/runner/autopkgtest#n389

Revision history for this message
Iain Lane (laney) wrote :

Inserted an strace into the apt-get source, not sure what's going on

Revision history for this message
Iain Lane (laney) wrote :

s/trusty/zesty/ works, FWIW

Revision history for this message
Iain Lane (laney) wrote :

'lxc exec'ing the same command seems to always work, as does running bash and then poking things. I didn't yet try to do every single thing that autopkgtest does manually.

I did notice that rebooting took a little while to happen when I ran 'reboot' from within a container, so a theory to investigate is whether this is actually a delayed previous reboot request taking down the script. nvidia is quite big so it might be acting as a synchronisation point making the apt-get source line of investigation a red herring.

Revision history for this message
Barry Warsaw (barry) wrote : Re: [Bug 1654025] Re: trusty/armhf dkms tests are killing workers

On Jan 04, 2017, at 10:28 PM, Iain Lane wrote:

>I did notice that rebooting took a little while to happen when I ran
>'reboot' from within a container, so a theory to investigate is whether
>this is actually a delayed previous reboot request taking down the
>script. nvidia is quite big so it might be acting as a synchronisation
>point making the apt-get source line of investigation a red herring.

Note that in autopkgtests you are *not* supposed to call reboot directly, but
instead /tmp/autopkgtest-reboot because otherwise no state is preserved
between reboots. You also need to set the needs-reboot restriction, and be
careful about which testbeds you use.

See README.package-tests.rst for an extensive discussion about tests which
need to reboot.

Revision history for this message
Iain Lane (laney) wrote :

It's autopkgtest itself rebooting after installing a new kernel. I was just theorising that maybe this isn't properly waiting for the reboot to finish.

Revision history for this message
Iain Lane (laney) wrote :

Haven't yet been dissuaded from that line of thinking:

autopkgtest [11:18:29]: rebooting testbed after setup commands that affected boot
autopkgtest: DBG: sending command to testbed: reboot
autopkgtest-virt-lxd: DBG: executing reboot
[…]
autopkgtest-virt-lxd: DBG: hook_wait_reboot: waiting for container to shut down...
autopkgtest-virt-lxd: DBG: execute-timeout: lxc exec autopkgtest-lxd-mirgzk sleep 3600
autopkgtest-virt-lxd: DBG: hook_wait_reboot: container restarted, waiting for boot to finish
autopkgtest-virt-lxd: DBG: execute-timeout: lxc exec autopkgtest-lxd-mirgzk runlevel
autopkgtest-virt-lxd: DBG: execute-timeout: lxc exec autopkgtest-lxd-mirgzk -- uptime -s
autopkgtest-virt-lxd: DBG: wait_booted: container up since 2017-01-05 11:15:59

I'd expect that last timestamp to be >= 11:18:29

Revision history for this message
Iain Lane (laney) wrote :

Inserted a time.sleep(30) in autopkgtest-virt-lxd/hook_wait_reboot, and now the test runs. So now I just have to find out why we don't wait for reboot properly.

Revision history for this message
Iain Lane (laney) wrote :

I fixed ("fixed"?) this by removing the use of 'lxc exec ... -- sleep 3600' as a synchronisation method in favour of checking /proc/uptime directly - when that goes backwards, we've rebooted.

I don't know why that didn't work - I didn't reproduce the misbehavour when using lxd directly myself outside of autopkgtest. I also don't know why it seemed happened on trusty only. Possibly a slightly slower/faster 'reboot'?

Can't link the branch for some reason, but https://git.launchpad.net/~laney/autopkgtest/+git/autopkgtest/commit/?id=ed8f754790e3a03cee467d3985680f93ac0e2c63 - trying to find out the best way to get that reviewed/landed in autopkgtest upstream.

Changed in auto-package-testing:
assignee: nobody → Iain Lane (laney)
Revision history for this message
Iain Lane (laney) wrote :

I cowboyed this in to test it live, and requeued the tests.

Revision history for this message
Iain Lane (laney) wrote :

Still waiting for review (comment #9) if you have time

Revision history for this message
Martin Pitt (pitti) wrote :

Hello Iain,

your patch had a few tests/pycodestyle errors, and I squashed them together. However, when testing this I sometimes run into a timeout:

$ runner/autopkgtest tests/testpkg-reboot -- lxd autopkgtest/ubuntu/zesty/amd64
autopkgtest [10:43:57]: git checkout: 77cd1c3 doc/README.package-tests.rst: Document network access
autopkgtest [10:43:57]: host donald; command line: runner/autopkgtest tests/testpkg-reboot -- lxd autopkgtest/ubuntu/zesty/amd64
autopkgtest [10:44:10]: testbed dpkg architecture: amd64
autopkgtest [10:44:12]: testbed running kernel: Linux 4.8.0-29-generic #31~lp1626436ProposedWithTwoPatches SMP Fri Nov 18 18:21:38 UTC
autopkgtest [10:44:12]: @@@@@@@@@@@@@@@@@@@@ built-tree tests/testpkg-reboot
autopkgtest [10:44:12]: test p: preparing testbed
autopkgtest [10:44:14]: test p: [-----------------------
bash: line 1: 441 Killed /tmp/autopkgtest.Hoe2Rq/tree/debian/tests/p 2> >(tee -a /tmp/autopkgtest.Hoe2Rq/p-stderr >&2) > >(tee -a /tmp/autopkgtest.Hoe2Rq/p-stdout)
autopkgtest [10:44:15]: test process requested reboot with marker phase1
error: Container is not running.
<VirtSubproc>: failure: timed out waiting for container autopkgtest-lxd-ydlkyp to restart
autopkgtest [10:46:06]: ERROR: testbed failure: cannot send to testbed: ['BrokenPipeError: [Errno 32] Broken pipe\n']

so this doesn't seem to be very reliable.

How does it fare on the production infra?

Martin Pitt (pitti)
Changed in autopkgtest (Ubuntu):
status: New → In Progress
tags: added: patch
Revision history for this message
Iain Lane (laney) wrote :

Hmm, yeah, for me it failed 4 times out of 100 runs. That's better than the previous state which failed every time I tried.

Thanks for the pycodestyle - I didn't run that one.

Ah, when it fails we see

  autopkgtest-virt-lxd: DBG: hook_wait_reboot: container now up for 4.0 - has not rebooted (initial uptime 10.0)

I think that's a lexicographic comparison issue. We should return a float from get_uptime().

Force pushed to the same place. I'm currently running the testsuite with AUTOPKGTEST_TEST_LXD set, will comment again if there's any regression.

Revision history for this message
Martin Pitt (pitti) wrote :

string vs. float comparison makes sense indeed. It could also be that by testpkg-reboot/'s nature, rebooting is pretty much the first thing it does, so it could happen that the uptime before reboot is even smaller than the first time we check after reboot -- but that's a bit academic in practice.

This works fairly well now, so I pushed this with a small addition to get_uptime() to redirect stderr to /dev/null. Otherwise you get some potentially misleading "error: Container is not running." messages. Thanks!

Revision history for this message
Martin Pitt (pitti) wrote :
Changed in autopkgtest (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Iain Lane (laney) wrote :

Thanks pitti!

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package autopkgtest - 5.0

---------------
autopkgtest (5.0) unstable; urgency=medium

  [ Martin Pitt ]
  * Drop obsolete adt-* CLI
    autopkgtest 4.0 with its "autopkgtest" program has been around for a
    year, Debian's and Ubuntu's CI moved to that, and the Debian 9.0 release
    is behind us. Drop "adt-run" and the corresponding adt-virt-* and
    adt-build-* command line interfaces now.
    Add Breaks: for debci for versions that still relied on adt-*.
  * lxc, qemu, ssh: Shorten path of shared directory.
    This helps to reduce socket path lengths in tests.
    (See Debian #860554, LP: #1680577)
  * Drop obsolete ubuntu-touch-session setup script.
    This hasn't been maintained/tested for ages, and with the demise of
    Ubuntu Touch this is now completely obsolete. (Closes: #835836)
  * Avoid bare "except:" clauses
  * tests/autopkgtest_args: pyflakes/pycodestyle fixes
  * tests: Don't run LXC tests with --ephemeral.
    It's not the default and ephemeral containers tend to fail to start too
    often.
  * tests/testpkg: Update Standards-Version and debhelper level.
    Priority "extra" is deprecated, so move to "optional".
  * README.package-tests.rst: Clarify role of control file, add simple
    examples (Closes: #870654)
  * README.package-tests.rst: Update Testsuite: documentation.
    All supported Debian releases now have a new enough dpkg, and Ubuntu
    14.04 will not get new tests any more. (Closes: #876008)
  * Run apt-ftparchive for local debs on testbed instead of host.
    This is slightly more correct as it will produce the format that the
    testbed's apt expects (although that doesn't matter much). It also
    allows autopkgtest to be run on non-Debian hosts.
    As apt-utils is not essential, install it if necessary.
  * README.package-tests.rst: Clarify Testsuite: values
    `autopkgtest` is not any more the only defined value for Testsuite:, as
    the paragraph below already explains.
  * setup-commands/setup-testbed: Look at /etc/os-release for determining
    distribution name. Stop hardcoding "debian|ubuntu|kali" and get this
    information from os-release's ID field. (Closes: #868678)
  * debian/rules: Drop obsolete pysupport override.
    Not necessary any more since dh compat level 9.
  * Bump Standards-Version to 4.1.0. (No changes necessary.)

  [ Stéphane Graber ]
  * debian/tests/lxd: Update to support LXD network/storage API

  [ Steve Langasek ]
  * Fix release name parsing with arch-qualified sources.list
    (Closes: #870393)

  [ Antonio Terceiro ]
  * autopkgtest-build-lxc: Correctly clean cache for Debian containers

 -- Martin Pitt <email address hidden> Sun, 17 Sep 2017 16:50:21 +0200

Changed in autopkgtest (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.