canonical-certification-server crashes on resume of a session

Bug #1729866 reported by Rod Smith on 2017-11-03
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Next Generation Checkbox (CLI)
High
Jeff Lane

Bug Description

I'm running multiple 17.10 regression test runs on boldore, a Cisco UCS C240 M4 (VIC) server in 1SS, because it's showing inconsistent and flaky behavior on some CPU tests, including occasional system hangs. Unfortunately, upon resuming, I'm seeing the following:

$ canonical-certification-server
=========================[ Resume Incomplete Session ]==========================
There is 1 incomplete session that might be resumed
Do you want to resume session 'canonical-certification-server-2017-11-02T21.38.45'?
  r => resume this session
  n => next session
  c => create new session
  d => delete old sessions
[rncd]: r
WARNING:plainbox.ctrl:local script com.canonical.certification::environment returned invalid RFC822 data: Job has a duplicate key '' with old value 'DO\n' and new value 'cd\n' (line 33)
WARNING:plainbox.ctrl:Ignoring generated job with missing template parameter name
WARNING:plainbox.ctrl:Ignoring generated job with missing template parameter name
ERROR:plainbox.bug:Undeclared exception PermissionError raised from resume_session

                        Application Malfunction Detected

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/guacamole/core.py", line 248, in eat
    return self._dispatch()
  File "/usr/lib/python3/dist-packages/guacamole/core.py", line 298, in _dispatch
    result = ingredient.dispatch(self.context)
  File "/usr/lib/python3/dist-packages/guacamole/ingredients/cmdtree.py", line 152, in dispatch
    return self._dispatch(context, 0)
  File "/usr/lib/python3/dist-packages/guacamole/ingredients/cmdtree.py", line 179, in _dispatch
    return self._dispatch_None(context, level, retval, command)
  File "/usr/lib/python3/dist-packages/guacamole/ingredients/cmdtree.py", line 208, in _dispatch_None
    return self._dispatch(context, level + 1)
  File "/usr/lib/python3/dist-packages/guacamole/ingredients/cmdtree.py", line 165, in _dispatch
    retval = command.invoked(context)
  File "/usr/lib/python3/dist-packages/checkbox_ng/launcher/subcommands.py", line 214, in invoked
    if not self._maybe_resume_session():
  File "/usr/lib/python3/dist-packages/checkbox_ng/launcher/subcommands.py", line 309, in _maybe_resume_session
    return self._run_resume_ui_loop(resume_candidates)
  File "/usr/lib/python3/dist-packages/checkbox_ng/launcher/subcommands.py", line 326, in _run_resume_ui_loop
    self._resume_session(candidate)
  File "/usr/lib/python3/dist-packages/checkbox_ng/launcher/subcommands.py", line 334, in _resume_session
    metadata = self.ctx.sa.resume_session(session.id)
  File "/usr/lib/python3/dist-packages/plainbox/impl/decorators.py", line 142, in wrapper
    raise exc
  File "/usr/lib/python3/dist-packages/plainbox/impl/decorators.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/assistant.py", line 643, in resume_session
    self._restart_strategy.diffuse_application_restart(self._app_id)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/restart.py", line 122, in diffuse_application_restart
    os.remove(filename)
PermissionError: [Errno 13] Permission denied: '/home/ubuntu/.config/autostart/com.canonical:checkbox-cli.desktop'

  Please report a bug including the information from the paragraph above. To
  report the bug visit https://bugs.launchpad.net/checkbox-ng/+filebug

  We are sorry for the inconvenience!

I noticed the "PermissionError" warning, and saw that /home/ubuntu/.config was owned by root. I therefore changed the ownership of that whole tree ("sudo chown -R ubuntu: ~/.config"), and was able to continue. I'm guessing that something in Checkbox was running via sudo, created the ~/.config directory, and that in turn caused problems later when something NOT run via sudo tried to access that directory.

I'm attaching a copy of the ~/.cache/plainbox directory tree for the server in question.

Rod Smith (rodsmith) wrote :
information type: Private → Public
Jeff Lane (bladernr) wrote :

I noticed this today too. It looks like for whatever reason running lxd init in the lxd test as root is causing this. I think it's safe to drop the enhanced permissions, but I'll need to check that out more thoroughly.

Changed in checkbox-ng:
assignee: nobody → Jeff Lane (bladernr)
importance: Undecided → Critical
status: New → In Progress
Jeff Lane (bladernr) wrote :
Download full text (3.9 KiB)

OK... so the following steps occur and the results are:

lxd init --auto
** creates .config/lxd/config.yml owned by root. .config/ is still owned by ubuntu

lxc image copy ubuntu:$VERSION local: --alias $ALIAS
** downloads the image, nothing changes in .config

lxc launch $IMAGE $CONTAINERNAME
** launches container based on the image downloaded. No change to .config

lxc list
** lists the running containers. No change to .config

lxc exec perms-test dd if=/dev/urandom of=testdata.txt bs=1024 count=1000
** creates a file in the container. no change to .config

lxc image delete artful-test
** deletes the image file. No change to .config

lxc delete --force perms-test
** halts and deletes the container. No change to .config.

So now I try running the whole test, rather than the individual commands, via sudo.

ubuntu@lalande:~$ sudo /usr/lib/plainbox-provider-checkbox/bin/virtualization --debug lxd
DEBUG:root:Executing LXD Test
DEBUG:root:Attempting to initialize LXD
DEBUG:root:Command lxd init --auto:
DEBUG:root: STDERR: LXD has been successfully configured.

DEBUG:root:No local image available, attempting to import from default remote.
DEBUG:root:Command lxc image copy ubuntu:17.10 local: --alias c37ad3ae737f4759ac65c9f4e3d79fea:
DEBUG:root: STDOUT:
Copying the image: metadata: 100% (605.11MB/s)

Copying the image: rootfs: 1% (1.13MB/s)

Copying the image: rootfs: 2% (1.75MB/s)

Copying the image: rootfs: 3% (2.22MB/s)

Copying the image: rootfs: 4% (2.69MB/s)

Copying the image: rootfs: 5% (3.10MB/s)

Copying the image: rootfs: 6% (3.53MB/s)
<SNIP>
Copying the image: rootfs: 99% (22.93MB/s)

Copying the image: rootfs: 100% (23.02MB/s)

Image copied successfully!

DEBUG:root:Launching container
DEBUG:root:Command lxc launch c37ad3ae737f4759ac65c9f4e3d79fea testbed:
DEBUG:root: STDOUT: Creating testbed

Starting testbed

DEBUG:root:Container listing:
DEBUG:root:Command lxc list:
DEBUG:root: STDOUT: +---------+---------+------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------+---------+------+------+------------+-----------+
| testbed | RUNNING | | | PERSISTENT | 0 |
+---------+---------+------+------+------------+-----------+

DEBUG:root:Testing container
DEBUG:root:Command lxc exec testbed dd if=/dev/urandom of=testdata.txt bs=1024 count=1000:
DEBUG:root: STDERR: 1000+0 records in
1000+0 records out
1024000 bytes (1.0 MB, 1000 KiB) copied, 0.0336225 s, 30.5 MB/s

DEBUG:root:Cleaning up images and containers created during test
DEBUG:root:Command lxc image delete c37ad3ae737f4759ac65c9f4e3d79fea:
DEBUG:root: Command returned no output
DEBUG:root:Command lxc delete --force testbed:
DEBUG:root: Command returned no output
PASS: Container was succssfully started and checked

ubuntu@lalande:~$ ls -la
t...

Read more...

Jeff Lane (bladernr) wrote :

Next I'm running ONLY the LXD test via checkbox-ng
 ☑ : info/network_devices
 ☑ : Attach a copy of /proc/cmdline
 ☑ : Verify LXD container launches
file:///home/ubuntu/.local/share/checkbox-ng/submission_2017-11-03T18.36.16.612654.xml
file:///home/ubuntu/.local/share/checkbox-ng/submission_2017-11-03T18.36.16.612654.html
file:///home/ubuntu/.local/share/checkbox-ng/submission_2017-11-03T18.36.16.612654.junit.xml
file:///home/ubuntu/.local/share/checkbox-ng/submission_2017-11-03T18.36.16.612654.tar.xz
file:///home/ubuntu/.local/share/checkbox-ng/submission_2017-11-03T18.36.16.612654.xlsx
Do you want to submit 'upload to certification' report?
  y => yes
  n => no
[yn]: n
ubuntu@lalande:~$ ls -la
total 36
drwxr-xr-x 6 ubuntu ubuntu 4096 Nov 3 18:36 .
drwxr-xr-x 3 root root 4096 Nov 2 13:49 ..
-rw-r--r-- 1 ubuntu ubuntu 220 May 17 17:32 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 May 17 17:32 .bashrc
drwx------ 3 ubuntu ubuntu 4096 Nov 3 14:44 .cache
drwxr-x--- 3 ubuntu ubuntu 4096 Nov 3 18:35 .config
drwxrwxr-x 3 ubuntu ubuntu 4096 Nov 3 18:36 .local
-rw-r--r-- 1 ubuntu ubuntu 675 May 17 17:32 .profile
drwx------ 2 ubuntu ubuntu 4096 Nov 2 13:50 .ssh
-rw-r--r-- 1 ubuntu ubuntu 0 Nov 3 14:44 .sudo_as_admin_successful
ubuntu@lalande:~$ ls -la .config/
total 12
drwxr-x--- 3 ubuntu ubuntu 4096 Nov 3 18:35 .
drwxr-xr-x 6 ubuntu ubuntu 4096 Nov 3 18:36 ..
drwxr-x--- 2 root root 4096 Nov 3 18:35 lxc

Testing is successful, and .config is unchanged. So this isn't related to the LXC test as I first thought. Something else is occurring in a different job that's changing ownership of .config to root. :/

Changed in checkbox-ng:
importance: Critical → High
status: In Progress → Confirmed
Jeff Lane (bladernr) on 2017-11-06
Changed in checkbox-ng:
assignee: Jeff Lane (bladernr) → nobody
Jeff Lane (bladernr) wrote :

Found the root cause, finally. it IS the LXC test indirectly. .config doesn't exist by default on MAAS installs (not sure about ISO installs or desktop)

So if the LXC test is run as root, it creates .config/lxc/ owned root.root, which means .config is also owned by root. THIS is what causes the failure on rerun because the ubuntu user can't access .config owned by root.

IF .config exists BEFORE checkbox runs, then only the lxc dir created by the test is owned by root, so retests work fine.

I've tested this out and running the LXC test as the default user works find, root permissions are not needed. So I'll update this shortly with a fix.

Changed in checkbox-ng:
assignee: nobody → Jeff Lane (bladernr)
Jeff Lane (bladernr) on 2017-11-14
Changed in checkbox-ng:
status: Confirmed → Fix Committed
milestone: none → 0.36.0
tags: added: hwcert-server
Changed in checkbox-ng:
milestone: 0.36.0 → 1.0.0
Changed in checkbox-ng:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers