autopkgtest lxd provider tests fail for 2.0

Bug #1571082 reported by Martin Packman
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Critical
Unassigned
juju-core (Ubuntu)
Fix Released
Critical
Unassigned

Bug Description

The new autopkgtests for the lxd provider are still not passing yet. The bridge setup looks like it works, but when juju comes to try and connect to lxd it is rejected on an http url.

<http://autopkgtest.ubuntu.com/packages/j/juju-core/>

<https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial/xenial/amd64/j/juju-core/20160415_081159@/log.gz>

+ juju bootstrap my-controller lxd --upload-tools --debug --config default-series=xenial
2016-04-15 07:59:35 INFO juju.cmd supercommand.go:60 running juju [2.0-beta4 gc go1.6.1]
2016-04-15 07:59:35 INFO cmd cmd.go:141 cloud "lxd" not found, trying as a provider name
2016-04-15 07:59:35 INFO cmd cmd.go:141 no credentials found, checking environment
2016-04-15 07:59:35 DEBUG juju.cmd.juju.commands bootstrap.go:365 preparing controller with config: map[controller-uuid:644108a3-a3fc-4769-8e4f-c236232d4764 default-series:xenial type:lxd name:admin uuid:644108a3-a3fc-4769-8e4f-c236232d4764]
2016-04-15 07:59:39 DEBUG juju.tools.lxdclient client.go:73 connecting to LXD remote "local": ""
2016-04-15 07:59:46 DEBUG juju.tools.lxdclient client.go:73 connecting to LXD remote "juju-remote": "10.0.8.1"
2016-04-15 07:59:47 ERROR cmd supercommand.go:448 Get https://10.0.8.1:8443/1.0/profiles: Forbidden
adt-run [07:59:48]: test current-lxd-provider: -----------------------]
adt-run [07:59:48]: test current-lxd-provider: - - - - - - - - - - results - - - - - - - - - -
current-lxd-provider FAIL non-zero exit status 1

This is particularly confusing, as the exact same test passes if run on a canonistack machine using the ssh runner:

<http://paste.ubuntu.com/15852422/>

Test scripts are in the debian/tests directory of this packaging branch:

<https://code.launchpad.net/~juju-qa/ubuntu/xenial/juju/xenial-2.0-beta4>

From discussion on irc earlier, the only way Tycho thought we could hit the Forbidden here:

[16:18] <tych0> mgz: so juju configures LXD to listen over HTTPS
[16:18] <tych0> and if it accesses LXD via https, it needs to also configure LXD to recognize juju's cert
[16:18] <tych0> it strikes me that that cert configuration has failed
[16:19] <tych0> and that's why you get a 403
[16:19] <tych0> i'm not sure why you'd get it for any other reason

Martin Packman (gz)
Changed in juju-core:
importance: Undecided → Critical
milestone: none → 2.0-rc1
status: New → Triaged
tags: added: lxd-provider packaging
Revision history for this message
Steve Langasek (vorlon) wrote :
Download full text (3.3 KiB)

With less than a week to go to 16.04 release date, the status is:
 - juju-core-1 is in xenial
 - juju-core 1.25 is also in xenial
 - juju-core 2 is blocked in xenial-proposed because of test failures.

This situation is clearly not releasable; we need to get juju-core 2 into xenial for release. This is also blocking several other related packages from migrating to xenial for release (closure, bigdata, openstack).

I have attempted to get a clean baseline locally with the autopkgtests, using the adt-virt-ssh backend (to a xenial amd64 VM) to avoid any problems with nested lxd or otherwise. The current-lxd-provider test failed for me as follows:

2016-04-16 21:51:33 DEBUG juju.provider.lxd environ_broker.go:126 LXD requires https://, using: https://cloud-images.ubuntu.com/releases/
2016-04-16 21:51:33 DEBUG juju.tools.lxdclient client.go:73 connecting to LXD remote "default cloud images": "https://streams.canonical.com/juju/images/releases/"
2016-04-16 21:51:33 INFO juju.tools.lxdclient client_image.go:129 no image for ubuntu-xenial found in https://streams.canonical.com/juju/images/releases/
2016-04-16 21:51:33 DEBUG juju.tools.lxdclient client.go:73 connecting to LXD remote "default ubuntu cloud images": "https://cloud-images.ubuntu.com/releases/"
2016-04-16 21:51:34 INFO juju.tools.lxdclient client_image.go:135 found image from https://cloud-images.ubuntu.com/releases/ for xenial = 6cb0ba80a5fe32357568a473cbaf69f14d26da0ba6b08f5b1bcde7053fc73757
2016-04-16 21:51:34 INFO juju.tools.lxdclient client_image.go:147 copying image for ubuntu-xenial from https://cloud-images.ubuntu.com/releases/: 1%
[...]
copying image for ubuntu-xenial from https://cloud-images.ubuntu.com/releases/: 2016-04-16 21:51:37 INFO juju.tools.lxdclient client_image.go:147 copying image for ubuntu-xenial from https://cloud-images.ubuntu.com/releases/: 100%
copying image for ubuntu-xenial from https://cloud-images.ubuntu.com/releases/: 2016-04-16 21:51:38 DEBUG juju.tools.lxdclient client_image.go:139 dropped 0 progress messages
2016-04-16 21:51:38 DEBUG juju.service discovery.go:59 discovered init system "systemd" from series "xenial"
2016-04-16 21:51:38 DEBUG juju.provider.lxd environ_broker.go:223 LXD user data; 1342 bytes
2016-04-16 21:51:38 INFO juju.provider.lxd environ_broker.go:201 starting instance "juju-b43656cd-a9d3-4c92-80ac-599997219ad7-machine-0" (image "ubuntu-xenial")...
2016-04-16 21:52:04 INFO juju.provider.common destroy.go:22 destroying model "admin"
2016-04-16 21:52:04 INFO juju.provider.common destroy.go:33 destroying instances
2016-04-16 21:52:04 INFO juju.provider.common destroy.go:53 destroying storage
2016-04-16 21:52:04 ERROR cmd supercommand.go:448 failed to bootstrap model: cannot start bootstrap instance: exit status 2
adt-run [14:52:05]: test current-lxd-provider: -----------------------]
adt-run [14:52:06]: test current-lxd-provider: - - - - - - - - - - results - - - - - - - - - -
current-lxd-provider FAIL non-zero exit status 1

If I could get this test to pass locally, I would go ahead with letting juju-core 2.0 in as-is and trust that the test incompatibilities with the autopkgtest infrastructure would be sor...

Read more...

Changed in juju-core (Ubuntu):
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Martin Packman (gz) wrote :

I have some changes from Friday to address these test failures and the armhf one, that I've now committed and put in the packaging branch:

<lp:~juju-qa/ubuntu/xenial/juju/xenial-2.0-beta4>
<http://bazaar.launchpad.net/~juju-qa/ubuntu/xenial/juju/xenial-2.0-beta4/revision/108>

As the lxd changes are mostly about making the test cleaner and adding some extra debug information, and not actually addressing the failure seen, I was hesitant to push for a new upload just to get more results.

However, they may well be useful for showing how your test failed, as it's distinctly unclear why from the pasted log.

I agree that migrating from proposed without understanding this better is needless risk, even with the slightly confused state of the archive given the adoption of the new 1.25 packaging.

Revision history for this message
Martin Pitt (pitti) wrote :

Not sure if this is the same issue, but I get a hang locally with "adt-run --apt-pocket=proposed -U juju-core --- qemu adt-xenial-amd64-cloud.img". The last thing that I see is:

adt-run [11:33:28]: test current-lxd-provider: [-----------------------
[...]
2016-04-18 09:34:36 INFO juju.tools.lxdclient client_image.go:147 copying image for ubuntu-xenial from https://cloud-images.ubuntu.com/releases/: 100%
[...]
2016-04-18 09:35:14 INFO cmd cmd.go:129 Installing Juju agent on bootstrap instance
2016-04-18 09:35:15 INFO cmd cmd.go:129 Preparing for Juju GUI 2.1.2 release installation
2016-04-18 09:35:15 DEBUG juju.cloudconfig.instancecfg instancecfg.go:566 Setting numa ctl preference to false
Waiting for address
Attempting to connect to 10.0.8.201:22
2016-04-18 09:35:16 DEBUG juju.utils.ssh ssh.go:249 using OpenSSH ssh client
sudo: unable to resolve host juju-a82c73b5-c2b8-4d49-8312-e6cc336c3ff3-machine-0
Logging to /var/log/cloud-init-output.log on remote host
Running apt-get update
Running apt-get upgrade

When I ssh into the QEMU instance, install lxd-client, and enter the container, there is no apt-get running any more, but /var/log/cloud-init-output.log shows an error:

Setting up open-iscsi (2.0.873+git0.3b4b4500-14ubuntu3) ...
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
Job for iscsid.service failed because a configured resource limit was exceeded. See "systemctl status iscsid.service" and "journalctl -xe" for details.
invoke-rc.d: initscript iscsid, action "restart" failed.
dpkg: error processing package open-iscsi (--configure):
 subprocess installed post-installation script returned error exit status 1
Processing triggers for initramfs-tools (0.122ubuntu8) ...
Errors were encountered while processing:
 open-iscsi

Apparently open-iscsi does not like being in an lxd container:

$ lxc launch images:ubuntu/xenial/amd64 x1
$ lxc exec x1 -- apt-get install -y open-iscsi
Setting up open-iscsi (2.0.873+git0.3b4b4500-14ubuntu3) ...
Job for open-iscsi.service failed because the control process exited with error code. See "systemctl status open-iscsi.service" and "journalctl -xe" for details.

Curtis Hovey (sinzui)
tags: added: jujuqa
Revision history for this message
Tycho Andersen (tycho-s) wrote :

Hi Steve,

On the off chance you still have the machine you used as the backend for your adt run laying around, can you post the contents of /var/log/lxd somewhere? It looks to me like it never actually hit LXD, but let's just check to be sure.

Revision history for this message
Steve Langasek (vorlon) wrote :

Tycho,

lxd logs attached.

Revision history for this message
Tycho Andersen (tycho-s) wrote : Re: [Bug 1571082] Re: autopkgtest lxd provider tests fail for 2.0

Thanks, Steve.

On Mon, Apr 18, 2016 at 10:59:37PM -0000, Steve Langasek wrote:
> Tycho,
>
> lxd logs attached.
>
> ** Attachment added: "1571082-lxd-logs.tar.xz"
> https://bugs.launchpad.net/ubuntu/+source/juju-core/+bug/1571082/+attachment/4639541/+files/1571082-lxd-logs.tar.xz

Unfortunately, there's not a lot to go on here. lxc.log is basically empty, but
doesn't log any failures. lxd's log does have:

t=2016-04-16T21:51:34+0000 lvl=info msg=handling method=POST url=/1.0/images ip=10.0.8.1:38476
t=2016-04-16T21:51:34+0000 lvl=info msg=handling url=/1.0/operations/7d063725-8b02-4b53-9315-b1a2ee4b3a30/wait ip=10.0.8.1:38476 method=GET
t=2016-04-16T21:51:34+0000 lvl=info msg=handling method=GET url="/1.0/events?type=operation" ip=10.0.8.1:38482
t=2016-04-16T21:51:34+0000 lvl=info msg="Image not in the db, downloading it" image=6cb0ba80a5fe32357568a473cbaf69f14d26da0ba6b08f5b1bcde7053fc73757 server=https://cloud-images.ubuntu.com/releases/
t=2016-04-16T21:51:34+0000 lvl=info msg="Downloading the image" image=6cb0ba80a5fe32357568a473cbaf69f14d26da0ba6b08f5b1bcde7053fc73757
t=2016-04-16T21:51:38+0000 lvl=info msg=handling method=DELETE url=/1.0/images/aliases/ubuntu-xenial ip=10.0.8.1:38476
t=2016-04-16T21:51:38+0000 lvl=info msg=handling ip=10.0.8.1:38476 method=POST url=/1.0/images/aliases
t=2016-04-16T21:51:38+0000 lvl=info msg=handling method=GET url=/1.0 ip=10.0.8.1:38476
t=2016-04-16T21:51:38+0000 lvl=info msg=handling method=GET url=/1.0/images/aliases/ubuntu-xenial ip=10.0.8.1:38476
t=2016-04-16T21:51:38+0000 lvl=info msg=handling method=GET url=/1.0/images/6cb0ba80a5fe32357568a473cbaf69f14d26da0ba6b08f5b1bcde7053fc73757 ip=10.0.8.1:38476
t=2016-04-16T21:51:38+0000 lvl=info msg=handling method=POST url=/1.0/containers ip=10.0.8.1:38476
t=2016-04-16T21:51:38+0000 lvl=info msg=handling method=GET url=/1.0/operations/957e84a5-538e-4ef6-a8ce-20e113d04cce/wait ip=10.0.8.1:38476
t=2016-04-16T21:52:04+0000 lvl=info msg=handling method=GET url="/1.0/containers?recursion=1" ip=10.0.8.1:38476
t=2016-04-16T21:52:29+0000 lvl=info msg="Received 'terminated signal', exiting."
t=2016-04-16T21:52:29+0000 lvl=info msg="Stopping REST API handler:"
t=2016-04-16T21:52:29+0000 lvl=info msg=" - closing socket" socket=[::]:8443
t=2016-04-16T21:52:29+0000 lvl=info msg=" - skipping socket-activated socket" socket=/var/lib/lxd/unix.socket

so it looks like it at least downloaded the image and then started the
contianer. About 40 seconds after the container starts LXD gets restarted
(which won't stop the container or anything), but it does make you wonder why
it got restarted. If juju was connected at the time of the restart waiting for
something, I can see why this would cause it to fail. Unfortunately, I don't
think juju logs persist after a failed bootstrap.

Was there something else running on the box that might cause LXD to restart
while juju was bootstrapping?

Revision history for this message
Steve Langasek (vorlon) wrote :

On Mon, Apr 18, 2016 at 11:16:44PM -0000, Tycho Andersen wrote:
> Was there something else running on the box that might cause LXD to restart
> while juju was bootstrapping?

Nope. Dedicated canonistack instance.

The autopkgtest log output also shows this:

adt-run [16:32:51]: test current-lxd-provider: [-----------------------
+ sh debian/tests/setup-lxd.sh
+ debconf-communicate
0 value set
0 value set
0 value set
0 value set
0 value set
0 value set
0 value set
0 value set
0 value set
0 value set
0 value set
+ rm -rf /etc/default/lxd-bridge
+ dpkg-reconfigure lxd --frontend=noninteractive
Warning: Stopping lxd.service, but it can still be activated by:
  lxd.socket
+ systemctl start lxd-bridge

So it looks like the test itself restarts lxd. But I should hope that
'systemctl start lxd-bridge' is synchronous and doesn't return until lxd is
up and running?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So to be more confusing I ran the tests from https://code.launchpad.net/~vorlon/ubuntu/xenial/juju/test-improvements (which is just a single simple rev ahead of Martin's branch) on ec2 using adt-virt-null and locally using adt-virt-qemu and got different results:

$ adt-run ../*2.0~beta4-0ubuntu1*.deb --built-tree . --- qemu ~/adt-xenial-amd64-cloud.img
...
client PASS
current-lxd-provider PASS
future-lxd-provider PASS
current-manual-provider FAIL non-zero exit status 1
future-manual-provider FAIL non-zero exit status 1

Full results here http://paste.ubuntu.com/15923842/

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh the manual provider tests are broken by the change in Steve's branch to pass --system to adduser, resulting in the jujutest user having /bin/false for a shell.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

To be certain, I fixed that problem (pushed my branch to https://code.launchpad.net/~mwhudson/ubuntu/xenial/juju/mwhudson) and ran the test suite again and it passed. So I don't feel like I've necessarily been very productive today, but the tests do actually pass, so maybe things can proceed? The difference is between the branch I tested and what's in the archive doesn't seem that great, but perhaps it's worth uploading again and trying again? (Or maybe the real change has been in dependencies).

Revision history for this message
Martin Pitt (pitti) wrote :

I tried to run just the tests from https://code.launchpad.net/~mwhudson/ubuntu/xenial/juju/mwhudson on the production infra, and they still fail with

2016-04-19 06:14:49 ERROR cmd supercommand.go:448 Get https://10.0.8.1:8443/1.0/profiles: Forbidden

I'll re-run this with building the source package first and using those binaries, maybe you changed the actual code too and not just the tests (I don't know how many commits were done on top of the xenial-proposed package).

Revision history for this message
Martin Pitt (pitti) wrote :

This is the full log including a package build from https://code.launchpad.net/~mwhudson/ubuntu/xenial/juju/mwhudson r111. It fails on the same issue as above:

2016-04-19 06:39:08 ERROR cmd supercommand.go:448 Get https://10.0.8.1:8443/1.0/profiles: Forbidden

Revision history for this message
Martin Pitt (pitti) wrote :

Michael and I debugged this life in an affected instance, and we found the two problems:

 * The LXD bridge needs to get added to no_proxy in /etc/environment or at least in the env where the test is run. Something like appending ",10.0.8.*" (this syntax needs to be tested, as $no_proxy is seriously underdocumented)

 * The ethernet interfaces in Scalingstack instances have an MTU of 1400, which breaks networking in containers. I recently added this gem to my armhf setup script (https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/tree/tools/armf-lxd-slave.userdata#n98):

     lxc profile device remove default eth0
     lxc profile device add default eth0 nic nictype=bridged parent=lxdbr0 mtu=1400

with these two changes we get very far, final tests pending.

Revision history for this message
Martin Pitt (pitti) wrote :

I filed bug 1572026 to track the issue about the lxdbr0 MTU.

Revision history for this message
Martin Pitt (pitti) wrote :

We figured out the patches for this, tests are now passing in production, and Michael will upload a new package momentarily.

Changed in juju-core (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package juju-core - 2.0~beta4-0ubuntu2

---------------
juju-core (2.0~beta4-0ubuntu2) xenial; urgency=medium

  * Update debian/tests/ to work in the production network environment
    (LP: #1571082).

 -- Michael Hudson-Doyle <email address hidden> Tue, 19 Apr 2016 21:20:58 +1200

Changed in juju-core (Ubuntu):
status: Fix Committed → Fix Released
Martin Pitt (pitti)
Changed in juju-core:
status: Triaged → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

I added an infrastructure-level workaround for bug 1572026, so you can drop this workaround from the juju-core test again:

-- debian/tests/setup-lxd.sh 2016-04-19 09:21:40 +0000
+++ debian/tests/setup-lxd.sh 2016-04-19 13:33:51 +0000
@@ -23,9 +23,5 @@

 dpkg-reconfigure lxd --frontend=noninteractive

-# work around LP: #1572026 -- the MTU is lower than the default 1500 in scalingstack
-lxc profile device remove default eth0
-lxc profile device add default eth0 nic nictype=bridged parent=lxdbr0 mtu=1400
-
 # Must run a command for systemd socket activation to start the service
 lxc finger

I successfully ran the juju test with this applied (i. e. with the test workaround removed) on the production infrastructure.

affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta5 → none
milestone: none → 2.0-beta5
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.