juju2 beta11: LXD containers always pending on ppc64el systems

Bug #1605714 reported by Larry Michel
This bug affects 1 person
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  Critical    Alexis Bruemmer

Bug Description

We are seeing that LXD containers on ppc64el systems are always in the pending state.

From juju_status.yaml, this is for a PowerNV system:

machines:
  "0":
    juju-status:
      current: started
      since: 22 Jul 2016 07:06:51Z
      version: 2.0-beta11
    dns-name: 10.245.0.225
    instance-id: 4y3hmr
    machine-status:
      current: running
      message: Deployed
      since: 22 Jul 2016 07:04:29Z
    series: trusty
    containers:
      0/lxd/0:
        juju-status:
          current: pending
          since: 22 Jul 2016 06:56:57Z
        instance-id: pending
        machine-status:
          current: allocating
          message: Starting container
          since: 22 Jul 2016 07:09:00Z
        series: trusty
      0/lxd/1:
        juju-status:
          current: pending
          since: 22 Jul 2016 06:56:59Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Jul 2016 06:56:59Z
        series: trusty
    hardware: arch=ppc64el cpu-cores=128 mem=32565M tags=hardware-ibm-power8-S822LC,whitelist-compute,hw-production,entei
      availability-zone=Production

I have seen this recreated on ppc64el VMs as well.
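
A quick way to spot the stuck containers in a status dump like the one above is to scan it for container entries whose juju-status is pending. This is only a minimal sketch and not part of the original report: `pending_containers` is a hypothetical helper name, and it assumes the key names and nesting shown in juju_status.yaml above.

```shell
# Hypothetical helper: print the names of containers whose juju-status is
# "pending". Reads juju_status.yaml-style text on stdin.
pending_containers() {
  awk '
    # A machine key such as "0": starts a new machine; forget the last container.
    /^ *"[0-9]+":$/             { name = ""; in_js = 0; next }
    # A container key looks like 0/lxd/1: (machine/type/index).
    /^ *[0-9]+\/lx[cd]\/[0-9]+:$/ { name = $1; sub(/:$/, "", name); in_js = 0; next }
    # Track whether we are inside juju-status (and not machine-status).
    /^ *juju-status:$/          { in_js = 1; next }
    /^ *machine-status:$/       { in_js = 0; next }
    in_js && name != "" && $1 == "current:" && $2 == "pending" { print name; in_js = 0 }
  '
}
```

For example, `pending_containers < juju_status.yaml` would list one container name per line. Matching on text rather than parsing YAML keeps the sketch dependency-free, at the cost of relying on the layout shown here.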

+ echo 'JUJU_VERSION =' 2.0-beta11-0ubuntu1~14.04.1~juju1 2.0-beta11-0ubuntu1~14.04.1~juju1 2.3.6-0~38~ubuntu14.04.1
JUJU_VERSION = 2.0-beta11-0ubuntu1~14.04.1~juju1 2.0-beta11-0ubuntu1~14.04.1~juju1 2.3.6-0~38~ubuntu14.04.1

Attaching the logs from that system and bootstrap 0 machine.

Tags: ateam oil oil-2.0
Larry Michel (lmic) wrote :
Anastasia (anastasia-macmood) wrote :

@Larry,

I can see that you are running on trusty. According to a comment by Stephane Graber on a similar bug (https://bugs.launchpad.net/juju-core/+bug/1600311), "The release kernel for trusty (3.13) isn't capable of seccomp and so containers just fail to start."

In fact, I believe that this is a duplicate and will mark it as such. Please re-open if you disagree with reasons \o/

Larry Michel (lmic) wrote :

Anastasia,

I don't think this is a duplicate. I tested it with the hwe-x kernel and it still fails. Also, it's always the ppc64el containers that are failing:

machines:
  "0":
    juju-status:
      current: started
      since: 28 Jul 2016 21:39:22Z
      version: 2.0-beta13
    dns-name: 10.245.0.241
    instance-id: 4y3hgx
    machine-status:
      current: running
      message: Deployed
      since: 28 Jul 2016 21:37:42Z
    series: trusty
    containers:
      0/lxd/0:
        juju-status:
          current: started
          since: 28 Jul 2016 21:41:16Z
          version: 2.0-beta13
        dns-name: 10.245.0.173
        instance-id: juju-fac225-0-lxd-0
        machine-status:
          current: running
          message: Container started
          since: 28 Jul 2016 21:40:45Z
        series: trusty
      0/lxd/1:
        juju-status:
          current: started
          since: 28 Jul 2016 21:40:39Z
          version: 2.0-beta13
        dns-name: 10.245.0.171
        instance-id: juju-fac225-0-lxd-1
        machine-status:
          current: running
          message: Container started
          since: 28 Jul 2016 21:40:04Z
        series: trusty
      0/lxd/2:
        juju-status:
          current: started
          since: 28 Jul 2016 21:40:55Z
          version: 2.0-beta13
        dns-name: 10.245.0.172
        instance-id: juju-fac225-0-lxd-2
        machine-status:
          current: running
          message: Container started
          since: 28 Jul 2016 21:40:25Z
        series: trusty
    hardware: arch=amd64 cpu-cores=24 mem=196608M tags=hardware-cisco-c240-m3,hw-larrymi,moline,cisco-owned
      availability-zone=Production
  "1":
    juju-status:
      current: started
      since: 28 Jul 2016 21:36:47Z
      version: 2.0-beta13
    dns-name: 10.245.0.157
    instance-id: 4y3hds
    machine-status:
      current: running
      message: Deployed
      since: 28 Jul 2016 21:35:47Z
    series: trusty
    containers:
      1/lxd/0:
        juju-status:
          current: pending
          since: 28 Jul 2016 21:31:04Z
        instance-id: pending
        machine-status:
          current: allocating
          message: Starting container
          since: 28 Jul 2016 21:41:01Z
        series: trusty
    hardware: arch=ppc64el cpu-cores=4 mem=8229M tags=hardware-ibm-power8-S822L,blacklist-juno,blacklist-icehouse,blacklist-compute,vm-ppc64el,hw-larrymi,huffman-vm-01,blacklist-precise,hw-bug1605714
      availability-zone=default

$ ls
entei-logs.tar.gz huffman-vm-01-logs.tar.gz juju_status.yaml moline-logs.tar.gz
jenkins@s9-lmic-trusty:~/vmware/ppc64el$ juju ssh 1 'dpkg -l |grep lxd'
ii lxd 2.0.3-0ubuntu2~ubuntu14.04.1 ppc64el Container hypervisor based on LXC - daemon
ii lxd-client 2.0.3-0ubuntu2~ubuntu14.04.1 ppc64el Container hypervisor based on LXC - client
Connection to 10.245.0.157 closed.
$ juju ssh 0 'dpkg -l |grep lxd'
ii lxd 2.0.3-0ubuntu2~ubuntu14.04.1 amd64 Container hypervisor based on LXC - daemon
ii lxd-client 2.0.3-0ubuntu2~ubuntu14.04.1 amd64 Container hyp...

Larry Michel (lmic) wrote :

Adding logs from deployment with hwe-x kernel.

Anastasia (anastasia-macmood) wrote :

@Larry,

If LXD is not in backports on your trusty images, then we cannot support it.

If it is, then this is the same issue as the bug I referred to in comment #2.

Could you please confirm?

Changed in juju-core:
status: New → Incomplete
Larry Michel (lmic) wrote :

Anastasia,
Yes, LXD is in backports on our trusty images. My understanding was that switching to the hwe-x kernel would confirm that this is a different issue, since Stephane's comment pertained to a different kernel. I am OK with duping this back to the other bug if you think the hwe-x bit does not apply.

Changed in juju-core:
status: Incomplete → New
Changed in juju-core:
status: New → Triaged
milestone: none → 2.0.0
importance: Undecided → High
Changed in juju-core:
importance: High → Critical
milestone: 2.0.0 → 2.0-beta15
Changed in juju-core:
assignee: nobody → Richard Harding (rharding)
Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Changed in juju:
assignee: Richard Harding (rharding) → Mick Gregg (macgreagoir)
Mick Gregg (macgreagoir) wrote :

I'm seeing Juju on trusty on ppc64le trying to use the LXD port, on which the host is not listening in this case. I'm still trying to understand why, in order to work towards a fix.

Mick Gregg (macgreagoir) wrote :

@lmic I've been trying to reproduce this with beta releases up to and including tip, and I can't make it fail consistently.

My test machine is ppc64le with trusty, linux-generic-lts-xenial and lxd 2.0.3.

I have seen one failure, noted in bug 1618636, but I've confirmed it was a result of that bug and worked around it as noted (`lxc config set core.https_address [::]`) for beta16.

I'm testing with bootstrap as an LXC container deployment, which is not quite what you're testing. Can I ask you to try your test again with the upgraded versions I mention here? You will also want to apply the https_address config workaround on your container host machines.
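
For reference, the workaround could be applied and read back on each container host roughly as follows. This is a sketch only: `apply_https_workaround` and the `LXC` override variable are hypothetical names introduced here, while `lxc config set` and `lxc config get` are the standard LXD client subcommands.

```shell
# Hypothetical wrapper: set LXD's HTTPS listen address to all interfaces
# and read the value back. LXC overrides the client command, e.g.
# LXC="sudo lxc" on hosts where the user is not in the lxd group,
# or LXC=echo for a dry run that only prints the arguments.
apply_https_workaround() {
  lxc_cmd=${LXC:-lxc}
  # Quote [::] so the shell never treats it as a glob pattern.
  $lxc_cmd config set core.https_address '[::]' &&
  $lxc_cmd config get core.https_address
}
```

Running it with `LXC=echo` first shows the two commands' arguments without touching the LXD daemon.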

Changed in juju:
status: Triaged → In Progress
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Changed in juju:
status: In Progress → Incomplete
Larry Michel (lmic) wrote :

Per my IRC conversation with @macgreagoir yesterday, I did not have access to the ppc64el systems until yesterday. I am now testing to try to recreate this. However, I no longer have a snapshot older than beta17, so I will be trying with that.

Larry Michel (lmic) wrote :

@macgreagoir, I don't see any of the failures you mentioned, and I also tried the workaround you mentioned after deploying the bundle. I then tried to add a unit but didn't see a new container on that ppc64el node. Finally, I tried manually creating a new container on another ppc64el node (a VM this time), and the start hung:

ubuntu@huffman-vm-01:~$ sudo lxc list
+---------------------+---------+------+------+------------+-----------+
| NAME                | STATE   | IPV4 | IPV6 | TYPE       | SNAPSHOTS |
+---------------------+---------+------+------+------------+-----------+
| dev                 | STOPPED |      |      | PERSISTENT | 0         |
+---------------------+---------+------+------+------------+-----------+
| juju-78ce7e-2-lxd-0 | STOPPED |      |      | PERSISTENT | 0         |
+---------------------+---------+------+------+------------+-----------+

ubuntu@huffman-vm-01:~$ sudo lxc launch ubuntu:14.04 dev
Creating dev
Starting dev

One thing I see in the juju logs is this being repeated:

Please install LXD by running:
        $ sudo apt-get install lxd
and then configure it with:
        $ newgrp lxd
        $ lxd init

2016-09-09 10:38:25 ERROR juju.provisioner container_initialisation.go:106 starting container provisioner for lxd: setting up container dependencies on host machine: can't connect to the local LXD server: LXD socket not found; is LXD installed & running?

Please install LXD by running:
        $ sudo apt-get install lxd
and then configure it with:
        $ newgrp lxd
        $ lxd init

2016-09-09 10:38:25 ERROR juju.worker runner.go:210 exited "2-container-watcher": worker "2-container-watcher" exited: setting up container dependencies on host machine: can't connect to the local LXD server: LXD socket not found; is LXD installed & running?

Please install LXD by running:
        $ sudo apt-get install lxd
and then configure it with:
        $ newgrp lxd
        $ lxd init
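
Since the error above says the LXD socket was not found, one simple check on the host is whether the socket file exists at all. The sketch below assumes the default socket path used by the LXD 2.0 packages, /var/lib/lxd/unix.socket; both that path and the helper name `lxd_socket_present` are assumptions, not something taken from the logs.

```shell
# Hypothetical helper: succeed only if a Unix socket exists at the given
# path (default: the assumed LXD 2.0 location).
lxd_socket_present() {
  sock=${1:-/var/lib/lxd/unix.socket}
  [ -S "$sock" ]
}
```

For example, `lxd_socket_present || echo "LXD socket missing"` on the ppc64el host would show whether the daemon ever created its socket, which is a separate question from whether a container start hangs.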

Changed in juju:
status: Incomplete → New
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta18 → 2.0-beta19
Changed in juju:
milestone: 2.0-beta19 → 2.0-rc1
Changed in juju:
status: New → Triaged
Larry Michel (lmic) wrote :

This looks fixed in beta18. I tested with a PowerNV bare-metal system and the containers came up OK.

Anastasia (anastasia-macmood) wrote :

This is great news \o/
Thank you, Larry :D

Changed in juju:
status: Triaged → Fix Released
milestone: 2.0-rc1 → none
milestone: none → 2.0-rc1
Larry Michel (lmic) wrote :

We recreated this with beta18 on ppc64el so setting this back to new. Going back to my previous test, I now see that something was wrong with the bundle and I missed that the lxd container placement line had been removed.

machines:
  "0":
    juju-status:
      current: started
      since: 21 Sep 2016 10:39:49Z
      version: 2.0-beta18
    dns-name: 10.245.0.205
    instance-id: 4y3khr
    machine-status:
      current: running
      message: Deployed
      since: 21 Sep 2016 10:38:52Z
    series: trusty
    containers:
      0/lxd/0:
        juju-status:
          current: started
          since: 21 Sep 2016 10:41:18Z
          version: 2.0-beta18
        dns-name: 10.245.0.217
        instance-id: juju-cecd17-0-lxd-0
        machine-status:
          current: running
          message: Container started
          since: 21 Sep 2016 10:40:45Z
        series: trusty
      0/lxd/1:
        juju-status:
          current: started
          since: 21 Sep 2016 10:41:43Z
          version: 2.0-beta18
        dns-name: 10.245.1.55
        instance-id: juju-cecd17-0-lxd-1
        machine-status:
          current: running
          message: Container started
          since: 21 Sep 2016 10:41:09Z
        series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=hardware-hp-proliant-DL320E,anahuac,hwe-x,hw-staging-xenial
      availability-zone=default
  "1":
    juju-status:
      current: started
      since: 21 Sep 2016 10:43:14Z
      version: 2.0-beta18
    dns-name: 10.245.0.180
    instance-id: 4y3xaw
    machine-status:
      current: running
      message: Deployed
      since: 21 Sep 2016 10:41:23Z
    series: trusty
    containers:
      1/lxd/0:
        juju-status:
          current: started
          since: 21 Sep 2016 10:45:05Z
          version: 2.0-beta18
        dns-name: 10.245.0.160
        instance-id: juju-cecd17-1-lxd-0
        machine-status:
          current: running
          message: Container started
          since: 21 Sep 2016 10:44:22Z
        series: trusty
      1/lxd/1:
        juju-status:
          current: started
          since: 21 Sep 2016 10:45:39Z
          version: 2.0-beta18
        dns-name: 10.245.1.16
        instance-id: juju-cecd17-1-lxd-1
        machine-status:
          current: running
          message: Container started
          since: 21 Sep 2016 10:44:56Z
        series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=16384M tags=hardware-dell-poweredge-R810,prunes,hw-staging-xenial
      availability-zone=default
  "2":
    juju-status:
      current: started
      since: 21 Sep 2016 10:37:48Z
      version: 2.0-beta18
    dns-name: 10.245.1.20
    instance-id: 4y3hdx
    machine-status:
      current: running
      message: Deployed
      since: 21 Sep 2016 10:36:55Z
    series: trusty
    containers:
      2/lxd/0:
        juju-status:
          current: pending
          since: 21 Sep 2016 10:32:05Z
        instance-id: pending
        machine-status:
          current: allocating
          message: Starting container
          since: 21 Sep 2016 10:46:29Z
        series: trusty
    hardware: arch=ppc64el cpu-cores=4 mem=8229M tags=hardware-ibm-power8-S822L,blacklist-compute...

Changed in juju:
status: Fix Released → New
Changed in juju:
status: New → Triaged
milestone: 2.0-rc1 → 2.0-rc2
assignee: Mick Gregg (macgreagoir) → Richard Harding (rharding)
Anastasia (anastasia-macmood) wrote :

Larry,

Thank you for verifying \o/

If you have a chance to try out rc1, it'd be great to get more recent logs.

Changed in juju:
milestone: 2.0-rc2 → 2.0.0
Larry Michel (lmic) wrote :

I've recreated with RC1 and am attaching the logs from node 0.

$ juju status
MODEL    CONTROLLER    CLOUD/REGION  VERSION
default  mycontroller  larry         2.0-rc1

APP                    VERSION  STATUS   SCALE  CHARM                  STORE       REV  OS      NOTES
keystone                        waiting  0/1    keystone               jujucharms  256  ubuntu
neutron-api                     waiting  0/1    neutron-api            jujucharms  244  ubuntu
nova-cloud-controller           blocked  1      nova-cloud-controller  jujucharms  290  ubuntu

UNIT                     WORKLOAD  AGENT       MACHINE  PUBLIC-ADDRESS  PORTS     MESSAGE
keystone/0               waiting   allocating  0/lxd/0                            waiting for machine
neutron-api/0            waiting   allocating  0/lxd/1                            waiting for machine
nova-cloud-controller/0  blocked   idle        0        10.245.1.12     8774/tcp  Missing relations: messaging, image, compute, database; incomplete relations: neutron-api, identity

MACHINE  STATE    DNS          INS-ID   SERIES  AZ
0        started  10.245.1.12  4y3hmr   trusty  Production
0/lxd/0  pending               pending  trusty
0/lxd/1  pending               pending  trusty

RELATION          PROVIDES               CONSUMES               TYPE
cluster           keystone               keystone               peer
identity-service  keystone               neutron-api            regular
identity-service  keystone               nova-cloud-controller  regular
cluster           neutron-api            neutron-api            peer
neutron-api       neutron-api            nova-cloud-controller  regular
cluster           nova-cloud-controller  nova-cloud-controller  peer

Larry Michel (lmic) wrote :

The latest recreate was with a PowerNV bare metal system.

Changed in juju:
assignee: Richard Harding (rharding) → Alexis Bruemmer (alexis-bruemmer)
tags: added: ateam
Larry Michel (lmic) wrote :

I did not recreate this with RC2. Last time there was something wrong with my test, so I am including the entire output here this time to make sure I didn't misread the results:

I logged in on the PowerNV system and I can see the containers running.

ubuntu@entei:~$ lxc list
Generating a client certificate. This may take a minute...
If this is your first time using LXD, you should also run: sudo lxd init
To start your first container, try: lxc launch ubuntu:16.04

Permission denied, are you in the lxd group?
ubuntu@entei:~$ sudo lxc list
+---------------------+---------+---------------------+------+------------+-----------+
| NAME                | STATE   | IPV4                | IPV6 | TYPE       | SNAPSHOTS |
+---------------------+---------+---------------------+------+------------+-----------+
| juju-f3a098-1-lxd-0 | RUNNING | 10.245.0.224 (eth0) |      | PERSISTENT | 0         |
+---------------------+---------+---------------------+------+------------+-----------+
| juju-f3a098-1-lxd-1 | RUNNING | 10.245.0.189 (eth0) |      | PERSISTENT | 0         |
+---------------------+---------+---------------------+------+------------+-----------+
ubuntu@entei:~$ uname -a
Linux entei 4.4.0-38-generic #57~14.04.1-Ubuntu SMP Tue Sep 6 17:19:10 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@entei:~$

and output of juju status:

jenkins@lmic-s9-instance:~/ppc64el$ juju status
MODEL    CONTROLLER    CLOUD/REGION  VERSION
default  mycontroller  larry         2.0-rc2

APP                    VERSION  STATUS   SCALE  CHARM                  STORE       REV  OS      NOTES
ceph                            active   1      ceph                   jujucharms  265  ubuntu
cinder                          active   1      cinder                 jujucharms  255  ubuntu
glance                          active   1      glance                 jujucharms  251  ubuntu
keystone                        active   1      keystone               jujucharms  256  ubuntu
mysql                           active   1      percona-cluster        jujucharms  244  ubuntu
neutron-api                     active   1      neutron-api            jujucharms  244  ubuntu
neutron-gateway                 active   1      neutron-gateway        jujucharms  230  ubuntu
neutron-openvswitch             active   2      neutron-openvswitch    jujucharms  236  ubuntu
nova-cloud-controller           waiting  1      nova-cloud-controller  jujucharms  290  ubuntu
nova-compute                    active   2      nova-compute           jujucharms  257  ubuntu
openstack-dashboard             active   1      openstack-dashboard    jujucharms  241  ubuntu
rabbitmq-server                 active   1      rabbitmq-server        jujucharms  50   ubuntu

UNIT        WORKLOAD  AGENT  MACHINE  PUBLIC-ADDRESS  PORTS     MESSAGE
ceph/0      active    idle   0        10.245.0.212              Unit is ready and clustered
cinder/0    active    idle   3        10.245.0.181    8776/tcp  Unit is ready
glance/0    active    idle   0/lxd/0  10.245.0.223    9292/tcp  Unit is ready
keystone/0  active    idle   ...

Changed in juju:
status: Triaged → Invalid
status: Invalid → Fix Released