2.1beta4: LXDs getting stuck pending in cloud-init during Landscape Autopilot deployment

Bug #1655783 reported by Francis Ginther
This bug affects 2 people
Affects                  Status         Importance  Assigned to       Milestone
Autopilot Log Analyser   Fix Committed  High        Francis Ginther
Canonical Juju           Fix Released   High        Unassigned

Bug Description

I've found a few cases of this:

https://ci.lscape.net/job/landscape-system-tests/4940
https://ci.lscape.net/job/landscape-system-tests/4952
https://ci.lscape.net/job/landscape-system-tests/4963

The symptoms are an lxd unit being left in the 'pending' state despite the deployment being several hours old (the deployment times out after 4 hours). The process listing also shows the lxc container still running cloud-init (see https://pastebin.canonical.com/175702/ for better formatting):

[from ps-fauxww.txt]
root 5617 0.0 0.0 73384 4024 ? Ss 14:44 0:00 [lxc monitor] /var/lib/lxd/containers juju-8a4dbf-1-lxd-1
100000 5655 0.0 0.0 37556 5560 ? Ss 14:44 0:01 \_ /sbin/init
100000 5870 0.0 0.0 41720 3240 ? Ss 14:44 0:00 \_ /lib/systemd/systemd-udevd
100000 5879 0.0 0.0 52052 14920 ? Ss 14:44 0:02 \_ /lib/systemd/systemd-journald
100000 6231 0.0 0.0 65520 5868 ? Ss 14:44 0:00 \_ /usr/sbin/sshd -D
100001 6237 0.0 0.0 26044 2192 ? Ss 14:44 0:00 \_ /usr/sbin/atd -f
100107 6243 0.0 0.0 42892 3972 ? Ss 14:44 0:00 \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
100000 6249 0.0 0.0 209044 12548 ? Ssl 14:44 0:00 \_ /usr/lib/snapd/snapd
100000 6269 0.0 0.0 26068 2560 ? Ss 14:44 0:00 \_ /usr/sbin/cron -f
100000 6274 0.0 0.0 272940 5912 ? Ssl 14:44 0:00 \_ /usr/lib/accountsservice/accounts-daemon
100000 6277 0.0 0.0 20100 1176 ? Ss 14:44 0:00 \_ /lib/systemd/systemd-logind
100104 6282 0.0 0.0 186900 3304 ? Ssl 14:44 0:00 \_ /usr/sbin/rsyslogd -n
100000 6314 0.0 0.0 277180 6136 ? Ssl 14:44 0:00 \_ /usr/lib/policykit-1/polkitd --no-debug
100000 6357 0.0 0.0 12844 1776 pts/1 Ss+ 14:44 0:00 \_ /sbin/agetty --noclear --keep-baud console 115200 38400 9600 linux
100000 6724 0.0 0.1 93124 30556 ? Ss 14:44 0:00 \_ /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
100000 6882 0.0 0.0 4508 800 ? S 14:44 0:00 \_ /bin/sh -c tee -a /var/log/cloud-init-output.log
100000 6883 0.0 0.0 4384 764 ? S 14:44 0:00 | \_ tee -a /var/log/cloud-init-output.log
100000 6998 0.0 0.0 4508 1668 ? S 14:44 0:00 \_ /bin/sh /var/lib/cloud/instance/scripts/runcmd
100000 713392 0.0 0.0 4380 652 ? S 18:16 0:00 \_ sleep 15

Next step is to catch this on a live system and extract the cloud-init logs from the host. Unfortunately, these are not currently saved in the log dump.
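
For next time, a rough sketch of how those logs could be grabbed from the LXD host. The container name is just the one from the ps listing above, and the paths are the standard cloud-init log locations, so treat this as an assumption rather than a verified recipe:

# run on the LXD host of the stuck unit
lxc file pull juju-8a4dbf-1-lxd-1/var/log/cloud-init.log ./cloud-init-1-lxd-1.log
lxc file pull juju-8a4dbf-1-lxd-1/var/log/cloud-init-output.log ./cloud-init-output-1-lxd-1.log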

Revision history for this message
Francis Ginther (fginther) wrote :
information type: Proprietary → Public
Revision history for this message
Simon Poirier (simpoir) wrote :

I've hit this a few times also. In all cases it was only the first LXD container to spawn (first in time, not necessarily the lowest container number). All containers spawning seconds later started OK.

It looks like containers are launched before the network is fully set up.
I'm attaching cloud-init logs for the last failure I've hit.

Revision history for this message
Simon Poirier (simpoir) wrote :
Revision history for this message
Simon Poirier (simpoir) wrote : Re: LXDs getting stuck pending in cloud-init during Landscape Autopilot deployment
summary: - LXCs getting stuck in cloud-init during Landscape Autopilot deployment
+ LXDs getting stuck pending in cloud-init during Landscape Autopilot
+ deployment
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Just had another case of this: http://ci.lscape.net/job/landscape-system-tests/4996/

The first container on a new machine is stuck, the rest started just fine:
...
  '2':
    containers:
      2/lxd/0:
        instance-id: juju-bbf56e-2-lxd-0
        juju-status: {current: pending, since: '13 Jan 2017 03:01:33Z'}
        machine-status: {current: running, message: Container started, since: '13
            Jan 2017 03:03:05Z'}
        series: xenial
      2/lxd/1:
        dns-name: 10.96.109.192
        instance-id: juju-bbf56e-2-lxd-1
        ip-addresses: [10.96.109.192]
        juju-status: {current: started, since: '13 Jan 2017 03:05:07Z', version: 2.1-beta4}
        machine-status: {current: running, message: Container started, since: '13
            Jan 2017 03:03:48Z'}
        series: xenial
...

I saw this many months ago in juju, probably even in juju1. If I find a reference to that, I'll attach it here.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I don't have logs from the host because of bug https://bugs.launchpad.net/juju/+bug/1646329. Our log collector script failed because it uses juju ssh/scp and juju gave it the wrong IP to connect to.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Interestingly, cloud-init-output.log, attached in comment #3, is reporting these failures:

Something wicked happened resolving '10.5.0.10:3142' (-9 - Address family for hostname not supported)
Err:12 http://security.ubuntu.com/ubuntu xenial-security/main amd64 libnspr4 amd64 2:4.12-0ubuntu0.16.04.1

Changed in landscape:
milestone: none → 17.01
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

@anastasia-macmood, this is because the container didn't get its networking set up. For some reason, this container has no static IP entry in /etc/network/interfaces or interfaces.d/*. I'm triaging another failed run now.
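
A quick way to confirm the missing static entry from the LXD host would be something like the following (container name taken from the failed 4996 run above, purely as an example):

# inspect the network config inside the stuck container
lxc exec juju-bbf56e-2-lxd-0 -- cat /etc/network/interfaces
lxc exec juju-bbf56e-2-lxd-0 -- grep -r "iface" /etc/network/interfaces.d/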

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We have another fresh run where this happened:
http://ci.lscape.net/job/landscape-system-tests/5040/

I'm going to attach some log files; to avoid confusion, all logs related to this run (5040) will have that number in their filenames.

First, here is an interesting lxc list output on the hendel node, showing only one container without an IP. Note also that it's not the first one:

http://pastebin.ubuntu.com/23823808/

(width trimmed for better readability)
+---------------------+---------+--------------------+
| NAME | STATE | IPV4 |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-0 | RUNNING | 10.96.24.66 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-1 | RUNNING | |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-2 | RUNNING | 10.96.24.58 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-3 | RUNNING | 10.96.24.53 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-4 | RUNNING | 10.96.24.68 (eth0) |
+---------------------+---------+--------------------+
| juju-0f41e5-1-lxd-5 | RUNNING | 10.96.24.63 (eth0) |
+---------------------+---------+--------------------+
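
A narrower way to spot such containers on the host would be something like this (assuming an lxc client new enough to support csv output):

# print only containers that report no IPv4 address
lxc list --format csv -c n,s,4 | awk -F, '$3 == ""'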

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I think this 1-lxd-1 container became node gweyff in MAAS:

Jan 18 12:49:01 atlas maas.api: [info] juju-0f41e5-1-lxd-1: Added new device (parent: hendel)
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "GET /MAAS/api/2.0/version/ HTTP/1.1" 200 180 "-" "Go-http-client/1.1"
Jan 18 12:49:01 atlas sh[6611]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "POST /MAAS/api/2.0/devices/?op= HTTP/1.1" 200 506 "-" "Go-http-client/1.1"
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "PUT /MAAS/api/2.0/nodes/gweyff/interfaces/214788/ HTTP/1.1" 200 242 "-" "Go-http-client/1.1"

And I can only see these two API calls made from juju to MAAS about it:
Jan 18 12:49:01 atlas sh[6616]: 2017-01-18 12:49:01 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:01 +0000] "PUT /MAAS/api/2.0/nodes/gweyff/interfaces/214788/ HTTP/1.1" 200 242 "-" "Go-http-client/1.1"
Jan 18 12:49:02 atlas sh[6611]: 2017-01-18 12:49:02 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:02 +0000] "GET /MAAS/api/2.0/devices/?id=gweyff HTTP/1.1" 200 519 "-" "Go-http-client/1.1"

What is missing, and what does exist for the other containers, is an op=link_subnet POST call, as seen for example for this other container, qesn6m:
Jan 18 12:49:08 atlas sh[6616]: 2017-01-18 12:49:08 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:08 +0000] "PUT /MAAS/api/2.0/nodes/qesn6m/interfaces/214789/ HTTP/1.1" 200 243 "-" "Go-http-client/1.1"
Jan 18 12:49:15 atlas sh[6611]: 2017-01-18 12:49:15 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:15 +0000] "POST /MAAS/api/2.0/nodes/qesn6m/interfaces/214789/?op=link_subnet HTTP/1.1" 200 561 "-" "Go-http-client/1.1"
Jan 18 12:49:15 atlas sh[6611]: 2017-01-18 12:49:15 twisted.python.log: [info] ::ffff:127.0.0.1 - - [18/Jan/2017:12:49:15 +0000] "GET /MAAS/api/2.0/devices/?id=qesn6m HTTP/1.1" 200 860 "-" "Go-http-client/1.1"

I don't know why juju wouldn't make this link_subnet call. Maybe it didn't like the response to the first PUT call?
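
If a manual workaround is ever needed, the missing call can in principle be issued via the MAAS CLI. This is only a sketch: the 'admin' profile name and the subnet CIDR are assumptions, while the system id and interface id are the ones from the log lines above:

# hypothetical manual workaround; profile name and subnet are assumptions
maas admin interface link-subnet gweyff 214788 mode=STATIC subnet=10.96.24.0/24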

I'm attaching this syslog log.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/etc/* and /var/log/* from inside of the stuck container 1/lxd/1

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/etc/* and /var/log/* from the host (hendel) of the stuck 1/lxd/1 container. Note there is a base-machine-4/var/log/ps-fauxww.txt file with the process listing generated at log collection time.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/etc/* and /var/log/* from the controller node. This also has a ps file in the same location with the process listing.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/var/log/syslog on the MAAS node, with just the entries that cover this failed #5040 deployment. Note this is a shared maas server and it will have log entries about other things happening to this node, not just this failed deployment.

I also have regiond.log and maas.log, but I thought this syslog bit was better because it has all messages intertwined.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju status, for completeness.

Chad Smith (chad.smith)
summary: - LXDs getting stuck pending in cloud-init during Landscape Autopilot
- deployment
+ 2.1beta4: LXDs getting stuck pending in cloud-init during Landscape
+ Autopilot deployment
Chad Smith (chad.smith)
Changed in landscape:
status: New → Confirmed
Changed in landscape:
importance: Undecided → High
assignee: nobody → Francis Ginther (fginther)
Changed in autopilot-log-analyser:
status: New → In Progress
importance: Undecided → High
Changed in autopilot-log-analyser:
status: In Progress → Fix Committed
Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
John A Meinel (jameinel)
Changed in juju:
milestone: 2.2.0 → 2.1.0
Revision history for this message
John A Meinel (jameinel) wrote :

All Juju versions <= 2.1b4 have an issue where, if there is an error, we end up setting it in a place (agent status) that then gets overridden with an "agent is not running" message. I already have a patch in the 2.1 branch that at least moves the error message into machine-status instead of agent status. That field isn't shown in tabular status by default, but you can see it with "juju status --format=yaml".
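
For example, something along these lines should surface that field for the stuck containers (the grep is just illustrative):

juju status --format=yaml | grep -B1 -A3 machine-status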

Revision history for this message
John A Meinel (jameinel) wrote :

"I don't know why juju wouldn't make this link_subnet call. Maybe it didn't like the response to the first PUT call?" is likely that Juju ran into some other error before it got to that point, which leaves us trying to infer it from the log files.

Is it possible to do a run at "DEBUG" level instead of WARNING? I see this:
2017-01-18 12:48:12 DEBUG juju.worker.logger logger.go:50 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"

And unfortunately, the only warnings we get are:
2017-01-18 12:48:59 WARNING juju.provisioner broker.go:97 incomplete DNS config found, discovering host's DNS config

That is an indication that it is trying to provision the instances (it reports that once for each container), but clearly whatever failure is actually going on isn't being logged at WARNING level.

Or, if full DEBUG is too verbose, we could probably use:
logging-config="<root>=WARNING;unit=DEBUG;juju.provisioner=DEBUG"
which would at least help.
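
For reference, on a 2.x model I would expect that to be applied with something like this (exact invocation may vary with client version):

# set the suggested logging config on the current model
juju model-config logging-config="<root>=WARNING;unit=DEBUG;juju.provisioner=DEBUG"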

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This bug isn't happening frequently enough to justify running all our deployments with that extra debugging. The extra debugging increases load, makes deployments take longer, and also makes mongo use more memory. I was hoping that code inspection around the area that makes the link_subnet call to MAAS could yield some results.

Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Could you please re-try with Juju 2.1-rc2?

Changed in juju:
status: Triaged → Incomplete
milestone: 2.1-rc2 → none
Revision history for this message
Chad Smith (chad.smith) wrote :

We have upgraded to juju 2.1.1 for deployments and haven't seen this in a while. Will reopen if we see it again.

Changed in landscape:
status: Confirmed → Incomplete
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.02 → 17.03
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking this as Fix Committed in 2.1.1 based on comment #20.

Any further failures in this area should have a separate report with newer logs.

Changed in juju:
status: Incomplete → Fix Released
milestone: none → 2.1.1
David Britton (dpb)
no longer affects: landscape