intermittent: failed to retrieve the template to clone: template container juju-trusty-lxc-template did not stop
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | Invalid | Medium | Unassigned | |
Bug Description
NOTE: this is an intermittent failure. The log file in comment #1 shows that many other LXC containers do start properly on other host machines.
This is similar to bug #1348386, but that one is fixed, and we have been seeing this with 1.22.
This happens for OpenStack deployment:
+ . ./pipeline_
++ export OPENSTACK_
++ OPENSTACK_
++ export COMPUTE=nova-kvm
++ COMPUTE=nova-kvm
++ export BLOCK_STORAGE=
++ BLOCK_STORAGE=
++ export IMAGE_STORAGE=
++ IMAGE_STORAGE=
++ export PIPELINE_
++ PIPELINE_
++ export NETWORKING=
++ NETWORKING=
++ export UBUNTU_
++ UBUNTU_
From juju-debug log file:
machine-4[3721]: 2015-04-05 05:32:17 INFO juju.container.lxc clonetemplate.
machine-4[3721]: 2015-04-05 05:37:42 INFO juju.container.lxc clonetemplate.
machine-4[3721]: 2015-04-05 05:37:42 INFO juju.container lock.go:66 release lock "juju-trusty-
machine-4[3721]: 2015-04-05 05:37:42 ERROR juju.provisione
machine-4[3721]: 2015-04-05 05:37:42 ERROR juju.provisioner provisioner_
From juju_status.yaml:
'4':
  agent-state: started
  agent-version: 1.22.0
  containers:
    4/lxc/0:
      series: trusty
    4/lxc/1:
      series: trusty
| Larry Michel (lmic) wrote : | #1 |
| tags: | added: lxc |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → Medium |
| Changed in juju-core: | |
| importance: | Medium → High |
| milestone: | none → 1.24-alpha1 |
| tags: | added: ci test-failure |
| tags: | added: vivid |
| Changed in juju-core: | |
| assignee: | nobody → Katherine Cox-Buday (cox-katherine-e) |
Various interesting bits from the log:
lxc-start 1426805367.372 WARN lxc_confile - confile.
lxc-start 1426805367.373 WARN lxc_log - log.c:lxc_
lxc-start 1426805367.376 WARN lxc_cgmanager - cgmanager.
...
lxc-start 1426805367.662 ERROR lxc_apparmor - lsm/apparmor.
lxc-start 1426805367.662 ERROR lxc_sync - sync.c:
lxc-start 1426805367.662 ERROR lxc_start - start.c:
lxc-start 1426805367.663 ERROR lxc_cgmanager - cgmanager.
lxc-start 1426805367.663 ERROR lxc_cgmanager - cgmanager.
lxc-start 1426805367.700 WARN lxc_commands - commands.
lxc-start 1426805367.701 WARN lxc_cgmanager - cgmanager.
lxc-start 1426805372.706 ERROR lxc_start_ui - lxc_start.
lxc-start 1426805372.706 ERROR lxc_start_ui - lxc_start.
lxc-start 1426805372.706 ERROR lxc_start_ui - lxc_start.
lxc-start 1426806676.505 INFO lxc_start_ui - lxc_start.
lxc-start 1426806676.505 WARN lxc_confile - confile.
lxc-start 1426806676.506 WARN lxc_log - log.c:lxc_
lxc-start 1426806676.508 WARN lxc_cgmanager - cgmanager.
Discussions with AppArmor/LXC experts lead us to believe that this is a possible race in Juju: i.e., the apt-get install of lxc may not yet be complete by the time we attempt to use lxc commands. Clues include the fact that changing apparmor profiles fails several times but eventually succeeds. We believe there is a secondary issue causing the spam at the tail of the log (peer has disconnected), but the thought is that solving the first issue might solve the secondary one, or at least make it clearer what's happening.
Further investigation is needed.
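A workaround consistent with this theory is to wait for the package manager and the lxc tools to become usable before cloning. A minimal shell sketch, assuming the dpkg lock path and retry bounds shown here (both illustrative; this is not what juju runs):
"""
#!/bin/sh
# Illustrative only: wait for any in-flight package installation
# to finish before touching lxc.
for i in $(seq 1 60); do
    fuser /var/lib/dpkg/lock >/dev/null 2>&1 || break
    sleep 5
done
# Confirm the lxc userspace tools respond before cloning the template.
lxc-ls >/dev/null 2>&1 || { echo "lxc not ready" >&2; exit 1; }
"""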
| Tim Penhey (thumper) wrote : | #6 |
Curtis, can you please file a different bug for vivid? I'm 99% certain that it has a different cause.
The way we get the template container to stop is to add an upstart job that shuts down the machine. Since vivid uses systemd, we need a more robust solution there.
The errors shown above are from trusty, and different.
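For context, the upstart approach amounts to a job of roughly this shape (the file name and trigger event are illustrative, not the exact job juju writes):
"""
# /etc/init/juju-template-shutdown.conf (illustrative name)
# Halt the template container once cloud-init finishes, so the
# host sees it stop and can clone it.
description "shut down lxc template container"
start on stopped cloud-final
task
exec /sbin/shutdown -h now
"""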
| summary: |
| - failed to retrieve the template to clone: template container juju-trusty-lxc-template did not stop |
| + intermittent: failed to retrieve the template to clone: template container juju-trusty-lxc-template did not stop |
| description: | updated |
| Tim Penhey (thumper) wrote : | #7 |
Hey Larry,
Can we gather extra logging information from this environment? Or has it been torn down?
If you can, we'd love everything from /var/lib/
Also /var/log/
| Tim Penhey (thumper) wrote : | #8 |
The source of this problem is almost certainly a race condition on the host machine.
In order to reduce the number of packages we install by default on the cloud instances, the container packages are installed in a "just in time" manner. It seems that it isn't quite in time; or, more precisely, the packages are installed, but some of the other components that are needed haven't reached a stable state before we try to use them in anger.
What we probably want is some form of 'ready check' that we can run after the packages are installed, before we try to create the template container. We just don't know what that check should be yet.
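One possible shape for that ready check, sketched in shell; the specific probes are guesses at what "stable state" means, based on the cgmanager and apparmor errors in the log above:
"""
#!/bin/sh
# Hypothetical ready probe, not juju code: poll the pieces the log
# shows failing until they answer, then allow template creation.
ready() {
    # cgmanager must be up (upstart job on trusty)
    status cgmanager 2>/dev/null | grep -q running || return 1
    # the default lxc apparmor profile must be loaded
    grep -q lxc-container-default /sys/kernel/security/apparmor/profiles || return 1
}
for i in $(seq 1 30); do
    ready && exit 0
    sleep 2
done
echo "lxc stack not ready after 60s" >&2
exit 1
"""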
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| tags: | removed: ci test-failure vivid |
| no longer affects: | juju-core/1.23 |
| Curtis Hovey (sinzui) wrote : | #9 |
As of commit 2e07936c in 1.23, the aws-deployer-bundle test fails like this bug report:
machines:
  "0":
    agent-state: started
    agent-version: 1.23-beta4
    dns-name: 52.5.226.249
    instance-id: i-470c67bb
    instance-state: running
    series: trusty
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-
    state-
  "1":
    agent-state: started
    agent-version: 1.23-beta4
    dns-name: 52.4.228.68
    instance-id: i-b40c5a63
    instance-state: running
    series: trusty
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-
  "2":
    agent-state: started
    agent-version: 1.23-beta4
    dns-name: 52.5.201.139
    instance-id: i-e9983b14
    instance-state: running
    series: trusty
    containers:
      2/lxc/0:
        series: trusty
      2/lxc/1:
        series: trusty
    hardware: arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-
I captured the container log from the machine before it was destroyed.
| tags: | added: ci deployer regression |
| Changed in juju-core: | |
| status: | Incomplete → Triaged |
| importance: | High → Critical |
| importance: | Critical → High |
| Curtis Hovey (sinzui) wrote : | #10 |
I removed the "ci" tag from this bug because the commit that caused this was reverting a feature. We don't want to block the secondary fixes that this branch needs.
| tags: | removed: ci |
| Curtis Hovey (sinzui) wrote : | #11 |
Using the bundle found at
http://
you can run the same test as CI
juju bootstrap
juju --show-log deployer --debug --deploy-delay 10 --config bundles.yaml
This works on 1.22.1 and on 1.23-beta4 (cut from cherylj's commit).
Starting with commit 2e07936c (which reports itself as 1.23-beta4, but is not), deployments with containers in AWS fail.
For the record, apparmor did change on the machine last week: 2.8.95~
However, the existence of a working version of 1.23 lends credence to commit 2e07936c being the culprit. Curtis also checked to ensure the AWS mirrors were not stale.
| no longer affects: | juju-core/1.23 |
| Changed in juju-core: | |
| importance: | High → Medium |
| assignee: | Katherine Cox-Buday (cox-katherine-e) → nobody |
| Changed in juju-core: | |
| milestone: | 1.24-alpha1 → none |
| tags: | added: systemd |
| tags: | added: upstart |
| Larry Michel (lmic) wrote : | #13 |
We hit this twice yesterday:
'5':
  agent-state: started
  agent-version: 1.23.2
  containers:
    5/lxc/0:
      series: trusty
    5/lxc/1:
      series: trusty
  dns-name: hayward-16.oil
  hardware: arch=amd64 cpu-cores=8 mem=16384M tags=debug,
  instance-id: /MAAS/api/
  series: trusty
| Alvaro Uría (aluria) wrote : | #14 |
Hello,
This also happens on juju 1.22.6.1 (the current stable package in Ubuntu Trusty - 1.22.6-
"""
machines:
  "0":
    agent-state: started
    agent-version: 1.22.6.1
    dns-name: os-1.maas
    instance-id: /MAAS/api/
    series: trusty
    containers:
      0/lxc/0:
        series: trusty
"""
I even tried adding "apt-get install lxc -qy" in the d-i late_commands stage, with the same effect (see the sketch below).
Cheers,
-Alvaro.
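For anyone trying the same workaround, a late_command of that shape looks roughly like this (illustrative only; Alvaro's exact preseed is not shown here):
"""
# d-i preseed fragment (illustrative)
d-i preseed/late_command string in-target apt-get install -qy lxc
"""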
| tags: | added: canonical-bootstack |
| Cheryl Jennings (cherylj) wrote : | #16 |
At this point, we'll need the console log from the template container to figure out what's going on. Can you attach the contents of /var/lib/
| Alvaro Uría (aluria) wrote : | #17 |
Hello Cheryl,
I don't have that information now, but I will try to gather it on Monday.
Cheers,
-Alvaro.
| Jill Rouleau (jillrouleau) wrote : | #18 |
Cheryl,
We ran into this in a different environment, using the fastpath installer.
containers:
  0/lxc/0:
    failed to retrieve the template to clone: cannot determine cached image URL: cannot determine LXC image URL: cannot determine LXC image URL: failed to get https:/
    : exit status 1: cannot determine LXC image URL: failed to get https:/
    : exit status 1
    series: trusty
Requested logs are attached, I also saved off /var/log in case you need machine logs or anything else.
Thanks.
| Jill Rouleau (jillrouleau) wrote : | #19 |
| Jill Rouleau (jillrouleau) wrote : | #20 |
| Jill Rouleau (jillrouleau) wrote : | #21 |
| Cheryl Jennings (cherylj) wrote : | #22 |
Hi Jill, the problem you ran into in comment #18 is different from the original issue. From the logs you attached, I can see that the template container did stop and was able to be cloned. There may have been a temporary outage for cloud-images.
| Alvaro Uría (aluria) wrote : | #23 |
I think the issue in #18 was related to https:/
| Alvaro Uría (aluria) wrote : | #24 |
Hello Cheryl,
Please find attached /var/lib/
Find also attached "juju status" output just after this error.
Please let me know if you would need more information.
Kind regards,
-Alvaro.
| Alvaro Uría (aluria) wrote : | #25 |
juju status output after 0/lxc/0 error
| Cheryl Jennings (cherylj) wrote : | #26 |
Thanks, Alvaro. Taking a look now.
| Jill Rouleau (jillrouleau) wrote : | #27 |
Hi Cheryl,
I'm travelling this week so I have not, I'll see what we come up with in this other environment when I get back. Thanks!
| Cheryl Jennings (cherylj) wrote : | #28 |
I looked at the logs more, and things appear to be hung while doing an apt-get upgrade during cloud-init. The good news is that we don't see those lxc errors and warnings. The bad news is that we now need to figure out what cloud-init is doing, which requires pulling the cloud-init log located in the template container at /var/log/
I do have a few other questions to help piece together what may be going on:
1 - If you run lxc-ls --fancy on the machine hosting the lxc containers, do you see the container named juju-trusty-
2 - Are there any other issues on the machine which could stall progress, such as a full disk or networking problems?
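For reference, question 1 can be checked as below; the output columns are from lxc 1.x, and the container names and values are illustrative:
"""
$ lxc-ls --fancy
NAME                      STATE    IPV4       IPV6  AUTOSTART
juju-trusty-lxc-template  STOPPED  -          -     NO
juju-machine-0-lxc-0      RUNNING  10.0.3.21  -     NO
"""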
| Alvaro Uría (aluria) wrote : | #29 |
Hello Cheryl,
I would need to redeploy to get /var/log/
"""
Get:43 http://
Get:44 http://
"""
With regard to your questions:
1.- "lxc-ls --fancy" shows juju-trusty-
2.- I haven't seen any further issue on machine 0. With regard to networking, I haven't tested manually but lxc.conf as well as /e/n/interfaces files show a correct MTU setting. Besides, deploys with juju 1.20.14 have worked all 3 times tried while deploys with 1.22.6 have not worked any of the 8-9 times tried.
Cheers,
-Alvaro.
| Cheryl Jennings (cherylj) wrote : | #30 |
If you have access to a system where 1.20.14 successfully created the template, please send the contents of /var/lib/
Previously, it was mentioned that this bug was intermittent. Is it now the case that you can't deploy to a container at all with 1.22.6?
| Changed in juju-core: | |
| assignee: | nobody → Cheryl Jennings (cherylj) |
| Alvaro Uría (aluria) wrote : | #31 |
Hello Cheryl,
Please find attached /var/lib/
OTOH, juju 1.22.6 has failed every time in this specific environment: HA + debian-installer (running mdadm on top of the disks). Curtin can't be used, as SW RAID1 is not supported yet.
However, 1.22.6 has worked fine on a Staging HA environment using Curtin (and in another environment where we have used it as well... HA + curtin).
Please let me know if I can help you more.
Cheers,
-Alvaro.
| Cheryl Jennings (cherylj) wrote : | #32 |
I've asked smoser for some assistance with this bug.
| Jason Hobbs (jason-hobbs) wrote : | #33 |
We're still hitting this on 1.24.5.
| Cheryl Jennings (cherylj) wrote : | #34 |
Jason, if you're running into it again, could you grab the cloud-init logs off the container? They'll be in:
/var/log/
/var/log/
on the juju-trusty-
| Matt Rae (mattrae) wrote : | #35 |
In my case this issue appears to be related to the MTU set on the containers. I'm using juju 1.24.5.
When the physical network MTU is 1500, we need to decrease the instance MTU to account for the additional GRE header added when using a neutron network encapsulated with GRE.
We can change the default instance MTU to 1454 with 'juju set neutron-gateway instance-mtu=1454'.
This makes the OpenStack instance MTU 1454, but the juju-trusty-
Can we set the default MTU for containers created by juju?
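The number comes from the encapsulation overhead: GRE adds an outer IP header plus a GRE header (several dozen bytes), so the inner MTU must sit below 1500, and 1454 leaves comfortable headroom. As a spot check, the MTU can also be forced per container in its LXC config (illustrative path; juju's supported knob is in comment #37 below):
"""
# /var/lib/lxc/<container>/config (illustrative per-container override)
lxc.network.mtu = 1454
"""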
| Matt Rae (mattrae) wrote : | #37 |
Adding 'lxc-default-mtu: 1454' to my .juju/environme
I found 'lxc-default-mtu' in this bug: https:/
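For others landing here, that setting lives in the environment's stanza of the juju 1.x config; a minimal sketch with a placeholder environment name:
"""
environments:
  my-maas:
    type: maas
    # give every juju-created LXC NIC a GRE-safe MTU
    lxc-default-mtu: 1454
"""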
| Dimiter Naydenov (dimitern) wrote : | #38 |
Since Matt reports the issue no longer happens when lxc-default-mtu is set, should we close this?
| Scott Moser (smoser) wrote : | #39 |
Should the user be expected to set such a thing, or risk some arbitrary network-related failure?
It seems like *something* should be fixed here.
| Jason Hobbs (jason-hobbs) wrote : | #40 |
This should certainly work by default.
| Dimiter Naydenov (dimitern) wrote : | #41 |
Jason, are you saying juju should *by default* set the MTU for every NIC configured for each LXC container to 1454?
Sorry, but I disagree. That might work for this specific setup with GRE tunnels, but other stakeholders have different setups; we were even asked to set it to 9000 by default. Juju provides a setting that allows you to explicitly set the MTU for any LXC NICs, exactly for these cases. The original solution I implemented discovered the host NIC's MTU and used that for the corresponding LXC NIC. That was deemed too "magical" and not helpful in a lot of cases, especially with corosync in play. So now there's lxc-default-mtu instead.
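(The discarded discovery approach amounted to reading the host device's MTU and reusing it, roughly as below; the device name is an assumption, and juju no longer does this:)
"""
# Illustrative: mirror the host bridge's MTU onto the container NIC.
mtu=$(cat /sys/class/net/lxcbr0/mtu)
echo "lxc.network.mtu = ${mtu}"
"""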
| tags: | added: cisco landscape |
| Tom Haddon (mthaddon) wrote : | #42 |
Perhaps at a minimum we could have a better error message to explain what's failing and what the likely fix is.
| Antoni Segura Puimedon (celebdor) wrote : | #43 |
I reproduced this with juju 1.25.0 on a trusty machine deploying a precise lxc container:
2015-11-26 12:14:58 ERROR juju.provisione
2015-11-26 12:14:58 ERROR juju.provisioner provisioner_
2015-11-26 12:53:54 ERROR juju.provisione
2015-11-26 12:54:04 ERROR juju.provisione
2015-11-26 12:54:14 ERROR juju.provisione
2015-11-26 12:54:24 ERROR juju.provisione
2015-11-26 12:54:24 ERROR juju.provisioner provisioner_
Any idea on a workaround or fix?
| Antoni Segura Puimedon (celebdor) wrote : | #44 |
As requested above:
ubuntu@
==
Checking for running unattended-
acpid: exiting
TERM environment variable not set.
<4>init: tty4 main process (350) killed by TERM signal
<4>init: tty2 main process (367) killed by TERM signal
<4>init: tty3 main process (369) killed by TERM signal
Thu Nov 26 13:46:39 UTC 2015: shutting down for shutdown-unknown [up 6609s].
<4>init: cron main process (377) killed by TERM signal
<4>init: irqbalance main process (387) killed by TERM signal
<4>init: console main process (413) killed by TERM signal
<4>init: tty1 main process (417) killed by TERM signal
<4>init: hwclock-save main process (582) terminated with status 70
<4>init: plymouth-
* Stopping landscape-client daemon
...fail!
* Asking all remaining processes to terminate...
...done.
* All processes ended within 1 seconds....
...done.
initctl: Event failed
* Deactivating swap...
...fail!
mount: cannot mount block device LABEL=cloudimg-
* Will now halt
=======
ubuntu@
==
lxc-start 1448545270.029 DEBUG lxc_commands - commands.
lxc-start 1448545270.029 DEBUG lxc_commands - commands.
...
lxc-start 1448545270.061 DEBUG lxc_c...
| Antoni Segura Puimedon (celebdor) wrote : | #45 |
After digging the whole afternoon, I found out that the problem was a misconfiguration of the MAAS DHCP server: it only left space for 7 addresses. Once those were exhausted, all new machines/containers failed to deploy.
I recommend:
- using the MAAS API to detect subnet exhaustion and prevent launching machines/containers that can't be provisioned (a crude check is sketched below)
- detecting failures due to the lxc container networking not coming up, and reporting them meaningfully in the juju debug log and error message.
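A crude way to spot that kind of exhaustion from the MAAS host, assuming the stock isc-dhcp lease file location (an assumption; paths vary by MAAS version):
"""
#!/bin/sh
# Count lease records against the size of the configured DHCP range.
# Note: the lease file can contain stale duplicates, so this is an
# upper-bound sanity check, not an exact count.
grep -c '^lease ' /var/lib/maas/dhcp/dhcpd.leases
"""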
| Changed in juju-core: | |
| assignee: | Cheryl Jennings (cherylj) → nobody |
| description: | updated |
| description: | updated |
| Changed in juju-core: | |
| status: | Triaged → Invalid |
| tags: | removed: regression |
It looks like stderr from lxc-start was being dumped to "/var/lib/juju/containers/juju-*-lxc-template/container.log". With help from the CI team, we're now capturing these logs. Here is the offending log from the latest run, which shows some issues I'm looking into further.
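To capture the same detail proactively, lxc-start can also be pointed at a log file directly; the flags are from lxc 1.x, and the container name and file path are illustrative:
"""
# Start the container detached, logging at DEBUG to a chosen file.
lxc-start -d -n juju-trusty-lxc-template -l DEBUG -o /var/log/juju-template-lxc.log
"""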