Recurring lxc issue: failed to retrieve the template to clone

Bug #1488245 reported by Peter Sabaini
This bug affects 7 people
Affects: juju-core
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

When deploying OpenStack with juju-core 1.22.6, I get this from time to time:

   containers:
      0/lxc/0:
        agent-state-info: |-
          failed to retrieve the template to clone: cannot determine cached image URL: cannot determine LXC image URL: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt
          : exit status 1: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt
          : exit status 1
        instance-id: pending
        series: trusty
      0/lxc/1:
        agent-state: started
        agent-version: 1.22.6.1
        dns-name: 10.102.5.57
        instance-id: juju-machine-0-lxc-1
        series: trusty
        hardware: arch=amd64

That is, 0/lxc/0 can't be created while other containers can. This is not deterministic: on the same hardware, an hour earlier, all containers could be created. However, when it does fail, it always seems to fail on 0/lxc/0.

I've put some logs up at chinstrap:/home/sabaini/juju-failed-to-retrieve-template-logs/

Cheryl Jennings (cherylj) wrote :

It would make sense that this would always fail for the first container, as what's failing is the download of the image to use in creating the template container.

Other containers which are started will see that the template doesn't exist and try to create it. The error, specifically, comes from Juju using the ubuntu-cloudimg-query command to get the URL of the correct image to download. I wonder if there's some flakiness with cloud-images.ubuntu.com. I'll ask the CPC team about that.

It could be helpful to retry the ubuntu-cloudimg-query command if it fails, with some modest wait time.
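
For illustration, a retry wrapper around that lookup could look something like the sketch below (not Juju's actual code; it assumes a trusty/released/amd64 lookup and arbitrary retry parameters):

# Sketch only: retry the image URL lookup a few times with a modest pause.
retry_cloudimg_query() {
    local attempts=5 delay=10 url
    for i in $(seq 1 "$attempts"); do
        # ubuntu-cloudimg-query exits non-zero when the lookup fails,
        # e.g. when released-dl.current.txt cannot be fetched.
        if url=$(ubuntu-cloudimg-query trusty released amd64); then
            echo "$url"
            return 0
        fi
        echo "image URL lookup failed (attempt $i/$attempts); retrying in ${delay}s" >&2
        sleep "$delay"
    done
    return 1
}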

Curtis Hovey (sinzui)
tags: added: lxc
Changed in juju-core:
status: New → Incomplete
Peter Sabaini (peter-sabaini) wrote :

I've hit this bug repeatedly today. What can I do to get some traction on this?

Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired
Andreas Hasenack (ahasenack) wrote :

We got such a case reported today. It consistently fails with this error on 32-core nodes, but works on 8-core ones.

What logs would you like to have?

Changed in juju-core:
status: Expired → Confirmed
tags: added: kanban-cross-team
Changed in juju-core:
status: Confirmed → New
tags: removed: kanban-cross-team
tags: added: landscape
Andreas Hasenack (ahasenack) wrote :

Here is a machine-0.log file from a juju environment with debug enabled where this failed.

This was the bootstrap + deploy command:
juju bootstrap --to <hostname> --debug 2>&1 | tee bootstrap.log; juju deploy cs:trusty/ubuntu --to lxc:0; for n in $(seq 1 10); do juju add-unit ubuntu --to lxc:0; done

The first two LXC containers failed to come up (ubuntu/0 and ubuntu/1). You can see in the log that Juju tried to create the LXC template twice and failed; it finally worked on the third attempt:

First 0/lxc/0:
2016-05-17 21:19:45 INFO juju.provisioner.lxc lxc-broker.go:88 starting lxc container for machineId: 0/lxc/0
(...)
2016-05-17 21:19:45 INFO juju.container.lxc clonetemplate.go:32 wait for flock on juju-trusty-lxc-template
2016-05-17 21:19:45 INFO juju.container lock.go:50 acquire lock "juju-trusty-lxc-template", ensure clone exists
2016-05-17 21:19:45 INFO juju.container.lxc clonetemplate.go:77 template does not exist, creating
(...)

Now 0/lxc/1:
2016-05-17 21:20:06 INFO juju.provisioner.lxc lxc-broker.go:88 starting lxc container for machineId: 0/lxc/1
(...)
2016-05-17 21:20:06 INFO juju.container lock.go:66 release lock "juju-trusty-lxc-template"
2016-05-17 21:20:06 ERROR juju.provisioner.lxc lxc-broker.go:168 failed to start container: failed to retrieve the template to clone: cannot determine cached image URL: cannot determine LXC image URL: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt
: exit status 1: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt
: exit status 1
(...)
2016-05-17 21:20:06 INFO juju.container.lxc clonetemplate.go:32 wait for flock on juju-trusty-lxc-template
2016-05-17 21:20:06 INFO juju.container lock.go:50 acquire lock "juju-trusty-lxc-template", ensure clone exists
2016-05-17 21:20:06 INFO juju.container.lxc clonetemplate.go:77 template does not exist, creating
2016-05-17 21:20:06 DEBUG juju.service discovery.go:65 discovered init system "upstart" from series "trusty"
2016-05-17 21:20:16 INFO juju.container lock.go:66 release lock "juju-trusty-lxc-template"
2016-05-17 21:20:16 ERROR juju.provisioner.lxc lxc-broker.go:168 failed to start container: failed to retrieve the template to clone: cannot determine cached image URL: cannot determine LXC image URL: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt
: exit status 1: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt
: exit status 1
(...)

For 0/lxc/2 onwards it worked:
2016-05-17 21:20:17 INFO juju.provisioner.lxc lxc-broker.go:88 starting lxc container for machineId: 0/lxc/2
(...)
2016-05-17 21:20:17 INFO juju.container.lxc clonetemplate.go:32 wait for flock on juju-trusty-lxc-template
2016-05-17 21:20:17 INFO juju.container lock.go:50 acquire lock "juju-trusty-lxc-template", ensure clone exists
2016-05-17 21:20:17 INFO juju.container.lxc clonetemplate.go:77 template does not exist, creating
2016-05-17 21:20:17 DEBUG juju.service discovery.go:65 discovered init system "upstart" f...


Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
milestone: 2.0.0 → 2.0-beta8
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Cheryl Jennings (cherylj) wrote :

I haven't been able to reproduce this locally. I do wonder whether adding a sleep between the bootstrap and the deploy would affect this behavior?
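
For illustration, such a test could look like the following (a sketch based on the command Andreas posted above; the 60-second pause is arbitrary):

# Sketch: same command as above, with an arbitrary 60-second pause before the first deploy.
juju bootstrap --to <hostname> --debug 2>&1 | tee bootstrap.log; sleep 60; juju deploy cs:trusty/ubuntu --to lxc:0; for n in $(seq 1 10); do juju add-unit ubuntu --to lxc:0; done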

Chuck Wu (chwu) wrote :

Hi Cheryl,

I don't think adding a sleep between bootstrap and deploy would help. We were consistently getting this issue regardless of which installation method we used: autopilot, multi, a Juju bundle, or even juju deploy by hand.

The first two containers created always hit this condition. These machines/nodes had 32 cores and 128 GB of RAM.

Cheryl Jennings (cherylj) wrote :

Could you run a test like:

juju bootstrap --to <hostname> --debug 2>&1 | tee bootstrap.log; juju ssh 0 "wget https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt"; juju deploy cs:trusty/ubuntu --to lxc:0; for n in $(seq 1 10); do juju add-unit ubuntu --to lxc:0; done

to verify that the released-dl.current.txt file can be read before trying to create containers?

Chuck Wu (chwu) wrote :

Hi Cheryl,

We ran this exact same test last week, many times, and each time it replicated the issue consistently. However, today I haven't been able to replicate it; I tried on 3 different 32-core nodes. The only difference is that we have since resolved the network issue where IPv6 was used and no reverse DNS lookup was available (Bug #1582770). Could the same fix have resolved this issue as well?

Cheryl Jennings (cherylj) wrote :

I added a wget before creating the containers in the above test case. Did you previously try that in your test runs? Was the host able to access that URL before creating the containers?

Chuck Wu (chwu) wrote :

For yesterday's test, the wget calls were all successful.

For the prior tests from last week, I don't think we had the wget step in the test. Here is the bug ticket:

https://bugs.launchpad.net/landscape/+bug/1581707

Andreas Hasenack (ahasenack) wrote :

Hi Chuck,

I enabled IPv6 here in my network and registered some nodes with a local MAAS server.

I was able to reproduce the hostname-without-domain issue we had before (bug #1581707) when IPv6 was enabled in your network:

root@15-89:~# python -c "import socket; print socket.gethostbyaddr(socket.gethostname())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.herror: [Errno 1] Unknown host

root@15-89:~# python -c "import socket; print socket.getfqdn()"
15-89
root@15-89:~# hostname
15-89
root@15-89:~# hostname -f
15-89.maaspxe5

IP addresses on that 15-89 node (eth2 has a global IPv6 address):
$ for n in 0 1 2; do juju ssh 0 ip a show dev eth$n; done 2>/dev/null
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master juju-br0 state UP group default qlen 1000
    link/ether 52:54:00:ab:15:89 brd ff:ff:ff:ff:ff:ff
    inet 10.0.5.150/24 brd 10.0.5.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:fa:49:0f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fefa:490f/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:23:f7:09 brd ff:ff:ff:ff:ff:ff
    inet6 fdde:a571:33af:0:5054:ff:fe23:f709/64 scope global dynamic
       valid_lft forever preferred_lft forever
    inet6 2804:7f4:zzzz:xxxx:yyyy:ff:fe23:f709/64 scope global dynamic
       valid_lft 40992sec preferred_lft 40992sec
    inet6 fe80::5054:ff:fe23:f709/64 scope link
       valid_lft forever preferred_lft forever

I also got the incomplete DNS record in the MAAS server DNS:
# grep -E 15.89 -r /etc/bind
/etc/bind/maas/zone.5.0.10.in-addr.arpa:150.5.0.10.in-addr.arpa. IN PTR 15-89.maaspxe5.
/etc/bind/maas/zone.maaspxe5:15-89 IN A 10.0.5.150
/etc/bind/maas/zone.maaspxe5:15-89 IN AAAA 2804:7f4:zzzz:xxxx:yyyy:ff:fe23:f709

But I could deploy ubuntu in containers without hitting the problem from this bug:
$ juju status --format=tabular
[Services]
NAME STATUS EXPOSED CHARM
ubuntu unknown false cs:trusty/ubuntu-0

[Units]
ID WORKLOAD-STATE AGENT-STATE VERSION MACHINE PORTS PUBLIC-ADDRESS MESSAGE
ubuntu/0 unknown executing 1.25.5 0/lxc/0 10.0.5.11 (update-status)
ubuntu/1 unknown idle 1.25.5 0/lxc/1 10.0.5.14
ubuntu/2 unknown idle 1.25.5 0/lxc/2 10.0.5.17

[Machines]
ID STATE VERSION DNS INS-ID SERIES HARDWARE
0 started 1.25.5 15-89.maaspxe5 /MAAS/api/1.0/nodes/node-7665d2aa-2111-11e6-bce2-00163ed66ea1/ trusty arch=amd64 cpu-cores=4 mem=1024M availability-zone=default

However, this data point of yours, that what changed between then and now is the IPv6 network, is interesting.

How is or was the connectivity of that IPv6 subnet you had? Would the wget command Cheryl pasted ...


Changed in juju-core:
status: Triaged → Incomplete
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Larry Michel (lmic) wrote :

We have been seeing this quite a bit today. It's been causing a lot of build failures for the past few hours:

  '5':
    agent-state: started
    agent-version: 1.25.5
    containers:
      5/lxc/0:
        agent-state: error
        agent-state-info: 'failed to retrieve the template to clone: cannot determine
          cached image URL: cannot determine LXC image URL: cannot determine LXC image
          URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt

          : exit status 1: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt

          : exit status 1'
        instance-id: pending
        series: trusty
      5/lxc/1:
        agent-state: error
        agent-state-info: 'failed to retrieve the template to clone: cannot determine
          cached image URL: cannot determine LXC image URL: cannot determine LXC image
          URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt

          : exit status 1: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt

          : exit status 1'
        instance-id: pending
        series: trusty
    dns-name: hayward-16.oil
    hardware: arch=amd64 cpu-cores=8 mem=16384M tags=hardware-sm15k,hw-glance-sm15k,hw-production
      availability-zone=Production
    instance-id: /MAAS/api/1.0/nodes/node-a336d312-c4cd-11e3-824b-00163efc5068/
    series: trusty

tags: added: oil
Cheryl Jennings (cherylj) wrote :

Larry - in your case, did all the containers on machine 5 fail with that error? Or just some?

Larry Michel (lmic) wrote :

Cheryl, yes, all of the containers. Here's another example with the entire status:

environment: ci-oil-slave9
machines:
  '0':
    agent-state: started
    agent-version: 1.25.5
    dns-name: huffman-vm-06.oil
    hardware: arch=ppc64el cpu-cores=4 mem=8229M tags=hardware-ibm-power8-S822L,hw-production,blacklist-juno,blacklist-icehouse,blacklist-precise,blacklist-compute,vm-ppc64el
      availability-zone=Production
    instance-id: /MAAS/api/1.0/nodes/node-617fecf2-a0e5-11e4-9cac-00163eca07b6/
    series: trusty
    state-server-member-status: has-vote
  '1':
    agent-state: started
    agent-version: 1.25.5
    containers:
      1/lxc/0:
        agent-state: started
        agent-version: 1.25.5
        dns-name: 10.245.46.22
        hardware: arch=amd64
        instance-id: juju-machine-1-lxc-0
        series: trusty
      1/lxc/1:
        agent-state: started
        agent-version: 1.25.5
        dns-name: 10.245.47.165
        hardware: arch=amd64
        instance-id: juju-machine-1-lxc-1
        series: trusty
    dns-name: ekans.oil
    hardware: arch=amd64 cpu-cores=32 mem=65536M tags=hw-production,hardware-dell-poweredge-R720XD
      availability-zone=Production
    instance-id: /MAAS/api/1.0/nodes/node-1490c87a-c615-11e5-b7b2-00163eca07b6/
    series: trusty
  '2':
    agent-state: started
    agent-version: 1.25.5
    containers:
      2/lxc/0:
        agent-state: error
        agent-state-info: 'failed to retrieve the template to clone: cannot determine
          cached image URL: cannot determine LXC image URL: cannot determine LXC image
          URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt

          : exit status 1: cannot determine LXC image URL: failed to get https://cloud-images.ubuntu.com/query/trusty/server/released-dl.current.txt

          : exit status 1'
        instance-id: pending
        series: trusty
    dns-name: hayward-13.oil
    hardware: arch=amd64 cpu-cores=8 mem=16384M tags=hardware-sm15k,hw-glance-sm15k,hw-production
      availability-zone=Production
    instance-id: /MAAS/api/1.0/nodes/node-9e308d0e-c4cd-11e3-8102-00163efc5068/
    series: trusty
  '3':
    agent-state: started
    agent-version: 1.25.5
    containers:
      3/lxc/0:
        agent-state: started
        agent-version: 1.25.5
        dns-name: 10.245.43.102
        hardware: arch=ppc64el
        instance-id: juju-machine-3-lxc-0
        series: trusty
    dns-name: huffman-vm-01.oil
    hardware: arch=ppc64el cpu-cores=4 mem=8229M tags=hardware-ibm-power8-S822L,hw-production,blacklist-juno,blacklist-icehouse,blacklist-precise,blacklist-compute,vm-ppc64el
      availability-zone=Production
    instance-id: /MAAS/api/1.0/nodes/node-5c8f2d56-9c45-11e4-bf57-00163eca07b6/
    series: trusty
  '4':
    agent-state: started
    agent-version: 1.25.5
    containers:
      4/lxc/0:
        agent-state: started
        agent-version: 1.25.5
        dns-name: 10.245.33.145
        hardware: arch=ppc64el
        instance-id: juju-machine-4-lxc-0
        series: trusty
    dns-name: huffman-vm-03.oil
    hardware: arch=ppc64el cpu-cores=4 mem=8229M tags=hardware-ibm-power8-S822L,hw-production,blacklist-ju...

Larry Michel (lmic) wrote :

The previous example had only a single container. I was looking for one like machine 5, but only saw systems with a single container. Here's the entire status file for that earlier example: https://pastebin.canonical.com/159353/

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Larry Michel (lmic) wrote :

In our case, the issues completely went away after we reloaded an SM15k chassis. We also found a duplicate IP that was due to a MAAS bug, but I don't think it was the primary issue.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → none
Changed in juju-core:
status: Incomplete → Invalid
Florian Haas (fghaas) wrote :

This bug is definitely not invalid; it still exists in 1.25.8-trusty-amd64 and is an extraordinary pain.

As far as I can tell, what's happening here is that the target machine attempts to download the image, but the image cache on the bootstrap node never gets populated:

2016-12-01 09:25:46 DEBUG juju.container image.go:89 lxc image for trusty (amd64) is https://cloud-images.ubuntu.com/server/releases/trusty/release-20161109/ubuntu-14.04-server-cloudimg-amd64.tar.gz
2016-12-01 09:25:46 DEBUG juju.container.lxc lxc.go:403 creating lxc container "juju-trusty-lxc-template"
2016-12-01 09:25:46 DEBUG juju.container.lxc lxc.go:404 lxc-create template params: [--debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a1ee/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz]
2016-12-01 09:25:46 ERROR juju.container.lxc clonetemplate.go:129 lxc template container creation failed: lxc container creation failed: juju-trusty-lxc-template

Note that there is no gap in there at all; the lxc-create call fails immediately. This is a new machine; if the whole template tarball were actually being downloaded, it would take at least several seconds.

Trying to force container template re-creation on the target node ("rm /var/lib/juju/containers/juju-trusty-lxc-template -rf") doesn't do a thing either. After doing so and running "juju deploy <service> --to lxc:1", this is the entirety of the contents of that directory:

$ ls /var/lib/juju/containers/juju-trusty-lxc-template -la
total 16
drwxr-xr-x 2 root root 4096 Dec 1 09:46 .
drwxr-xr-x 4 root root 4096 Dec 1 09:46 ..
-rw-r--r-- 1 root root 1653 Dec 1 09:47 cloud-init
-rw-r--r-- 1 root root 111 Dec 1 09:47 lxc.conf

In other words, that template just never gets downloaded.
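
A quick way to narrow this down by hand (an illustrative check, not something Juju does itself; the URL is the template URL from the lxc-create parameters above, and /var/cache/lxc/cloud-trusty is the cache directory the ubuntu-cloud template uses, as the trace below shows):

# Is the controller actually serving the cached image? The controller uses a
# self-signed certificate, so skip verification for this reachability test only.
wget --no-check-certificate -O /dev/null https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a1ee/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz

# Has anything landed in the local LXC image cache on the target machine?
ls -l /var/cache/lxc/cloud-trusty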

Now, digging in a bit deeper and running the lxc-create command manually:

sudo lxc-create -P /var/lib/juju/containers -n test -t ubuntu-cloud -- --debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a1ee/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
+ '[' amd64 = i686 ']'
+ '[' 0 = 0 ']'
+ case "$hostarch:$arch" in
+ :
+ '[' tryreleased '!=' daily -a tryreleased '!=' released -a tryreleased '!=' tryreleased ']'
+ '[' -z /var/lib/juju/containers/test ']'
++ id -u
+ '[' 0 '!=' 0 ']'
+ config=/var/lib/juju/containers/test/config
+ '[' -z /var/lib/juju/containers/test/rootfs ']'
+ type ubuntu-cloudimg-query
ubuntu-cloudimg-query is /usr/bin/ubuntu-cloudimg-query
+ type wget
wget is /usr/bin/wget
+ cache=/var/cache/lxc/cloud-trusty
+ '[' 0 -eq 1 ']'
+ mkdir -p /var/cache/lxc/cloud-trusty
+ '[' tryreleased = tryreleased ']'
+ stream=released
+ ubuntu-cloudimg-query trusty released amd64
+ '[' -n https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a1ee/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz ']'
+ url2=https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a...


Richard Harding (rharding) wrote :

Thank you for the added information, Florian. From your information it looks like there is indeed some sort of race condition, possibly between requesting that we retrieve the image and when we attempt to use it.

We'll give this another look and see if we can chase it down with your added details.

For the SSL issue, my assumption is that when running the command manually, the self-signed cert that the controller pre-shares with the unit wasn't loaded into the shell environment, so wget couldn't use it. I'll have to see whether running that same thing in a hook context, via something like "juju run", would be able to communicate with the controller API over SSL properly or not.
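
For reference, that comparison might look something like this (a sketch only; the machine number and URL come from Florian's lxc-create parameters above, and the unit name is just a placeholder):

# Plain shell over SSH on the affected host:
juju ssh 1 "wget -O /dev/null https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a1ee/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz"

# The same fetch run in a hook context via juju run (placeholder unit name):
juju run --unit ubuntu/0 "wget -O /dev/null https://192.168.122.100:17070/environment/23487fc9-6e44-49a0-8ddf-8472ede8a1ee/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz"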

Florian Haas (fghaas) wrote :

@rharding, could you consider undoing the "Invalid" status for this issue? It persists as of today.

Changed in juju-core:
status: Invalid → Triaged
assignee: nobody → Richard Harding (rharding)
milestone: none → 1.25.10
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.10 → none
Changed in juju-core:
milestone: none → 1.25.11
Changed in juju-core:
assignee: Richard Harding (rharding) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.11 → none
Florian Haas (fghaas) wrote :

So this bug is Critical, has no milestone, and no one is assigned to fix it? Considering that Ubuntu OpenStack users who deployed on Trusty are still stuck on Juju 1.x, with no upgrade path beyond Mitaka, that is just terrific.

Tim Penhey (thumper) wrote :

I'm just being honest here: we aren't going to fix this in the 1.25 series.

Changed in juju-core:
importance: Critical → Undecided
status: Triaged → Won't Fix