wget cert issues causing failure to create containers

Bug #1512782 reported by Mike McCracken on 2015-11-03
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
juju
High
Unassigned
juju-core
Undecided
Unassigned
1.25
Undecided
Unassigned

Bug Description

This affects both 1.25 and 1.24, tested in wily

For 1.25, some details with logs are here:
https://gist.github.com/mikemccracken/883f91e34c3eb8ee4b47

We are also still seeing this in 1.24, as shown here: http://paste.ubuntu.com/13087888/

Cheryl Jennings (cherylj) wrote :

This could be a regression, causing bug 1472014 to appear again.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Cheryl Jennings (cherylj) wrote :

I looked through the 1.25 details, and I think the wget cert issues may be a red herring. When juju creates containers, it creates a wrapped wget command that specifies cert details, sticks it in tmp and puts it in the path for the lxc-create command. You'll see something like this in the output:

+ type wget
wget is /tmp/wget531732860/wget

I see in the output from your test lxc-create the following:
+ type wget
wget is /usr/bin/wget

Whatever the tmp dir is that has the wrapped wget is not in your path, and wouldn't be picked up.

If you can recreate, can you set the logging level to TRACE (juju set-env logging-config="<root>=TRACE") and attach the following:
- machine logs for state server and the machine hosting lxc containers
- the contents of /var/lib/juju/containers/* on the hosting machine

Mike McCracken (mikemc) wrote :

I was able to recreate this last night with trace on, here's a relevant snippet from the all-machines.log:

 machine-1[3840]: 2015-11-04 06:46:18 DEBUG juju.container.lxc lxc.go:426 lxc-create template params: [--debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://10.0.3.119:17070/environment/dacb71e7-b0f4-4a47-88e9-e18216a2ddfd/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz]
 machine-1[3840]: 2015-11-04 06:46:18 TRACE golxc.run.lxc-create golxc.go:448 run: lxc-create [-n juju-trusty-lxc-template -t ubuntu-cloud -f /var/lib/juju/containers/juju-trusty-lxc-template/lxc.conf -- --debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://10.0.3.119:17070/environment/dacb71e7-b0f4-4a47-88e9-e18216a2ddfd/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz]
 machine-1[3840]: 2015-11-04 06:46:19 TRACE golxc.run.lxc-create golxc.go:458 run failed output: + '[' amd64 = i686 ']'
 machine-1[3840]: 2015-11-04 06:46:19 ERROR juju.container.lxc clonetemplate.go:129 lxc template container creation failed: lxc container creation failed: juju-trusty-lxc-template
 machine-1[3840]: 2015-11-04 06:46:19 INFO juju.container lock.go:66 release lock "juju-trusty-lxc-template"
 machine-1[3840]: 2015-11-04 06:46:19 ERROR juju.provisioner.lxc lxc-broker.go:168 failed to start container: failed to retrieve the template to clone: lxc container creation failed: juju-trusty-lxc-template
 machine-1[3840]: 2015-11-04 06:46:19 INFO juju.provisioner provisioner_task.go:711 retryable error received on start instance - retrying instance creation 3 times with a 10s delay
 machine-0: 2015-11-04 06:46:18 DEBUG juju.storage managedstorage.go:293 managed resource entry created with path "environs/dacb71e7-b0f4-4a47-88e9-e18216a2ddfd/charms/cs:trusty/ntp-11-d3bdedc0-ccc9-4fc7-8c29-3c5ded577afd" -> "d842a60653adca970ade502e62fd3c7cd27c6d02902aee4f5f6b1a9d748abc7573d8df66605cf4c4fcd1fb9e835380a4"

it is odd that that the only output from the lxc-create command is just the first line. it shouldn't fail between that line and the next thing that'd be printed in that script.

Unfortunately the machine I reproduced it on had a power failure last night and the environment did not come back up cleanly, so I had to try to repro it again, and so far it is failing to exhibit the same behavior.

Mike McCracken (mikemc) wrote :
Download full text (4.9 KiB)

OK, I reproduced the error again. Here's some further info:

https://gist.github.com/mikemccracken/883f91e34c3eb8ee4b47#file-reproduced-with-further-info-md

In the middle of the machine-1.log, you see this:

2015-11-04 22:48:33 DEBUG juju.container.lxc lxc.go:425 creating lxc container "juju-trusty-lxc-template"
2015-11-04 22:48:33 DEBUG juju.container.lxc lxc.go:426 lxc-create template params: [--debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz]
2015-11-04 22:48:33 TRACE golxc.run.lxc-create golxc.go:448 run: lxc-create [-n juju-trusty-lxc-template -t ubuntu-cloud -f /var/lib/juju/containers/juju-trusty-lxc-template/lxc.conf -- --debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz]
2015-11-04 22:48:33 TRACE golxc.run.lxc-create golxc.go:458 run failed output: + '[' amd64 = i686 ']'
+ '[' 0 = 0 ']'
+ case "$hostarch:$arch" in
+ :
+ '[' tryreleased '!=' daily -a tryreleased '!=' released -a tryreleased '!=' tryreleased ']'
+ '[' -z /var/lib/lxc/juju-trusty-lxc-template ']'
++ id -u
+ '[' 0 '!=' 0 ']'
+ config=/var/lib/lxc/juju-trusty-lxc-template/config
+ '[' -z /var/lib/lxc/juju-trusty-lxc-template/rootfs ']'
+ type ubuntu-cloudimg-query
ubuntu-cloudimg-query is /usr/bin/ubuntu-cloudimg-query
+ type wget
wget is /tmp/wget992629130/wget
+ cache=/var/cache/lxc/cloud-trusty
+ '[' 0 -eq 1 ']'
+ mkdir -p /var/cache/lxc/cloud-trusty
+ '[' tryreleased = tryreleased ']'
+ stream=released
+ ubuntu-cloudimg-query trusty released amd64
+ '[' -n https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz ']'
+ url2=https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
++ basename https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
+ filename=ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
+ '[' -n https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz ']'
+ do_extract_rootfs
+ cd /var/cache/lxc/cloud-trusty
+ '[' 0 -eq 1 ']'
+ trap wgetcleanup EXIT SIGHUP SIGINT SIGTERM
+ '[' '!' -f ubuntu-14.04-server-cloudimg-amd64-root.tar.gz ']'
+ wget https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
--2015-11-04 22:48:33-- https://10.0.3.45:17070/environment/6337473f-b80a-4ed3-8a5b-9a789e3b25f9/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
Connecting to 10.0.3.45:17070... connected.
ERROR: no certificate subject alternative name matches
 requested host name '10.0.3.45'.
...

Read more...

Cheryl Jennings (cherylj) wrote :

Ah, so it is a wget / cert issue. Bummer. Can you send in the log for machine-0?

Cheryl Jennings (cherylj) wrote :

Also, I have a couple questions for your about your environment:
- Are you using the local provider?
- Are you using nested containers? It's unusual to see containers get something other than 10.0.3. *, and I see that yours picked up 10.0.6.*
- What does the network look like on your host?
- Is this just surfacing with wily as a host?

Mike McCracken (mikemc) wrote :
Download full text (4.0 KiB)

attached machine-0.log as a gzipped file, it was 40MB.

as to your other questions:

This is using the local provider. Here's the environments.yaml:

    default: local

    environments:
      local:
        type: local
        container: kvm
        lxc-clone: true
        authorized-keys-path: ~/.ssh/id_rsa.pub
        network-bridge: lxcbr0
        default-series: trusty
        admin-secret: "ubuntu"
        no-proxy: localhost

It is using nested containers, yes. There's a top level container,
inside which we install juju with the above env.yaml, and deploy from in there.
So the containers which juju creates are nested one level deep.

Note that the machines created by juju in earlier logs had 10.0.6.x,
but the latest one has 10.0.7.x.

We explicitly set lxcbr0 to 10.0.N.x where N is 6 or
greater, to avoid clashing with other installs when we add a static
route to the container network from the host. We manually set the lxc-net
config in the container. In this case it's the following:

    ubuntu@openstack-single-ubuntu:~$ cat /etc/default/lxc-net
    # Custom lxc-net for Single Installer
    USE_LXC_BRIDGE="true"
    LXC_BRIDGE="lxcbr0"
    LXC_ADDR=10.0.7.1
    LXC_NETMASK=255.255.255.0
    LXC_NETWORK=10.0.7.0/24
    LXC_DHCP_RANGE="10.0.7.2,10.0.7.254"
    LXC_DHCP_MAX="196"
    #LXC_DHCP_CONFILE=/etc/lxc/dnsmasq.conf
    #LXC_DOMAIN="lxc"

Here's some further network info. 'ancient-spot' is the top level
host. The top level container is called 'openstack-single-ubuntu'.

    ubuntu@ancient-spot:~⟫ sudo ip addr show
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        link/ether ec:a8:6b:fb:3a:5e brd ff:ff:ff:ff:ff:ff
        inet 172.16.0.66/24 brd 172.16.0.255 scope global eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::eea8:6bff:fefb:3a5e/64 scope link
           valid_lft forever preferred_lft forever
    3: lxcbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
        link/ether fe:bb:ba:5e:ec:13 brd ff:ff:ff:ff:ff:ff
        inet 10.0.3.1/24 scope global lxcbr0
           valid_lft forever preferred_lft forever
        inet6 fe80::f897:dff:fe56:8118/64 scope link
           valid_lft forever preferred_lft forever
    6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
        link/ether 52:54:00:cf:09:a4 brd ff:ff:ff:ff:ff:ff
        inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
           valid_lft forever preferred_lft forever
    7: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 500
        link/ether 52:54:00:cf:09:a4 brd ff:ff:ff:ff:ff:ff
    17: vethPE50B4@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master lxcbr0 state UP group default qle...

Read more...

Cheryl Jennings (cherylj) wrote :

I see what the problem is now. For local instances, when queried for the addresses by the certupdater, we only ever return "localhost" and the bridge address:

                addrs := []network.Address{{
                        Scope: network.ScopePublic,
                        Type: network.HostName,
                        Value: "localhost",
                }, {
                        Scope: network.ScopeCloudLocal,
                        Type: network.IPv4Address,
                        Value: inst.env.bridgeAddress,
                }}

But, we record all of the addresses on the host machine as possible API endpoints. So, a machine could come up and connect to the API through whatever the host machine has as eth0, and that wouldn't be one of the certificate addresses.

You are running into this, where most don't see it, because for most people, the lxc bridge IP is lexicographically less than eth0. juju will sort IPs for a machine before storing them, and machines will go in that order when attempting to connect to the API.

In your case, the eth0 address for the host / bootstrap node will always appear first in the APIHosts list, and therefore will be chosen as the address machine1 uses to connect to the state server, even though it wasn't used when generating the certificate.

I need to work with some folks to determine how we should fix this. We could either generate the certificate for all the IPs we're going to list as APIHosts, or restrict what we list as APIHosts for the local provider.

Cheryl Jennings (cherylj) wrote :

Talked with William this morning and I'll be updating the certupdater to update the cert for all of the APIHostPorts. This will also require a change for the certupdater to watch for changes to the APIHostPorts, rather than just address changes for the machine.

Cheryl Jennings (cherylj) wrote :

Turns out the fix may be more complicated than just now watching APIHostPorts. There could be racy conditions as the certupdater worker runs on each state server in an HA environment. This may need to be made a singular worker. Still determining best fix.

For immediate relief, you can modify the lxc bridge for the host to be "higher" than the lxc bridge for the kvm containers running the local provider. i.e. using 10.1.3.0/24 instead of 10.0.3.0/24.

Adam Stokes (adam-stokes) wrote :

Attempted the workaround in #10, same problem still exists:

https://gist.github.com/battlemidget/67e2a76ee227646f6f7a

affects: juju-core → juju
Changed in juju-core:
status: New → Won't Fix
Changed in juju:
milestone: none → 2.1.0
Anastasia (anastasia-macmood) wrote :

There has been some changes on how we treat addresses and how we sort them referred to bug # 1473069.
I will mark this as a duplicate.

@Adam,

If you are still seeing a problem unrelated to the above bug, please file a separate bug with newer information. This original report is fairly outdated.

Thank you for your patience :D

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers