LXD containers fail to upgrade because the bridge config changes to a different IP address

Bug #1569361 reported by Andrew McDermott
Affects         Status        Importance  Assigned to       Milestone
Canonical Juju  Fix Released  Critical    Andrew McDermott

Bug Description

It's not possible to run:

 $ juju upgrade-juju --upload-tools

when a container has been deployed and have the container upgraded in line with the version running on its host.

The steps are:

 0) bootstrap (on MAAS)
 1) juju add-machine lxd:0
 2) juju upgrade-juju --upload-tools

After the upgrade you see a mismatch between the version on the container's host and the version on the container itself:

model: admin
machines:
  "0":
    juju-status:
      current: started
      since: 12 Apr 2016 14:02:18+01:00
      version: 2.0-beta4.2
    dns-name: 10.17.20.201
    instance-id: /MAAS/api/1.0/nodes/node-0d05c466-ffef-11e5-a124-52540098ca47/
    machine-status:
      current: running
      message: Deployed
      since: 12 Apr 2016 13:47:05+01:00
    series: xenial
    containers:
      0/lxd/0:
        juju-status:
          current: started
          since: 12 Apr 2016 13:52:48+01:00
          version: 2.0-beta4.1
        dns-name: 10.17.20.202
        instance-id: juju-machine-0-lxd-0
        machine-status:
          current: running
          message: Container started
          since: 12 Apr 2016 13:51:27+01:00
        series: xenial
    hardware: arch=amd64 cpu-cores=1 mem=1024M availability-zone=default
    controller-member-status: has-vote
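
The mismatch above can also be detected mechanically. The following is a minimal sketch, assuming the status output has already been parsed into a dictionary with the same keys as shown above; the helper name version_mismatches is hypothetical and not part of Juju.

```python
# Hypothetical helper (not part of Juju): find containers whose agent
# version differs from the version of the machine that hosts them.

def version_mismatches(status):
    """Return (container_id, container_version, host_version) tuples."""
    mismatches = []
    for machine in status.get("machines", {}).values():
        host_version = machine.get("juju-status", {}).get("version")
        for cid, container in machine.get("containers", {}).items():
            version = container.get("juju-status", {}).get("version")
            if version != host_version:
                mismatches.append((cid, version, host_version))
    return mismatches


# Trimmed copy of the status from this report, as an already-parsed dict.
status = {
    "machines": {
        "0": {
            "juju-status": {"version": "2.0-beta4.2"},
            "containers": {
                "0/lxd/0": {"juju-status": {"version": "2.0-beta4.1"}},
            },
        },
    },
}

print(version_mismatches(status))
```

An empty list would mean every container runs the same agent version as its host.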

This happens because the container cannot reach the API server, and that in turn happens because (*I think*) `lxd init' runs again and changes the setup of /etc/default/lxd-bridge.

The 1/etc/... and 2/etc... paths refer to the steps above (i.e., 1/etc/... is a capture after add-machine and 2/etc/... is after the upgrade step).

--- 1/etc/default/lxd-bridge 2016-04-12 12:50:21.000000000 +0000
+++ 2/etc/default/lxd-bridge 2016-04-12 13:02:49.000000000 +0000
@@ -20,16 +20,16 @@

 # IPv4
 ## IPv4 address (e.g. 10.0.8.1)
-LXD_IPV4_ADDR="10.0.1.1"
+LXD_IPV4_ADDR="10.0.5.1"

 ## IPv4 netmask (e.g. 255.255.255.0)
 LXD_IPV4_NETMASK="255.255.255.0"

 ## IPv4 network (e.g. 10.0.8.0/24)
-LXD_IPV4_NETWORK="10.0.1.1/24"
+LXD_IPV4_NETWORK="10.0.5.1/24"

 ## IPv4 DHCP range (e.g. 10.0.8.2,10.0.8.254)
-LXD_IPV4_DHCP_RANGE="10.0.1.2,10.0.1.254"
+LXD_IPV4_DHCP_RANGE="10.0.5.2,10.0.5.254"

 ## IPv4 DHCP number of hosts (e.g. 250)
 LXD_IPV4_DHCP_MAX="253"
@@ -53,4 +53,8 @@
 # Run a minimal HTTP PROXY server
 LXD_IPV6_PROXY="false"

-EXISTING_BRIDGE=
+EXISTING_BRIDGE=""
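
For comparing two captures of /etc/default/lxd-bridge without the noise of quoting-only changes (such as the EXISTING_BRIDGE line above), a small script can help. This is a sketch under the assumption that the file contains only simple KEY=value shell assignments; parse_defaults and changed_settings are hypothetical names.

```python
import shlex


def parse_defaults(text):
    """Parse KEY=value lines from a shell-style defaults file."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex.split removes shell quoting; an empty value yields [].
        parts = shlex.split(raw)
        settings[key] = parts[0] if parts else ""
    return settings


def changed_settings(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    old, new = parse_defaults(before), parse_defaults(after)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# Excerpts of the two captures from this report.
before = 'LXD_IPV4_ADDR="10.0.1.1"\nLXD_IPV4_NETMASK="255.255.255.0"\nEXISTING_BRIDGE=\n'
after = 'LXD_IPV4_ADDR="10.0.5.1"\nLXD_IPV4_NETMASK="255.255.255.0"\nEXISTING_BRIDGE=""\n'

print(changed_settings(before, after))
```

Because both an unset value and an empty quoted string parse to the empty string, only the address change is reported for this excerpt.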

There is also a difference with the apiaddresses between steps (1) and (2):

ubuntu@node1:~$ diff -u {1,2}/var/lib/juju/agents/machine-0/agent.conf
--- 1/var/lib/juju/agents/machine-0/agent.conf 2016-04-12 12:47:07.000000000 +0000
+++ 2/var/lib/juju/agents/machine-0/agent.conf 2016-04-12 13:02:17.000000000 +0000
@@ -7,7 +7,7 @@
 jobs:
 - JobManageModel
 - JobHostUnits
-upgradedToVersion: 2.0-beta4.1
+upgradedToVersion: 2.0-beta4.2
 cacert: |
   -----BEGIN CERTIFICATE-----
   MIICxTCCAi6gAwIBAgIVANLBvgivEAbjxYl0FJD1BB1tbut4MA0GCSqGSIb3DQEB
@@ -31,6 +31,7 @@
 statepassword: kerSShV9Tkg4vkY/WH5GcIzI
 model: model-2d4e133e-f3e4-4401-8303-e2a4c6b51c10
 apiaddresses:
+- 10.0.4.1:17070
 - 10.17.20.201:17070
 apipassword: kerSShV9Tkg4vkY/WH5GcIzI
 oldpassword: c7f003d4ec1c6f3e743cdd829e3149eb

But NOT between steps (0) and (1):

diff -u {0,1}/var/lib/juju/agents/machine-0/agent.conf
<empty result>

Looking at the container's log, machine-0-lxd-0.log:

You see the upgrade trying to make progress:

2016-04-12 13:02:52 INFO juju.worker.upgrader upgrader.go:178 desired tool version: 2.0-beta4.2
2016-04-12 13:02:52 INFO juju.worker.upgrader upgrader.go:199 upgrade requested from 2.0-beta4.1 to 2.0-beta4.2
2016-04-12 13:02:52 INFO juju.worker.upgrader upgrader.go:251 fetching tools from "https://10.0.4.1:17070/model/2d4e133e-f3e4-4401-8303-e2a4c6b51c10/tools/2.0-beta4.2-xenial-amd64"

But (from the container) there is nothing listening at this address:

root@juju-machine-0-lxd-0:/var/log/juju# telnet 10.0.4.1 17070
Trying 10.0.4.1...

And back on the container's host the bridge is now 10.0.5.1:

$ ifconfig -a lxdbr0
lxdbr0 Link encap:Ethernet HWaddr 62:95:97:93:bc:3e
          inet addr:10.0.5.1 Bcast:0.0.0.0 Mask:255.255.255.0
          inet6 addr: fe80::6095:97ff:fe93:bc3e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:570 (570.0 B)

Perhaps the upgrade step should try on all API addresses:

$ sudo cat /var/lib/juju/agents/machine-0/agent.conf |grep -A 2 apiaddresses
apiaddresses:
- 10.0.4.1:17070
- 10.17.20.201:17070
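
The idea of trying every API address could look roughly like the sketch below. This is not Juju's upgrader code; it only illustrates the fallback, assuming each entry has the "host:port" form seen in agent.conf. The function name first_reachable is hypothetical.

```python
import socket


def first_reachable(addresses, timeout=2.0):
    """Return the first "host:port" entry that accepts a TCP connection.

    Returns None when no address can be reached within the timeout.
    """
    for address in addresses:
        host, _, port = address.rpartition(":")
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, int(port)), timeout=timeout):
                return address
        except OSError:
            continue
    return None


# Example call with the addresses from agent.conf in this report:
# first_reachable(["10.0.4.1:17070", "10.17.20.201:17070"])
```

With such a fallback, an unreachable first entry would cost one timeout rather than blocking the upgrade indefinitely.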

In capturing the output for this bug, it's not entirely clear to me how the LXD bridge went from 10.0.1.1 to 10.0.5.1 while the only bridge address recorded in agent.conf is 10.0.4.1. Perhaps it is related to the `lxd init' step, which is triggered every time the machine agent (MA) is started.

The net effect is that the container's upgrade step never completes.
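
The address inconsistency described above can be checked with a few lines of Python. This is a sketch that assumes the bridge address and netmask are taken from /etc/default/lxd-bridge and the API address from agent.conf; on_bridge_network is a hypothetical helper name.

```python
import ipaddress


def on_bridge_network(api_address, bridge_addr, netmask):
    """Report whether a "host:port" API address lies on the bridge's subnet."""
    host = api_address.rpartition(":")[0]
    # strict=False lets us pass the bridge's own address instead of the
    # network address (10.0.5.1 rather than 10.0.5.0).
    network = ipaddress.ip_network(f"{bridge_addr}/{netmask}", strict=False)
    return ipaddress.ip_address(host) in network


# The recorded API address matches neither bridge configuration.
print(on_bridge_network("10.0.4.1:17070", "10.0.1.1", "255.255.255.0"))
print(on_bridge_network("10.0.4.1:17070", "10.0.5.1", "255.255.255.0"))
```

Both calls print False for the values in this report, which is consistent with the container being unable to connect to 10.0.4.1.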

Tags: network
summary: - LXD containers fail to upgrade because the bridge config changes on each
- upgrade step
+ LXD containers fail to upgrade because the bridge config changes to a
+ different IP address
tags: added: network
Andrew McDermott (frobware) wrote :
Andrew McDermott (frobware) wrote :
Andrew McDermott (frobware) wrote :
Changed in juju-core:
status: New → Triaged
milestone: none → 2.0-rc1
importance: Undecided → Critical
Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
Andrew McDermott (frobware) wrote :
Andrew McDermott (frobware) wrote :

With the PR in #4 I now see the LXD container get upgraded:

$ juju show-machines
model: admin
machines:
  "0":
    juju-status:
      current: started
      since: 12 Apr 2016 16:40:35+01:00
      version: 2.0-beta4.2
    dns-name: 10.17.20.201
    instance-id: /MAAS/api/1.0/nodes/node-0d05c466-ffef-11e5-a124-52540098ca47/
    machine-status:
      current: running
      message: Deployed
      since: 12 Apr 2016 16:32:47+01:00
    series: trusty
    containers:
      0/lxd/0:
        juju-status:
          current: started
          since: 12 Apr 2016 16:40:57+01:00
          version: 2.0-beta4.2
        dns-name: 10.17.20.202
        instance-id: juju-machine-0-lxd-0
        machine-status:
          current: running
          message: Container started
          since: 12 Apr 2016 16:38:28+01:00
        series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=1024M availability-zone=default
    controller-member-status: has-vote

Changed in juju-core:
status: Triaged → Fix Committed
Andrew McDermott (frobware) wrote :

Reopening because, following reviews by Tycho/John, it turns out the function was doing exactly what it was supposed to.

Changed in juju-core:
status: Fix Committed → In Progress
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta6 → 2.0-beta7
Andrew McDermott (frobware) wrote :
Andrew McDermott (frobware) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta8 → none
milestone: none → 2.0-beta8