juju agent using lxcbr0 address as apiaddress instead of juju-br0 breaks agents

Bug #1416928 reported by Fabricio Costi
This bug affects 7 people
Affects    Status        Importance  Assigned to        Milestone
juju-core  Fix Released  High        James Tunnicliffe
1.21       Fix Released  Critical    Dimiter Naydenov
1.22       Fix Released  Critical    James Tunnicliffe

Bug Description

juju-core 1.21.1

node 0: bootstrap, lxc/0-2
    - juju-br0 10.10.18.51 (eth0)
    - lxcbr0 10.0.3.1
    - eth1 in promisc mode
node 1: lxc/0-6
    - juju-br0 10.10.18.52 (eth0)
    - lxcbr0 10.0.3.1
    - virbr0 192.168.122.1
    - eth1 in promisc mode
node 2: metal only
    - juju-br0 10.10.18.53 (eth0)
    - virbr0 192.168.122.1
    - eth1 in promisc mode

- on node 0, the physical machine, the lxc machines, and all unit agents have the IP of node 0's lxcbr0 bridge assigned to apiaddress in agent.conf. They correctly report their state.

- on node 1, the physical machine, the lxc machines, and all unit agents have the node 0 lxcbr0 IP assigned to apiaddress in agent.conf. They can't reach WSS.

- on node 2, the physical machine agent has the node 0 lxcbr0 IP assigned to apiaddress in agent.conf. It can't reach WSS.

I am assuming all agents are getting the lxcbr0 IP from node 0, as node 2 does not have an lxcbr0 and yet has the 10.0.3.1 IP assigned to its apiaddress in agent.conf.

Manually changing apiaddress on all agents to the juju-br0 IP of node 0, then stopping and starting all the agents, temporarily solved the issue.

After a reboot, the same problem happens, as agent.conf is overwritten by juju on all machines, lxc containers, and units.
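
For reference, the workaround above amounts to roughly the following on each affected machine (a hedged sketch only: the address is node 0's juju-br0 IP from this report, and the jujud service name is an example that varies per machine and unit):

# Point every agent on this machine at node 0's juju-br0 address (10.10.18.51 here)
sudo sh -c 'sed -i "s/10\.0\.3\.1:17070/10.10.18.51:17070/" /var/lib/juju/agents/*/agent.conf'
# Then restart each agent; the service name below is an example (one per machine/unit)
sudo service jujud-machine-1 restart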

To reproduce with 1.21 in a cloud:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - REAL_ETH0_ADDRESS:17070
juju ssh 0 "sudo apt-get install -y lxc"
juju ssh 0 "sudo reboot -n"
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.0.3.1:17070

Or with the manual provider, with lxc already installed on the target state-server:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.0.3.1:17070

Tags: api lxc network
description: updated
summary: - juju agent using lxcbr0 address as apiaddress instead of juju-br0 after
- reboot
+ juju agent using lxcbr0 address as apiaddress instead of juju-br0 breaks
+ agents
description: updated
Curtis Hovey (sinzui)
tags: added: lxc network
tags: added: api
Revision history for this message
Paul Gear (paulgear) wrote :

I'm seeing this error on new deploys on juju 1.21.1-trusty-amd64 from the stable PPA. I've attached an extract of the (redacted) machine log showing it changing the API addresses from the correct value (192.168.99.0) to the wrong value (10.0.3.1).

Revision history for this message
Paul Gear (paulgear) wrote :

(Apologies for the repeated comment above - no idea what happened there.)

There are corresponding messages on machine 0, where it shows:

2015-02-03 01:17:36 INFO juju.worker.machiner machiner.go:92 setting addresses for machine-0 to ["local-machine:127.0.0.1" "local-cloud:192.168.99.0" "local-cloud:10.0.3.1" "local-machine:::1"]
...
2015-02-03 01:17:36 DEBUG juju.apiserver apiserver.go:156 <- [6D] machine-0 {"RequestId":21,"Type":"Machiner","Request":"SetMachineAddresses","Params":{"MachineAddresses":[{"Tag":"machine-0","Addresses":[{"Value":"127.0.0.1","Type":"ipv4","NetworkName":"","Scope":"local-machine"},{"Value":"192.168.99.0","Type":"ipv4","NetworkName":"","Scope":"local-cloud"},{"Value":"10.0.3.1","Type":"ipv4","NetworkName":"","Scope":"local-cloud"},{"Value":"::1","Type":"ipv6","NetworkName":"","Scope":"local-machine"}]}]}}

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.23
Revision history for this message
Fabricio Costi (fabricio-9) wrote :

This happens only AFTER the bootstrap node is rebooted for the first time.

On a new env I was able to reboot all other nodes without a problem - apiaddress still points to the right IP of bootstrap node - provided the bootstrap node hadn't been rebooted after the first lxc was deployed.

By inspecting the agent.conf from LXC running on the bootstrap node I could see the apiaddress is pointing to the right ip BEFORE reboot.

AFTER the first reboot of bootstrap node, the apiaddress was changed to the lxcbr0 ip. Subsequent reboots of bootstrap node didn't change the behaviour.

Any subsequent reboot of other nodes (AFTER the bootstrap node was rebooted for the first time AFTER the first lxc was deployed) caused a change in the apiaddress to the lxcbr0 ip in all agents on the rebooted node. So seems to be something happening when the bootstrap node is rebooted for the first time.

Revision history for this message
Paul Gear (paulgear) wrote :

This may be related to bug 1417308.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Per the duplicates:
manual provider does not work if lxc is already installed on the machine selected to be the state-server
Juju will fail to upgrade to 1.21.1 if a restart has happened.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

We need some more information about how to reproduce this and I don't want it to block the 1.21.2 release until then.
I've synced up with Alexis and Wes on this.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Per the duplicates:

A. Start an instance in a cloud and install lxc on it. With the manual provider, attempt to bootstrap it.

B. The former juju-ci3 was bootstrapped with an older version of juju, then lxc was installed. The upgrade to 1.21.1 failed.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Thanks for the info!

I still think this shouldn't block 1.21.2, so I'm removing the milestone (because 1.21.3 is not yet available).

Revision history for this message
Curtis Hovey (sinzui) wrote :

This job is bootstrapping a machine which has lxc installed. When it starts passing, we will know a fix works:
    http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/

Revision history for this message
Curtis Hovey (sinzui) wrote :

We see evidence that the manual-provider test case was broken by a separate change. We can see 1.21 was passing the case until a recent backport of a feature from master. I will split the manual case from this bug.

The case of a working stack switching the apiaddresses to an lxcbr0 address can be reproduced with these steps on 1.21 on a real cloud:
1. juju bootstrap
2. deploy ubuntu
3. juju ssh 0 "sudo apt-get install -y lxc"
4. reboot -n
The agents will then fail:

~$ sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf
apiaddresses:
- 10.0.3.1:17070

$ sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-1/agent.conf
apiaddresses:
- 10.0.3.1:17070

This issue also happens when bootstrapping 1.20, installing lxc, rebooting, then upgrading to 1.21. This is my command line for the fewest steps:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.185.145.253:17070
juju ssh 0 "sudo apt-get install -y lxc"
juju ssh 0 "sudo reboot -n"
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.0.3.1:17070

description: updated
Revision history for this message
Curtis Hovey (sinzui) wrote :

I confirmed that the manual provider with 1.21 is indeed broken in the same way. With lxc installed on the target state-server:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.0.3.1:17070

description: updated
Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
assignee: nobody → Dimiter Naydenov (dimitern)
Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
importance: Critical → High
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I've analyzed the problem and used the given steps to reproduce it.
The issue is due to the way we sort and pick cloud-local addresses for machines (incl. the API server). We should filter out any 10.0.x.y addresses we can see on the machine before considering them as usable, taking into account the existence and contents of /etc/default/lxc-net (e.g. LXC_ADDR="10.0.3.1" and/or whatever address is assigned to the bridge specified in LXC_BRIDGE).

This should solve the problem and it could be easily ported to 1.20, 1.21, 1.22, and trunk. I'll start on the fix.
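
As a rough illustration only (the real change belongs in juju's address-selection code, not a script), the address that must be excluded can be read from /etc/default/lxc-net, falling back to the stock defaults; the candidate list below just reuses example addresses from this report:

# Minimal sketch: find the LXC bridge address and filter it out of a
# candidate list of API addresses (printed here for illustration).
[ -r /etc/default/lxc-net ] && . /etc/default/lxc-net
LXC_BRIDGE=${LXC_BRIDGE:-lxcbr0}
LXC_ADDR=${LXC_ADDR:-10.0.3.1}
candidates="10.10.18.51:17070 ${LXC_ADDR}:17070"
for addr in $candidates; do
    [ "${addr%:*}" = "$LXC_ADDR" ] || echo "usable: $addr"
done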

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Fix for 1.21 proposed at https://github.com/juju/juju/pull/1616

Once approved, this fix will be forward ported to 1.22 and 1.23 (trunk).

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Fix landed in 1.21, let's see how all the various CI jobs will react.

I'm assigning the forward porting of the same fix to 1.22 and trunk to James.
James, let's have a chat tomorrow about this.

Changed in juju-core:
status: In Progress → Triaged
assignee: Dimiter Naydenov (dimitern) → James Tunnicliffe (dooferlad)
Revision history for this message
Curtis Hovey (sinzui) wrote :

The last failure was caused by a dirty machine. We found mongod instances from the failed runs still running. After cleaning the machine, we can see the fixed juju passed and that it cleans up mongod.

Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Sacha Yunusic (sacha-m) wrote :

I have the same behavior. After the reboot, it got the lxcbr0 IP address instead of the eth0 one.
I have installed 1.20.14-0ubuntu1~14.04.1~juju1.
What if I manually change the apiaddresses value in /var/lib/juju/agents/machine-0-lxc-2/agent.conf and reboot? Would that solve the problem?
BTW, all services are running (sudo juju status | grep agent-state)

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

No Sacha, it won't help as they will get overwritten.

I'd suggest upgrading to 1.21.3 from the Juju stable releases PPA https://launchpad.net/~juju/+archive/ubuntu/stable

Revision history for this message
Sacha Yunusic (sacha-m) wrote :

I updated to 1.21.3. That fixed the problem. :)

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Another happy user :)

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → 1.23-beta1