juju-core

juju agent using lxcbr0 address as apiaddress instead of juju-br0 breaks agents

Bug #1416928 reported by Fabricio Costi on 2015-02-02

This bug affects 7 people

Affects	Status	Importance	Assigned to	Milestone
juju-core	Fix Released	High	James Tunnicliffe	juju-core 1.23-beta1
1.21	Fix Released	Critical	Dimiter Naydenov	juju-core 1.21.3
1.22	Fix Released	Critical	James Tunnicliffe	juju-core 1.22-beta4

Bug Description

juju-core 1.21.1

node 0: bootstrap, lxc/0-2
    - juju-br0 10.10.18.51 (eth0)
    - lxcbr0 10.0.3.1
    - eth1 in promisc mode
node 1: lxc/0-6
    - juju-br0 10.10.18.52 (eth0)
    - lxcbr0 10.0.3.1
    - virbr0 192.168.122.1
    - eth1 in promisc mode
node 2: metal only
    - juju-br0 10.10.18.53 (eth0)
    - virbr0 192.168.122.1
    - eth1 in promisc mode

- On node 0, the physical and lxc machine agents and all unit agents have the node 0 lxcbr0 IP assigned to apiaddress in agent.conf. They report their state correctly.

- On node 1, the physical and lxc machine agents and all unit agents have the node 0 lxcbr0 IP assigned to apiaddress in agent.conf. They can't reach WSS.

- On node 2, the physical machine agent has the node 0 lxcbr0 IP assigned to apiaddress in agent.conf. It can't reach WSS.

I am assuming all agents are getting the lxcbr0 IP from node 0, as node 2 does not have an lxcbr0 and yet it has the 10.0.3.1 IP assigned to its apiaddress in agent.conf.

Manually changing apiaddress on all agents to juju-br0 ip of node 0, stopping and starting all the agents temporarily solved the issue.

After reboot, same problem happens as agent.conf is overwritten by juju on all machines, lxc and units.
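
The topology above hints at why 10.0.3.1 cannot work as an API address: the same address is configured on more than one host, so each host resolves it to its own local bridge. A minimal Python sketch of that check, using only the interface addresses listed in this report (the `NODES` table and the `shared_addresses` helper are illustrative names, not part of juju):

```python
from collections import defaultdict

# Interface addresses as listed in the bug description.
NODES = {
    "node 0": {"juju-br0": "10.10.18.51", "lxcbr0": "10.0.3.1"},
    "node 1": {
        "juju-br0": "10.10.18.52",
        "lxcbr0": "10.0.3.1",
        "virbr0": "192.168.122.1",
    },
    "node 2": {"juju-br0": "10.10.18.53", "virbr0": "192.168.122.1"},
}


def shared_addresses(nodes):
    """Return addresses configured on more than one node, with their owners."""
    owners = defaultdict(list)
    for node, interfaces in nodes.items():
        for iface, address in interfaces.items():
            owners[address].append((node, iface))
    return {addr: where for addr, where in owners.items() if len(where) > 1}


if __name__ == "__main__":
    for address, where in sorted(shared_addresses(NODES).items()):
        print(address, "is not unique:", where)
```

Only the bridge addresses (lxcbr0, virbr0) show up as shared; the juju-br0 addresses are unique per node, which is consistent with them being the only usable choice for apiaddress here.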

To reproduce with 1.21 in a cloud:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - REAL_ETH0_ADDRESS:17070
juju ssh 0 "sudo apt-get install -y lxc"
juju ssh 0 "sudo reboot -n"
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.0.3.1:17070

Or with manual-provider with lxc installed on the target state-server:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
apiaddresses:
- 10.0.3.1:17070
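
The grep check in the steps above can be automated. The following is a rough Python sketch, not a juju tool: it pulls the entries under `apiaddresses:` out of agent.conf text and flags any that fall inside the LXC bridge subnet. The function names are hypothetical, and the subnet 10.0.3.0/24 is an assumption based on the stock Ubuntu lxc-net default; adjust it if LXC_NETWORK is set differently.

```python
import ipaddress

# Assumption: the stock Ubuntu lxc-net subnet; override if LXC_NETWORK differs.
DEFAULT_LXC_NETWORK = ipaddress.ip_network("10.0.3.0/24")


def parse_apiaddresses(text):
    """Extract host:port entries listed under an 'apiaddresses:' key."""
    entries = []
    in_block = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "apiaddresses:":
            in_block = True
        elif in_block and line.startswith("- "):
            entries.append(line[2:].strip())
        elif in_block and line:
            break  # a different key follows: the list has ended
    return entries


def lxc_bridge_entries(entries, network=DEFAULT_LXC_NETWORK):
    """Return the entries whose host part lies inside the LXC bridge subnet."""
    suspicious = []
    for entry in entries:
        host = entry.rsplit(":", 1)[0]
        try:
            if ipaddress.ip_address(host) in network:
                suspicious.append(entry)
        except ValueError:
            continue  # hostname or placeholder, not an IP literal
    return suspicious
```

Feeding it the output of the grep command before and after the reboot should return an empty list in the first case and the 10.0.3.1 entry in the second.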

Fabricio Costi (fabricio-9) on 2015-02-02

description:	updated
summary:	- juju agent using lxcbr0 address as apiaddress instead of juju-br0 after reboot
	+ juju agent using lxcbr0 address as apiaddress instead of juju-br0 breaks agents

Fabricio Costi (fabricio-9) on 2015-02-02

description:	updated

Curtis Hovey (sinzui) on 2015-02-02

tags:	added: lxc network
tags:	added: api

Paul Gear (paulgear) wrote on 2015-02-03:

machine-6.log (333.4 KiB, text/plain)

I'm seeing this error on new deploys on juju 1.21.1-trusty-amd64 from the stable PPA. I've attached an extract of the (redacted) machine log showing it changing the API addresses from the correct value (192.168.99.0) to the wrong value (10.0.3.1).

Paul Gear (paulgear) wrote on 2015-02-03:

There are corresponding messages on machine 0, where it shows:

2015-02-03 01:17:36 INFO juju.worker.machiner machiner.go:92 setting addresses for machine-0 to ["local-machine:127.0.0.1" "local-cloud:192.168.99.0" "local-cloud:10.0.3.1" "local-machine:::1"]
...
2015-02-03 01:17:36 DEBUG juju.apiserver apiserver.go:156 <- [6D] machine-0 {"RequestId":21,"Type":"Machiner","Request":"SetMachineAddresses","Params":{"MachineAddresses":[{"Tag":"machine-0","Addresses":[{"Value":"127.0.0.1","Type":"ipv4","NetworkName":"","Scope":"local-machine"},{"Value":"192.168.99.0","Type":"ipv4","NetworkName":"","Scope":"local-cloud"},{"Value":"10.0.3.1","Type":"ipv4","NetworkName":"","Scope":"local-cloud"},{"Value":"::1","Type":"ipv6","NetworkName":"","Scope":"local-machine"}]}]}}
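
To make the log line easier to read, here is a small Python sketch that groups the addresses in that SetMachineAddresses request by scope. The request body is copied from the log above; the `addresses_by_scope` helper is an illustrative name, not a juju API.

```python
import json

# Request body copied from the machine-0 apiserver log line.
REQUEST = (
    '{"RequestId":21,"Type":"Machiner","Request":"SetMachineAddresses",'
    '"Params":{"MachineAddresses":[{"Tag":"machine-0","Addresses":['
    '{"Value":"127.0.0.1","Type":"ipv4","NetworkName":"","Scope":"local-machine"},'
    '{"Value":"192.168.99.0","Type":"ipv4","NetworkName":"","Scope":"local-cloud"},'
    '{"Value":"10.0.3.1","Type":"ipv4","NetworkName":"","Scope":"local-cloud"},'
    '{"Value":"::1","Type":"ipv6","NetworkName":"","Scope":"local-machine"}'
    ']}]}}'
)


def addresses_by_scope(request_json):
    """Group the address values in a SetMachineAddresses request by scope."""
    request = json.loads(request_json)
    grouped = {}
    for machine in request["Params"]["MachineAddresses"]:
        for address in machine["Addresses"]:
            grouped.setdefault(address["Scope"], []).append(address["Value"])
    return grouped


if __name__ == "__main__":
    for scope, values in addresses_by_scope(REQUEST).items():
        print(scope, values)
```

Both 192.168.99.0 and the lxcbr0 address 10.0.3.1 are reported with the local-cloud scope, so scope alone cannot distinguish the reachable address from the bridge address.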

Curtis Hovey (sinzui) on 2015-02-03

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 1.23

Fabricio Costi (fabricio-9) wrote on 2015-02-04:

This happens only AFTER the bootstrap node is rebooted for the first time.

On a new env I was able to reboot all other nodes without a problem - apiaddress still points to the right IP of bootstrap node - provided the bootstrap node hadn't been rebooted after the first lxc was deployed.

By inspecting the agent.conf from LXC running on the bootstrap node I could see the apiaddress is pointing to the right ip BEFORE reboot.

AFTER the first reboot of bootstrap node, the apiaddress was changed to the lxcbr0 ip. Subsequent reboots of bootstrap node didn't change the behaviour.

Any subsequent reboot of other nodes (AFTER the bootstrap node was rebooted for the first time AFTER the first lxc was deployed) caused a change in the apiaddress to the lxcbr0 ip in all agents on the rebooted node. So seems to be something happening when the bootstrap node is rebooted for the first time.

Paul Gear (paulgear) wrote on 2015-02-04:

This may be related to bug 1417308.

Curtis Hovey (sinzui) wrote on 2015-02-06:

Per the duplicates:
manual provider does not work if lxc is already installed on the machine selected to be the state-server
Juju will fail to upgrade to 1.21.1 if a restart has happened.

Dimiter Naydenov (dimitern) wrote on 2015-02-09:

We need some more information about how to reproduce this and I don't want it to block the 1.21.2 release until then.
I've synced up on this with Alexis and Wes.

Curtis Hovey (sinzui) wrote on 2015-02-09:

Per the duplicates:

A. Start an instance in a cloud and install lxc on it. With the manual provider, attempt to bootstrap it.

B. The former juju-ci3 was bootstrapped with an older version of juju, then lxc was installed. The upgrade to 1.21.1 failed.

Dimiter Naydenov (dimitern) wrote on 2015-02-09:

Thanks for the info!

I still think this shouldn't block 1.21.2, so I'm removing the milestone (because 1.21.3 is not yet available).

Curtis Hovey (sinzui) wrote on 2015-02-09:

This job is bootstrapping a machine which has lxc installed. When it starts passing, we will know a fix works:
http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/

Curtis Hovey (sinzui) wrote on 2015-02-11:

#10

We see evidence that the manual-provider test case was broken by a separate change. We can see 1.21 was passing the case until a recent backport of a feature from master. I will split the manual case from this bug.

The case of a working stack switching the apiaddresses to an lxcbr0 address can be seen with these steps with 1.21 on a real cloud.
1. juju bootstrap
2. deploy ubuntu
3. juju ssh 0 "sudo apt-get install -y lxc"
4. reboot -n
The agents will then fail

~$ sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf
apiaddresses:
- 10.0.3.1:17070

$ sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-1/agent.conf
apiaddresses:
- 10.0.3.1:17070

This issue also happens when bootstrapping 1.20, installing lxc, rebooting, then upgrading to 1.21. This is my command line for the fewest steps:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.185.145.253:17070
juju ssh 0 "sudo apt-get install -y lxc"
juju ssh 0 "sudo reboot -n"
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
  apiaddresses:
  - 10.0.3.1:17070

description:	updated

Curtis Hovey (sinzui) wrote on 2015-02-11:

#11

I confirmed that manual-provider with 1.21 is indeed broken the same way. With lxc installed on the target state-server:
juju bootstrap
juju ssh 0 "sudo grep -A 1 apiaddresses /var/lib/juju/agents/machine-0/agent.conf"
apiaddresses:
- 10.0.3.1:17070

description:	updated

Curtis Hovey (sinzui) on 2015-02-11

Changed in juju-core:
importance:	High → Critical

Curtis Hovey (sinzui) on 2015-02-11

Changed in juju-core:
importance:	Critical → High
assignee:	nobody → Dimiter Naydenov (dimitern)

Curtis Hovey (sinzui) on 2015-02-11

Changed in juju-core:
importance:	High → Critical
importance:	Critical → High

Dimiter Naydenov (dimitern) wrote on 2015-02-16:

#12

I've analyzed the problem and used the given steps to reproduce it.
The issue is due to the way we sort and pick cloud-local addresses for machines (incl. the api server). We should filter out any 10.0.x.y addresses we can see on the machine before considering them usable, taking into account the existence and contents of /etc/default/lxc-net (e.g. LXC_ADDR="10.0.3.1" and/or whatever address the bridge specified in LXC_BRIDGE has).

This should solve the problem and it could be easily ported to 1.20, 1.21, 1.22, and trunk. I'll start on the fix.
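
As an illustration of the filtering idea described in this comment, here is a minimal Python sketch. It is not the code from the eventual fix, which this thread does not show. It covers only the LXC_ADDR part: it reads `KEY="value"` assignments from the text of /etc/default/lxc-net and drops the matching address from a candidate list. Looking up the address of the bridge named in LXC_BRIDGE is left out, and `parse_lxc_net` and `filter_lxc_addresses` are hypothetical names.

```python
import ipaddress
import re


def parse_lxc_net(text):
    """Parse KEY=value assignments from /etc/default/lxc-net content."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Z_]+)=(.*)$", line)
        if match:
            # Drop surrounding whitespace and optional shell quotes.
            settings[match.group(1)] = match.group(2).strip().strip("\"'")
    return settings


def filter_lxc_addresses(candidates, lxc_net_text):
    """Drop the address named by LXC_ADDR from a list of candidate IPs."""
    settings = parse_lxc_net(lxc_net_text)
    excluded = set()
    if settings.get("LXC_ADDR"):
        excluded.add(ipaddress.ip_address(settings["LXC_ADDR"]))
    return [a for a in candidates if ipaddress.ip_address(a) not in excluded]


if __name__ == "__main__":
    conf = 'LXC_BRIDGE="lxcbr0"\nLXC_ADDR="10.0.3.1"\n'
    # Candidate addresses taken from the machine-0 log earlier in this report.
    print(filter_lxc_addresses(["192.168.99.0", "10.0.3.1"], conf))
```

When the file is missing or empty, nothing is excluded, which matches the "taking into account the existence" condition: a machine without lxc-net configured keeps all of its candidate addresses.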

Changed in juju-core:
status:	Triaged → In Progress

Dimiter Naydenov (dimitern) wrote on 2015-02-17:

#13

Fix for 1.21 proposed at https://github.com/juju/juju/pull/1616

Once approved, this fix will be forward ported to 1.22 and 1.23 (trunk).

Dimiter Naydenov (dimitern) wrote on 2015-02-17:

#14

Fix landed in 1.21, let's see how all the various CI jobs will react.

I'm assigning the forward porting of the same fix to 1.22 and trunk to James.
James, let's have a chat tomorrow about this.

Changed in juju-core:
status:	In Progress → Triaged
assignee:	Dimiter Naydenov (dimitern) → James Tunnicliffe (dooferlad)

Curtis Hovey (sinzui) wrote on 2015-02-17:

#16

The last failure was caused by a dirty machine. We found mongod from the failed runs running. After cleaning the machine, we can see fixed juju passed and it cleans up mongod.

James Tunnicliffe (dooferlad) on 2015-02-18

Changed in juju-core:
status:	Triaged → In Progress

James Tunnicliffe (dooferlad) on 2015-02-19

Changed in juju-core:
status:	In Progress → Fix Committed
status:	Fix Committed → In Progress

James Tunnicliffe (dooferlad) on 2015-02-19

Changed in juju-core:
status:	In Progress → Fix Committed

Sacha Yunusic (sacha-m) wrote on 2015-02-25:

#17

I have the same behavior. After the reboot, it got the lxcbr0 IP address instead of the eth0 one.
I have installed 1.20.14-0ubuntu1~14.04.1~juju1.
What if I manually change the apiaddresses value in /var/lib/juju/agents/machine-0-lxc-2/agent.conf and reboot? Would that solve the problem?
BTW, all services are running (sudo juju status | grep agent-state).

Dimiter Naydenov (dimitern) wrote on 2015-02-25:

#18

No, Sacha, it won't help, as the values will get overwritten.

I'd suggest upgrading to 1.21.3 from the Juju stable releases PPA https://launchpad.net/~juju/+archive/ubuntu/stable

Sacha Yunusic (sacha-m) wrote on 2015-02-25:

#19

I updated to 1.21.3. That fixed the problem. :)

Dimiter Naydenov (dimitern) wrote on 2015-02-25:

#20

Another happy user :)

Curtis Hovey (sinzui) on 2015-03-09

Changed in juju-core:
status:	Fix Committed → Fix Released

Curtis Hovey (sinzui) on 2015-03-09

Changed in juju-core:
milestone:	1.23 → 1.23-beta1
