juju-core

LXC containers getting HA VIP addresses after reboot

Bug #1516150 reported by Peter Sabaini on 2015-11-13

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	Dimiter Naydenov	Canonical Juju 2.0-beta16
juju-core	Won't Fix	Undecided	Unassigned
1.25	Won't Fix	Undecided	Unassigned

Bug Description

Due to network driver issues, a juju-db node in an HA configuration had to be rebooted. After the node came back up, several units had their public addresses set incorrectly. Those units were part of a corosync HA cluster. For some of the units, the public address had been set to the corosync VIP that should normally float between nodes, instead of their normal machine address. This prevented corosync from starting up and therefore made the service inaccessible. Rebooting the containers set the unit addresses back to their correct values.

The behaviour is similar to the one described in bug #1463480, where VIP addresses had been set in the course of a juju upgrade; only here no upgrade took place, just a node reboot.

Juju version: 1.22.6-trusty-amd64

Tags:

Revision history for this message

Cheryl Jennings (cherylj) wrote on 2015-11-13:

I don't think, for bug #1463480, that an upgrade was required to hit this problem. I think just rebooting the node was enough for the IP to switch. Going to just double check with mfoord.

Peter Sabaini (peter-sabaini) wrote on 2015-11-13:

I suspected that this might be related. However, there is one notable difference I should point out: for bug #1463480 to hit, the HA VIP has to sort first, before the "regular", i.e. correct, addresses. In this case, the VIPs do sort _after_ the regular addresses (we've moved our default VIP block in honour of the above bug).

Cheryl Jennings (cherylj) wrote on 2015-11-13:

Thanks for the additional info, Peter. Can you attach the machine logs?

Peter Sabaini (peter-sabaini) wrote on 2015-11-16:

Here are the logs from machine-0 and machine-1 (the one that was rebooted):
https://chinstrap.canonical.com/~sabaini/2015-11-13-jujureboot-ha-issue

Cheryl Jennings (cherylj) on 2015-11-16

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → High

Curtis Hovey (sinzui) on 2016-05-09

tags:

added: juju-reboot

Cheryl Jennings (cherylj) wrote on 2016-05-11:

While this seems to have been fixed for the non-container case, I can still reproduce this problem with 1.25.5 and 2.0-beta7.

Steps:
- Deploy a service to a container
- Add an additional IP to eth0: sudo ip addr add 9.9.9.9 dev eth0 (it is important that the IP is lower than the current IP on eth0)
- Restart the jujud process on the container

You'll see that the IP is switched:
      0/lxc/0:
        juju-status:
          current: started
          since: 11 May 2016 11:58:28-05:00
          version: 2.0-beta7.1
        dns-name: 9.9.9.9
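
The "lower than the current IP" condition in the steps above can be checked before adding the test address. This is a minimal sketch, assuming "lower" means numerically lower; the helper name `sorts_lower` and the sample address 10.0.3.15 are hypothetical and do not come from this report.

```python
import ipaddress


def sorts_lower(candidate: str, current: str) -> bool:
    """Return True if `candidate` compares numerically lower than `current`."""
    return ipaddress.ip_address(candidate) < ipaddress.ip_address(current)


# 9.9.9.9 is the test address from the steps above; 10.0.3.15 is a made-up
# stand-in for the container's existing eth0 address.
print(sorts_lower("9.9.9.9", "10.0.3.15"))
```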

Changed in juju-core:
milestone:	none → 2.0-beta7

Curtis Hovey (sinzui) on 2016-05-13

Changed in juju-core:
milestone:	2.0-beta7 → 2.0-beta8

Cheryl Jennings (cherylj) on 2016-05-26

Changed in juju-core:
milestone:	2.0-beta8 → 2.0-beta9

Curtis Hovey (sinzui) on 2016-06-16

Changed in juju-core:
milestone:	2.0-beta9 → 2.0-beta10

Curtis Hovey (sinzui) on 2016-06-24

Changed in juju-core:
milestone:	2.0-beta10 → 2.0-beta11

Curtis Hovey (sinzui) on 2016-07-01

Changed in juju-core:
milestone:	2.0-beta11 → 2.0-beta12

Cheryl Jennings (cherylj) on 2016-07-05

Changed in juju-core:
milestone:	2.0-beta12 → 2.0-beta13

John A Meinel (jameinel) wrote on 2016-07-12:

The machine agent looks at the IP addresses on the network devices at startup. So it shouldn't require an instance reboot, just a simple restart of the jujud agent. (restart jujud-machine-X, IIRC)

I thought we would have some of the same fixes in place for containers that we had for machines, but here are my thoughts on it:
  1. Check your IP addresses more regularly, so that when you lose the VIP you notice and can remove the VIP from your list. However, their charms would have to deal with the IP address changing, which I'm not sure if they are equipped to do.
  2. Hysteresis where we remember the last address we gave and continue to give that address. This requires making sure the machine agent starts up at some point where we don't have the VIP, so that it will start reporting the non-VIP address, and then future reboots of the agent should be ok. (I thought we did this one)
  3. Space awareness, where we can notice that the VIP address is not in the space that we want the unit to be reporting from. This is potentially risky depending on how they are defining the spaces and what VIP address they are giving. (If they say the subnet is 10.0.0.0/24 but then give 10.0.0.254 as the VIP then we can't tell it isn't in the right subnet.) I don't know how they are picking VIPs; I believe they are picking 'exclusion' ranges in MAAS, but I don't know how that overlaps with the subnets that they are defining.
  4. Stop using detected addresses entirely, in favor of Provider reported addresses. We may need to vet this a bit more, but since we now use the Device mechanism from MAAS to statically allocate an IP address, we should be able to know the IP address without waiting for the machine to come up. We were using detection to handle the fact that DHCP allocated addresses weren't known in advance. (do we have an 'ignore-machine-addresses' flag that was intended to do this, but might have a different bug in it?)
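
Options 2 and 3 from the list above can be sketched roughly as follows. This is an illustration only, not Juju's implementation: the function names `pick_with_hysteresis` and `in_space` and the regular address 10.0.0.5 are hypothetical, and the subnet and VIP values are the ones used as an example in option 3.

```python
import ipaddress
from typing import List, Optional


def pick_with_hysteresis(detected: List[str], last_reported: Optional[str]) -> Optional[str]:
    """Keep reporting the previous address while it is still on the machine."""
    if last_reported is not None and last_reported in detected:
        return last_reported
    # First start, or the old address is gone: fall back to the first detected one.
    return detected[0] if detected else None


def in_space(address: str, space_subnets: List[str]) -> bool:
    """Return True if `address` falls inside any subnet of the space."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in space_subnets)


# Option 2: the VIP 10.0.0.254 is listed first after a restart, but the
# remembered address 10.0.0.5 is still present, so it keeps being reported.
print(pick_with_hysteresis(["10.0.0.254", "10.0.0.5"], last_reported="10.0.0.5"))

# Option 3: a VIP taken from inside the space's own subnet passes the subnet
# check, so this check alone cannot tell it apart from a regular address.
print(in_space("10.0.0.254", ["10.0.0.0/24"]))
```

The hysteresis sketch also shows the caveat mentioned in option 2: if the agent first starts while the VIP is the only remembered candidate, the VIP is what gets remembered.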

Cheryl Jennings (cherylj) wrote on 2016-07-12:

IIRC, the ignore-machine-addresses flag is ignored for containers (see bug 1509292).

Curtis Hovey (sinzui) on 2016-07-22

Changed in juju-core:
milestone:	2.0-beta13 → 2.0-beta14

Curtis Hovey (sinzui) on 2016-08-04

Changed in juju-core:
milestone:	2.0-beta14 → 2.0-beta15

Dimiter Naydenov (dimitern) on 2016-08-08

Changed in juju-core:
status:	Triaged → In Progress
assignee:	nobody → Dimiter Naydenov (dimitern)

Dimiter Naydenov (dimitern) wrote on 2016-08-09:

I managed to isolate the problem, and it's not due to addresses sort order. It's because the HA VIP has a "public" scope (e.g. 109.106.176.12 or 9.9.9.9 as described in comment #5), while the other addresses have "local-cloud" scope (RFC 1918 - e.g. 172.16.150.107). What you see in "dns-name" is taken from the preferred public address of the machine (returned also by 'unit-get public-address'). Addresses sort order only matters within the same scope.

To avoid hitting the described issue, I *strongly* suggest not using non-RFC1918 ("public") IPs for HA VIPs. I can confirm using the steps in comment #5 and any IP from the same subnet (sorting lower OR higher than the existing one on eth0) and restarting jujud *will not* change the "dns-name" for the machine in status.
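
The scope behaviour described above can be modelled with a short sketch. It is a simplified, IPv4-only model and not Juju's actual selection code: the names `scope_of` and `preferred_public` are hypothetical, and it ignores any logic that keeps a previously chosen address.

```python
import ipaddress
from typing import List, Optional

# RFC 1918 private ranges, treated here as "local-cloud" scope.
RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def scope_of(address: str) -> str:
    """Classify an address as 'local-cloud' (RFC 1918) or 'public'."""
    ip = ipaddress.ip_address(address)
    return "local-cloud" if any(ip in net for net in RFC1918) else "public"


def preferred_public(addresses: List[str]) -> Optional[str]:
    """Pick a public-scope address if there is one, else the first local-cloud one.

    List order only breaks ties between addresses of the same scope.
    """
    for wanted in ("public", "local-cloud"):
        for addr in addresses:
            if scope_of(addr) == wanted:
                return addr
    return None


# A public-scope VIP wins wherever it appears in the list.
print(preferred_public(["172.16.150.107", "9.9.9.9"]))
print(preferred_public(["9.9.9.9", "172.16.150.107"]))
```

In this model both calls return 9.9.9.9, which matches the observation that the position of the VIP in the sort order does not matter once its scope differs from that of the regular address.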

To fix the issue properly, the corosync charm can be changed to use network-get with a given binding, which is reserved for clustering/HA endpoints, and *not* to use 'unit-get private-address' or 'unit-get public-address'. We should discuss the details of how to do that.
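
A charm-side sketch of that idea might look like the following. The binding name "cluster", the function name `cluster_address`, and the injectable `run` parameter are hypothetical, and the exact `network-get` arguments are an assumption that should be checked against the hook tool's help for the Juju version in use.

```python
import subprocess
from typing import Callable, List


def cluster_address(binding: str = "cluster",
                    run: Callable[[List[str]], bytes] = subprocess.check_output) -> str:
    """Ask Juju for the address bound to `binding` instead of using unit-get."""
    # Assumption: `network-get --primary-address <binding>` is available in the
    # hook environment; verify the flag for the Juju release being targeted.
    output = run(["network-get", "--primary-address", binding])
    return output.decode("utf-8").strip()
```

Passing `run` explicitly makes the function usable outside a hook context, for example in a unit test that supplies canned output.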

Changed in juju-core:
status:	In Progress → Invalid
assignee:	Dimiter Naydenov (dimitern) → nobody
assignee:	nobody → Dimiter Naydenov (dimitern)

Dimiter Naydenov (dimitern) wrote on 2016-08-09:

Not an actual fix, but at least to make the situation less surprising I'll propose a patch adding extra logging if and when the preferred private/public addresses of a machine change.

Changed in juju-core:
status:	Invalid → In Progress

Peter Sabaini (peter-sabaini) wrote on 2016-08-10:

#10

We _are_ using RFC1918 addresses for HA VIP as a rule. This should also be visible in the logs mentioned above. Note due to infra changes the log url is now https://private-fileshare.canonical.com/~sabaini/2015-11-13-jujureboot-ha-issue/

Dimiter Naydenov (dimitern) wrote on 2016-08-10:

#11

Sorry Peter, it's not clear to me from the logs what ranges you are using for VIPs. I've seen both e.g. 109.106.176.12 and 172.16.150.107.

Could you describe the networks you are using please?

Dimiter Naydenov (dimitern) wrote on 2016-08-11:

#12

Here's a patch that will add extra logging at INFO level whenever the preferred private or public addresses of a machine change (typically picked up after host reboots or the running jujud is restarted). It should help to track when and why the unexpected changes to HA VIPs happen.

https://github.com/juju/juju/pull/5980

This should be part of the next beta (15).
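
The kind of change logging described above can be sketched as follows. This is not the content of the linked pull request; the logger name, the class `PreferredAddress`, and the message format are hypothetical and only illustrate logging at INFO level when a preferred address changes.

```python
import logging
from typing import Optional

logger = logging.getLogger("machine.addresses")


class PreferredAddress:
    """Remember one preferred address and log at INFO level when it changes."""

    def __init__(self, kind: str) -> None:
        self.kind = kind  # "public" or "private"
        self.value: Optional[str] = None

    def update(self, new_value: str) -> bool:
        """Store `new_value`; return True (and log) only if it differs."""
        if new_value == self.value:
            return False
        logger.info("preferred %s address changed from %r to %r",
                    self.kind, self.value, new_value)
        self.value = new_value
        return True


logging.basicConfig(level=logging.INFO)
public = PreferredAddress("public")
public.update("172.16.150.107")  # first value seen, logged as a change from None
public.update("172.16.150.107")  # unchanged, nothing logged
public.update("9.9.9.9")         # a different address took over, logged
```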

Richard Harding (rharding) on 2016-08-11

Changed in juju-core:
milestone:	2.0-beta15 → 2.0-beta16

Richard Harding (rharding) on 2016-08-16

Changed in juju-core:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

affects:	juju-core → juju
Changed in juju:
milestone:	2.0-beta16 → none
milestone:	none → 2.0-beta16

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

Changed in juju-core:
status:	New → Won't Fix

Curtis Hovey (sinzui) on 2016-08-25

Changed in juju:
status:	Fix Committed → Fix Released
