Failed upgrade, mixed up HA addresses
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack hacluster charm | | Undecided | Unassigned | |
| juju-core | | High | Ian Booth | |
| juju-core 1.22 | | High | Tim Penhey | |
| juju-core 1.24 | | High | Ian Booth | |
| hacluster (Juju Charms Collection) | | Undecided | Unassigned | |
Bug Description
After upgrading juju from 1.20.14 to 1.22.1, one mysql HA container a) failed to upgrade and b) had its IP address overwritten by the hacluster VIP address.
The mysql LXCs physically have these addresses:
mysql/0: 172.20.171.204
mysql/1: 172.20.172.11
mysql/2: 172.20.171.253
The HA-cluster VIP is set to 172.20.168.104.
After the upgrade, mysql/0 is still on the old juju version and has its address set to the former VIP:
mysql/0:
machine: 0/lxc/0
mysql/1:
machine: 1/lxc/4
mysql/2:
machine: 2/lxc/4
Notes:
* the LXCs are still reachable at their original addresses, i.e. the physical networking (sans HA, of course) is still intact, e.g.:
$ for h in 172.20.171.204 172.20.172.11 172.20.171.253 ; do ssh -l ubuntu $h hostname ; done
Warning: Permanently added '172.20.171.204' (ECDSA) to the list of known hosts.
juju-machine-
Connection to 172.20.171.204 closed.
Warning: Permanently added '172.20.172.11' (ECDSA) to the list of known hosts.
juju-machine-
Connection to 172.20.172.11 closed.
Warning: Permanently added '172.20.171.253' (ECDSA) to the list of known hosts.
juju-machine-
Connection to 172.20.171.253 closed.
* I've hit lp:1441478 during the upgrade and applied the manual db hackery mentioned in comment #2
* Also, during the upgrade machine-1 lost DNS resolution for a short time due to DHCP flakiness (unrelated)
| description: | updated |
| Changed in juju-core: | |
| milestone: | none → 1.25.0 |
| Changed in juju-core: | |
| importance: | Undecided → High |
| tags: | added: blocker |
| Changed in juju-core: | |
| status: | New → In Progress |
| assignee: | nobody → Eric Snow (ericsnowcurrently) |
| Eric Snow (ericsnowcurrently) wrote : | #1 |
| tags: | added: ha upgrade-juju |
| Changed in juju-core: | |
| assignee: | Eric Snow (ericsnowcurrently) → nobody |
| status: | In Progress → Triaged |
In this context, is HA Juju's concept of HA? Or MySQL's?
Do you have any logs for machine 0/lxc/0 that we could take a look at?
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| Peter Sabaini (peter-sabaini) wrote : | #3 |
Uploading /var/log of machine-0
| Eric Snow (ericsnowcurrently) wrote : | #4 |
I took a look but really don't have enough familiarity with how upgrade works to make much progress without logs.
| Peter Sabaini (peter-sabaini) wrote : | #5 |
Uploading /var/log of machine-1
| Peter Sabaini (peter-sabaini) wrote : | #6 |
HA as implemented by the hacluster charm, ie. corosync/pacemaker
| Peter Sabaini (peter-sabaini) wrote : | #7 |
I failed to mention yesterday that keystone and possibly other
HA-ified services had similar issues as mysql.
Also, I've retried an upgrade today:
a. Upgrading from 1.20.14 --> 1.22.5.
b. Issue lp:1441478 struck again
c. On the mysql units I could observe a similar pattern as yesterday, after ~10min I could see the HA VIP on mysql/0 https:/
d. However! After another ~10 mins the mysql service partially recovered and mysql/0 got its proper IP back. Still, the mysql-hacluster subordinate remained down, and the config-changed hook hung:
2015-06-10 08:43:35 INFO juju.utils.fslock fslock.go:146 attempted lock failed "uniter-
This didn't change after 1h+
e. Also, the keystone service had similar issues after 1h+, i.e. the HA VIP on keystone/1; note the agent-versions: https:/
I've dumped /var/lib and /var/log from machines-{1,2,3} and can make them available.
| Peter Sabaini (peter-sabaini) wrote : | #8 |
Forgot to add status of mysql after partial resolve:
mysql/0:
machine: 0/lxc/0
mysql/1:
machine: 1/lxc/4
mysql/2:
machine: 2/lxc/4
| Eric Snow (ericsnowcurrently) wrote : | #9 |
Sorry we haven't been able to look at this closely yet, Peter. We'll be doing so today and hopefully have answers for you soon.
| Changed in juju-core: | |
| status: | Incomplete → Triaged |
| Cheryl Jennings (cherylj) wrote : | #10 |
Looking at the logs, I see that machine-0-lxc-0 tried to upgrade, but failed to fetch the tools:
2015-06-09 10:48:25 ERROR juju.worker.
After that, it was never able to connect to the state server; it dials a different address than the one it had connected to previously:
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://
Other containers failed to fetch tools from 10.0.3.1, but tried other servers and succeeded. Somehow the addresses for the servers got messed up. Still looking...
| Cheryl Jennings (cherylj) wrote : | #11 |
I think this might be a dup of bug #1416928.
| Peter Sabaini (peter-sabaini) wrote : | #12 |
Cheryl, I've briefly looked at bug #1416928 because I wondered the same thing. As a test I've rebooted machine-0 and got this in /var/lib/
apiaddresses:
- 172.20.168.2:17070
- 172.20.168.3:17070
- 172.20.168.4:17070
I.e. the priv_net addresses, not lxcbr0.
Also, shouldn't that bug be fixed in 1.22.1 and 1.22.5?
| Cheryl Jennings (cherylj) wrote : | #13 |
When machine-0 encountered this problem, it hadn't updated yet to 1.22, so it was still on 1.20.14.
Also, regarding the overwriting of the IP, I see in the log that the IP wasn't overwritten, but rather assigned multiple values, and juju status most likely just picked one to display (IIRC there was a bug not too long ago regarding the ordering of IPs displayed in juju status?):
2015-06-09 10:46:17 INFO juju.worker.
Going to pull in some folks with more networking knowledge to take a look.
| James Tunnicliffe (dooferlad) wrote : | #14 |
10.0.3.1 is suspicious since that is what lxc will pick for a bridge, so the theory that bug #1416928 is at least related seems reasonable. It looks like the URL that is being provided by the upgrader Tools API call (which is just common.
| Dimiter Naydenov (dimitern) wrote : | #15 |
Looking at the logs I think the issue is around not having an upgrade step from 1.20.x to 1.22.x which filters out any lxcbr0 addresses from the API hosts/ports. 10.0.3.1 was among the list of API hostPorts along with quite a lot of IPv6 link-local addresses:
https:/
So: 1) to fix the issue some mongo surgery is needed (to salvage the environment) - drop all 10.0.3.0/24 and fe80::/64 addresses from the hostPorts (stateServers?) collections; 2) to make sure it doesn't happen again - add an upgrade step, run on the DatabaseMaster, which gets the current API hostPorts, converts them to []network.Address for each server (network.
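For reference, a very rough, read-only sketch of the kind of inspection that would precede such surgery; the database name, port, and collection name here are assumptions based on the comment above, not verified against the actual schema:

```
# Read-only inspection sketch only. The collection name ("stateServers") is
# taken from the comment above and may not match the real schema; the auth
# flags (admin user/password from the state server's agent.conf) are omitted.
# Take a DB backup before attempting any actual surgery.
mongo --ssl localhost:37017/juju \
  --eval 'db.stateServers.find().forEach(printjson)'
```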
| Cheryl Jennings (cherylj) wrote : | #16 |
Looking at this a bit more I'm getting more convinced that the upgrade failure is due to bug #1416928. I see that all of the containers I've sampled attempt to get the tools from the 10.0.3.1 address. They eventually succeed where machine-0-lxc-0 fails because they get an update that corrects the apiserver IPs to not include 10.0.3.1 (presumably because the state servers have been updated).
However, on machine-0-lxc-0, the watcher's connection to the state server dies before it gets the update:
2015-06-09 10:48:25 ERROR juju.worker.
2015-06-09 10:48:30 INFO juju.worker.
2015-06-09 11:05:01 ERROR juju.state.
...
2015-06-09 11:05:01 ERROR juju.state.
2015-06-09 11:05:01 INFO juju.cmd.jujud agent.go:177 error pinging *api.State: connection is shut down
2015-06-09 11:05:01 ERROR juju.worker runner.go:207 fatal "upgrader": error receiving message: read tcp 172.20.168.4:17070: connection timed out
Since machine-0-lxc-0 is still running 1.20.14, it doesn't filter out the 10.0.3.1 addresses when it tries to reconnect to the state servers:
2015-06-09 11:05:01 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-06-09 11:05:04 INFO juju.worker runner.go:260 start "api"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://
At this point, it never reconnects to the state servers because it's using the wrong IP. The fix for this would be the fix that's been released for bug #1416928.
| Cheryl Jennings (cherylj) wrote : | #17 |
Peter, I chatted with some folks about this again this afternoon, and there's not really anything we can do from a code change perspective since the upgrade failure is caused by a problem fixed in 1.22, and we are not releasing any more updates to 1.20. The one container, machine-0-lxc-0, just failed to update as the watcher lost its connection with the state server before getting the correct state server addresses.
If you hit this again, you should be able to manually modify the agent.conf file to remove the 10.0.3.1 address. I suspect they didn't show up as you mentioned in comment #12 because machine-0 at that time had been updated to 1.22, which filters them appropriately.
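For anyone hitting this again, a hedged sketch of that manual workaround; the service name and path assume the usual juju 1.x layout and the 0/lxc/0 machine tag from this report:

```
# Sketch only: stop the stuck agent, drop the lxcbr0 API address from its
# agent.conf, then start it again. Paths/service names assume a standard
# juju 1.x install; adjust the machine tag to the affected container.
sudo service jujud-machine-0-lxc-0 stop
sudo sed -i.bak '/10\.0\.3\.1:17070/d' \
    /var/lib/juju/agents/machine-0-lxc-0/agent.conf
sudo service jujud-machine-0-lxc-0 start
```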
Regarding the HA VIP address, juju status picks one of the addresses associated with the machine to display as the public address, and I can see in the logs that machine-0-lxc-0 had both 172.20.171.204 and 172.20.168.104 assigned to it. Should it not have both?
I'll be around in the morning my time tomorrow if you want to chat on IRC about this.
| Peter Sabaini (peter-sabaini) wrote : | #18 |
Re. the HA VIP address, I'm not really sure how the internals work here, but aiui the HA address is freely floating between instances and shouldn't be directly associated with a machine. I've never seen the HA address show up as the public-address of a container, fwiw. As an admin, it would be fairly surprising to me to get bounced to a different machine, just because the VIP has moved.
This is managed by the hacluster charm via the "ha" relation; I'm not sure if this is directly associated with a machine at all?
| no longer affects: | juju-core/1.22 |
| Cheryl Jennings (cherylj) wrote : | #19 |
From what I've read on the HA VIP with mysql, I would expect that the HA VIP would be assigned to one of the machines in the cluster at all times and would fail over to another machine should the first one become unresponsive. I'm just double checking that this is the expected behavior.
Once this HA VIP issue gets resolved, I'll close this bug out as a DUP of bug #1416928. We aren't able to backport that fix to 1.20, but there is a workaround of manually modifying the agent.conf file should you run into this again.
| Peter Sabaini (peter-sabaini) wrote : | #20 |
Yes, the HA VIP would be assigned to one of the cluster nodes at all times -- but IMO it shouldn't show up as the juju "public-address" of a container, because it simply isn't that. E.g. one of the upshots of the VIP leakage is that the HA VIP shows up as the node address in corosync.conf, which corosync fails on horribly, as a VIP can't be a node address at the same time.
Regarding the lxcbr0 address, this is probably #1416928. When watching /var/lib/
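To illustrate the corosync failure mode described above: the nodelist is expected to carry each node's own address, roughly as in the made-up fragment below, so a floating VIP leaking in as a node address breaks cluster formation.

```
# Illustrative corosync.conf fragment only -- the addresses are the mysql LXC
# addresses from this bug; the layout the hacluster charm renders may differ.
nodelist {
    node {
        ring0_addr: 172.20.171.204   # mysql/0 -- must be the node's own IP
        nodeid: 1
    }
    node {
        ring0_addr: 172.20.172.11    # mysql/1
        nodeid: 2
    }
    node {
        ring0_addr: 172.20.171.253   # mysql/2 -- never the floating VIP (172.20.168.104)
        nodeid: 3
    }
}
```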
| Peter Sabaini (peter-sabaini) wrote : | #21 |
FTR, I've now had some success with juju upgrades by stopping all HA resources prior to upgrading:
1) stop all hacluster resources
2) trigger upgrade
3) apply mongo hackery as per lp:1441478, restarting primary state server
4) reboot hanging lxc containers (check for hanging jujud-machine daemons)
5) start hacluster resources (check that resources started cleanly)
6) run 'juju resolved -r' on errored juju units
7) fix broken netns as necessary (cf. lp:1350947)
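A hedged command-level sketch of those steps; the resource and unit names below are illustrative for this deployment, and the mongo step is only referenced, not spelled out:

```
# Illustrative only -- adjust resource/unit names and versions to the deployment.
sudo crm resource stop res_mysql_vip          # 1) stop hacluster-managed resources
juju upgrade-juju --version 1.22.5            # 2) trigger the upgrade
# 3) apply the mongo fix from lp:1441478 and restart the primary state server
sudo reboot                                   # 4) on any LXC with a hung jujud-machine agent
sudo crm resource start res_mysql_vip         # 5) start hacluster resources again
juju resolved -r mysql-hacluster/0            # 6) retry errored units
# 7) repair broken network namespaces as needed (cf. lp:1350947)
```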
| Cheryl Jennings (cherylj) wrote : | #22 |
Apologies for the delay in getting back to you, but I didn't get notification of your updates.
At this point, I need to pull in someone who has more networking knowledge to help out. I'll email him this evening and get back with you tomorrow.
| Dimiter Naydenov (dimitern) wrote : | #23 |
Please provide logs showing the issues you're facing.
| James Troup (elmo) wrote : | #24 |
Dimiter, logs were provided earlier in the bug. If you need more, newer or different logs, that's fine, but please be more specific about what it is you need.
| Peter Sabaini (peter-sabaini) wrote : | #25 |
Update - the workaround in comment #21 turned out to be unreliable. We've had success on one upgrade, but on another we see VIPs leaking into public-addresses.
| James Tunnicliffe (dooferlad) wrote : | #26 |
It looks like 0-lxc-0 lost its connection to state during the upgrade to 1.22. It tried to download tools from a location that didn't work and didn't try anywhere else. 0-lxc-1 encountered the same problem, but then tried another tools location. I haven't worked out why there is a difference between the two containers, but that would be my next step.
<dooferlad> so, looking at machine-
<dooferlad> this is the same as before the upgrade
<chrome0> where local-cloud:
<dooferlad> yes
<dooferlad> is there basic network connectivity after the upgrade? Is it a case of "something picked the wrong IP address from that pair"?
<chrome0> basic network conn. is there, yes, if not using the VIP
<chrome0> and "something picked the wrong IP from that pair" sounds right as well :-)
<dooferlad> so the VIP isn't usable inside the cluster? Just outside the cluster it does work?
<chrome0> the VIP should be usable from the cluster, but i remember not being able to ssh' to it
<chrome0> "juju ssh
<chrome0> that is
<dooferlad> so, 172.20.168.104 doesn't show up in the unit-mysql-0.log after the upgrade, neither does 172.20.171.204.
<dooferlad> in fact it looks like 0-lxc-0 doesn't actually come back up with network connectivity
<dooferlad> so I would save "ip route -n" pre and post upgrade for machine 0 and machine 0-lxc-0
<dooferlad> and also "sudo iptables-save" (pre and post)
<chrome0> on staging we've had to reboot lxc's because of https:/
<dooferlad> but that should have been fixed after the upgrade
<chrome0> apparently it hit during/before the upgrade...
<chrome0> aiui
<dooferlad> OK, so from machine-
<dooferlad> machine-0, machine-1 and all other LXCs got the new jujud.
<dooferlad> so if the VIP was pointing at machine-0-lxc-0, then that is a big problem. The VIP needed to point to another machine when 0-lxc-0 didn't upgrade
<dooferlad> Both 0-lxc-0 and 0-lxc-1 tried to download from the same location at the same time, both failed, only 0-lxc-1 tried another location.
<dooferlad> it looks like 0-lxc-0 lost its connection to state and crapped out.
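A small sketch of the pre/post-upgrade capture suggested above, to be run on machine 0 and inside 0/lxc/0 (file names are arbitrary):

```
# Capture routing and firewall state so the pre- and post-upgrade outputs
# can be diffed; run with stage=pre before and stage=post after the upgrade.
stage=pre
ip addr show       > "net-${stage}-addr.txt"
ip route show      > "net-${stage}-route.txt"
sudo iptables-save > "net-${stage}-iptables.txt"
```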
| Billy Olsen (billy-olsen) wrote : | #27 |
Marking as affecting hacluster in the openstack charms, as there may be a change we can make to prevent using the VIP in this scenario, as a sanity check.
A workaround discussed with jillrouleau is to forcibly modify the hacluster template to include the IP addresses and node assignments.
| Jill Rouleau (jillrouleau) wrote : | #28 |
Regarding the HA VIP assignment. On a fresh deploy of 1.24 we had this happen with 2 openstack services. For example with glance: https:/
| Dimiter Naydenov (dimitern) wrote : | #29 |
How do you set the VIP address from the charms?
It seems to me you're trying to convince juju to take the VIP as a public address, but perhaps in some weird way, which interferes with the way juju selects private/public addresses for units.
| Dimiter Naydenov (dimitern) wrote : | #30 |
It's relatively easy to debug (at TRACE log level) what address gets selected for public or private. There's a series of:
...exactScope match: index=..
...fallbackScope match: index=...
etc.
All of them are in the juju.network logger (address.go).
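A sketch of turning that on in a 1.x environment; the logger name follows the comment above, and the grep pattern is only a guess at the exact message text:

```
# Enable TRACE output for the address-selection logic, then watch the
# machine log for the exactScope/fallbackScope match lines.
juju set-env logging-config="juju.network=TRACE"
juju ssh 0 'tail -f /var/log/juju/machine-0.log | grep -i "scope match"'
```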
| Peter Sabaini (peter-sabaini) wrote : | #31 |
Dimiter, we don't set the VIP from a charm directly; those are managed via corosync/pacemaker. Typically, the hacluster charm would configure an ocf:heartbeat:
primitive res_mysql_vip ocf:heartbeat:
params ip="172.20.168.104" cidr_netmask="24" nic="eth0"
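For completeness, a typical full definition would look roughly like the following; the IPaddr2 resource agent name and the monitor op are assumptions, since the original text is cut off above:

```
# Illustrative crm configuration -- IPaddr2 and the monitor interval are
# assumed, not copied from this deployment.
primitive res_mysql_vip ocf:heartbeat:IPaddr2 \
    params ip="172.20.168.104" cidr_netmask="24" nic="eth0" \
    op monitor interval="10s"
```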
| Peter Sabaini (peter-sabaini) wrote : | #32 |
I'm also starting to wonder about the semantics of "public-address" here.
I could be totally off, but from my reading of code/comments in state/unit.go and state/machine.go, public addresses of machines are derived from a) what the provider thinks the machine's address should be, and b) what the machine itself thinks its address should be.
The latter IMHO is a questionable source of truth for this purpose. In the face of multihoming, failover, virtual IP addresses, NATting, tunnels &c. a machine's idea of how it should be reached can be irrelevant and actually highly misleading. IMHO this shouldn't factor into the decision on how a machine should be addressed at all.
| Dimiter Naydenov (dimitern) wrote : | #33 |
Both private and public addresses are determined this way - from a merged list of what the provider tells us about the machine and what we can see on the machine itself, which is then sorted by value (e.g. 10.10.0.10 before 10.11.0.10) and scope (public, local-cloud, local-machine for public-address; local-cloud, local-machine, public for private-address). Provider addresses and instance status are polled frequently (every few seconds) until "running", then less frequently (every 15m). Machine addresses are read once on machine agent startup and updated (filtering lxcbr0 addresses first, if any). Changes to machine addresses are not detected unless the machine agent is restarted (or of course after a reboot). So the machine addresses shouldn't be a problem really.
Is it possible that setting the VIP address set by corosync/pacemaker somehow triggers either a restart of the machine agent or changes what juju sees in the 'ip_addresses' field of MAAS node details API?
I've seen somewhere in the logs that disable-
Let's sync up on IRC - when will you be available?
| Peter Sabaini (peter-sabaini) wrote : | #34 |
Even if polled less frequently I don't see how machine-provided addresses can do any good, and they potentially can do a lot of harm - as indeed they seem to have done here.
Just setting a VIP by corosync should never trigger a machine restart, that would counteract the very purpose of corosync for this usecase. Corosync is unaware of juju; I don't know of any interaction with MAAS.
We set disable-
I'm on CEST, ping me anytime on IRC.
| Dimiter Naydenov (dimitern) wrote : | #35 |
We had an IRC chat to get more info, but I'm unable to connect to the bootstack-infra jumphost despite trying multiple configs/
| Dimiter Naydenov (dimitern) wrote : | #36 |
Following Peter's instructions I managed to observe the issue. A few notes:
* It only happens after upgrading 1.20.x to 1.22.x (or later).
* It happens because the machine agent is restarted during the upgrade, which then triggers the machiner worker to discover all IP addresses it can see on the machine and save them with SetMachineAddresses API.
* The above *only* happens if the ceilometer-
As discussed, the short-term solution we can do for 1.24.2 is to introduce a new environment setting (flag) like "ignore-
* The machiner worker (and anything else that might call the SetMachineAddresses API) won't discover machine IP addresses on startup and won't save them in the list of all addresses
* Instead, the SetMachineAddresses API will be called with an empty list, which in combination with "ignore-
* Setting the machine addresses to empty will in turn trigger any uniters on those machines to fire config-changed hooks for their hosted units.
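Assuming the new flag ends up with a name along the lines of ignore-machine-addresses (the exact name is cut off above), usage before an upgrade would look roughly like this:

```
# Hypothetical usage sketch -- the flag name is truncated in the comment
# above; set it before triggering the upgrade.
juju set-env ignore-machine-addresses=true
juju upgrade-juju --version 1.24.2
```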
| Ian Booth (wallyworld) wrote : | #37 |
Fix proposed in http://
| Changed in juju-core: | |
| assignee: | nobody → Ian Booth (wallyworld) |
| status: | Triaged → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |
| JuanJo Ciarlante (jjo) wrote : | #38 |
@thumper: thanks for committing the fix for 1.22.x series.
Will the juju 1.22 upgraded agent properly use the value
previously set-env'd under 1.20?
Asking because 1.20.14 seems to save it
(albeit as an unknown option), but as a string rather than a bool:
$ juju version
1.20.14-
$ juju set-env ignore-
$ juju get-env |grep true
disable-
ignore-
proxy-ssh: true
ssl-hostname-
| Peter Sabaini (peter-sabaini) wrote : | #39 |
FTR, this bug still manifests when upgrading from 1.22.6 --> 1.24.6
| Tim Penhey (thumper) wrote : | #40 |
@jjo yes when storing in 1.20.14, juju does not know about the configuration option, but is happy to store it.
When the upgrade happens and the value is then read, it is converted to a bool (at least it appears to be when tested locally).
@peter-sabaini when trying from 1.22.6 -> 1.24.6 did you set the value in the environment before attempting the upgrade?
| Peter Sabaini (peter-sabaini) wrote : | #41 |
I've run both cases, with and without ignore-
| Changed in hacluster (Juju Charms Collection): | |
| status: | New → Invalid |
| Changed in charm-hacluster: | |
| status: | New → Invalid |

