Failed upgrade, mixed up HA addresses
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack hacluster charm | | Undecided | Unassigned | |
| juju-core | | High | Ian Booth | |
| juju-core 1.22 | | High | Tim Penhey | |
| juju-core 1.24 | | High | Ian Booth | |
| hacluster (Juju Charms Collection) | | Undecided | Unassigned | |
Bug Description
After upgrading juju from 1.20.14 to 1.22.1, one mysql HA container a) failed to upgrade and b) had its IP address overwritten by the hacluster VIP address.
The mysql LXCs physically have these addresses:
mysql/0: 172.20.171.204
mysql/1: 172.20.172.11
mysql/2: 172.20.171.253
The HA-cluster VIP is set to 172.20.168.104.
After the upgrade, mysql/0 is still on the old juju version and has its address set to the former VIP:
mysql/0:
machine: 0/lxc/0
mysql/1:
machine: 1/lxc/4
mysql/2:
machine: 2/lxc/4
Notes:
* the LXCs are still reachable at their original addresses, i.e. the physical networking (sans HA, of course) is still intact, e.g.:
$ for h in 172.20.171.204 172.20.172.11 172.20.171.253 ; do ssh -l ubuntu $h hostname ; done
Warning: Permanently added '172.20.171.204' (ECDSA) to the list of known hosts.
juju-machine-
Connection to 172.20.171.204 closed.
Warning: Permanently added '172.20.172.11' (ECDSA) to the list of known hosts.
juju-machine-
Connection to 172.20.172.11 closed.
Warning: Permanently added '172.20.171.253' (ECDSA) to the list of known hosts.
juju-machine-
Connection to 172.20.171.253 closed.
* I've hit lp:1441478 during the upgrade and applied the manual db hackery mentioned in comment #2
* Also, during the upgrade machine-1 lost DNS resolution for a short time due to DHCP flakiness (unrelated)
| description: | updated |
| Changed in juju-core: | |
| milestone: | none → 1.25.0 |
| Changed in juju-core: | |
| importance: | Undecided → High |
| tags: | added: blocker |
| Changed in juju-core: | |
| status: | New → In Progress |
| assignee: | nobody → Eric Snow (ericsnowcurrently) |
| Eric Snow (ericsnowcurrently) wrote : | #1 |
| tags: | added: ha upgrade-juju |
| Changed in juju-core: | |
| assignee: | Eric Snow (ericsnowcurrently) → nobody |
| status: | In Progress → Triaged |
In this context, is HA Juju's concept of HA? Or MySQL's?
Do you have any logs for machine 0/lxc/0 that we could take a look at?
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| Peter Sabaini (peter-sabaini) wrote : | #3 |
Uploading /var/log of machine-0
| Eric Snow (ericsnowcurrently) wrote : | #4 |
I took a look but really don't have enough familiarity with how upgrade works to make much progress without logs.
| Peter Sabaini (peter-sabaini) wrote : | #5 |
Uploading /var/log of machine-1
| Peter Sabaini (peter-sabaini) wrote : | #6 |
HA as implemented by the hacluster charm, ie. corosync/pacemaker
| Peter Sabaini (peter-sabaini) wrote : | #7 |
I failed to mention yesterday that keystone and possibly other
HA-ified services had similar issues as mysql.
Also, I've retried an upgrade today:
a. Upgrading from 1.20.14 --> 1.22.5.
b. Issue lp:1441478 struck again
c. On the mysql units I could observe a similar pattern as yesterday, after ~10min I could see the HA VIP on mysql/0 https:/
d. However! After another ~10 mins the mysql service partially recovered and mysql/0 got its proper IP back. Still, the mysql-hacluster subordinate remained down, and the config-changed hook hung:
2015-06-10 08:43:35 INFO juju.utils.fslock fslock.go:146 attempted lock failed "uniter-
This didn't change after 1h+
e. Also, the keystone service had similar issues after 1h+, i.e. the HA VIP on keystone/1; note the agent-versions: https:/
I've dumped /var/lib and /var/log from machines-{1,2,3} and can make them available.
| Peter Sabaini (peter-sabaini) wrote : | #8 |
Forgot to add status of mysql after partial resolve:
mysql/0:
machine: 0/lxc/0
mysql/1:
machine: 1/lxc/4
mysql/2:
machine: 2/lxc/4
| Eric Snow (ericsnowcurrently) wrote : | #9 |
Sorry we haven't been able to look at this closely yet, Peter. We'll be doing so today and hopefully have answers for you soon.
| Changed in juju-core: | |
| status: | Incomplete → Triaged |
| Cheryl Jennings (cherylj) wrote : | #10 |
Looking at the logs, I see that machine-0-lxc-0 tried to upgrade, but failed to fetch the tools:
2015-06-09 10:48:25 ERROR juju.worker.
After that, it was never able to connect to the state server; it dials a different address than the one it had connected to previously:
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://
Other containers failed to fetch tools from 10.0.3.1, but tried other servers and succeeded. Somehow the addresses for the servers got messed up. Still looking...
| Cheryl Jennings (cherylj) wrote : | #11 |
I think this might be a dup of bug #1416928.
| Peter Sabaini (peter-sabaini) wrote : | #12 |
Cheryl, I've briefly looked at bug #1416928 because I wondered the same thing. As a test I've rebooted machine-0 and got this in /var/lib/
apiaddresses:
- 172.20.168.2:17070
- 172.20.168.3:17070
- 172.20.168.4:17070
I.e. the priv_net addresses, not lxcbr0.
Also, shouldn't that bug be fixed in 1.22.1 and 1.22.5?
| Cheryl Jennings (cherylj) wrote : | #13 |
When machine-0 encountered this problem, it hadn't updated yet to 1.22, so it was still on 1.20.14.
Also, regarding the overwriting of the IP, I see in the log that the IP wasn't overwritten, but rather assigned multiple values, and juju status most likely just picked one to display (IIRC there was a bug not too long ago regarding the ordering of IPs displayed in juju status?):
2015-06-09 10:46:17 INFO juju.worker.
Going to pull in some folks with more networking knowledge to take a look.
| James Tunnicliffe (dooferlad) wrote : | #14 |
10.0.3.1 is suspicious since that is what lxc will pick for a bridge, so the theory that bug #1416928 is at least related seems reasonable. It looks like the URL that is being provided by the upgrader Tools API call (which is just common.
| Dimiter Naydenov (dimitern) wrote : | #15 |
Looking at the logs I think the issue is around not having an upgrade step from 1.20.x to 1.22.x which filters out any lxcbr0 addresses from the API hosts/ports. 10.0.3.1 was among the list of API hostPorts along with quite a lot of IPv6 link-local addresses:
https:/
So: 1) to fix the issue some mongo surgery is needed (to salvage the environment) - drop all 10.0.3.0/24 and fe80::/64 addresses from the hostPorts (stateServers?) collections; 2) to make sure it doesn't happen again - add an upgrade step, run on the DatabaseMaster, which gets the current API hostPorts, converts them to []network.Address for each server (network.
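For reference, a very rough, read-only sketch of the kind of inspection that would precede such surgery; the database name, port, and collection name here are assumptions based on the comment above, not verified against the actual schema:

```
# Read-only inspection sketch only. The collection name ("stateServers") is
# taken from the comment above and may not match the real schema; the auth
# flags (admin user/password from the state server's agent.conf) are omitted.
# Take a DB backup before attempting any actual surgery.
mongo --ssl localhost:37017/juju \
  --eval 'db.stateServers.find().forEach(printjson)'
```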
| Cheryl Jennings (cherylj) wrote : | #16 |
Looking at this a bit more I'm getting more convinced that the upgrade failure is due to bug #1416928. I see that all of the containers I've sampled attempt to get the tools from the 10.0.3.1 address. They eventually succeed where machine-0-lxc-0 fails because they get an update that corrects the apiserver IPs to not include 10.0.3.1 (presumably because the state servers have been updated).
However, on machine-0-lxc-0, the watcher's connection to the state server dies before it gets the update:
2015-06-09 10:48:25 ERROR juju.worker.
2015-06-09 10:48:30 INFO juju.worker.
2015-06-09 11:05:01 ERROR juju.state.
...
2015-06-09 11:05:01 ERROR juju.state.
2015-06-09 11:05:01 INFO juju.cmd.jujud agent.go:177 error pinging *api.State: connection is shut down
2015-06-09 11:05:01 ERROR juju.worker runner.go:207 fatal "upgrader": error receiving message: read tcp 172.20.168.4:17070: connection timed out
Since machine-0-lxc-0 is still running 1.20.14, it doesn't filter out the 10.0.3.1 addresses when it tries to reconnect to the state servers:
2015-06-09 11:05:01 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-06-09 11:05:04 INFO juju.worker runner.go:260 start "api"
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:05:04 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-09 11:07:11 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://
At this point, it never reconnects to the state servers because it's using the wrong IP. The fix for this would be the fix that's been released for bug #1416928.
| Cheryl Jennings (cherylj) wrote : | #17 |
Peter, I chatted with some folks about this again this afternoon, and there's not really anything we can do from a code change perspective since the upgrade failure is caused by a problem fixed in 1.22, and we are not releasing any more updates to 1.20. The one container, machine-0-lxc-0, just failed to update as the watcher lost its connection with the state server before getting the correct state server addresses.
If you hit this again, you should be able to manually modify the agent.conf file to remove the 10.0.3.1 address. I suspect they didn't show up as you mentioned in comment #12 because machine-0 at that time had been updated to 1.22, which filters them appropriately.
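For anyone hitting this again, a hedged sketch of that manual workaround; the service name and path assume the usual juju 1.x layout and the 0/lxc/0 machine tag from this report:

```
# Sketch only: stop the stuck agent, drop the lxcbr0 API address from its
# agent.conf, then start it again. Paths/service names assume a standard
# juju 1.x install; adjust the machine tag to the affected container.
sudo service jujud-machine-0-lxc-0 stop
sudo sed -i.bak '/10\.0\.3\.1:17070/d' \
    /var/lib/juju/agents/machine-0-lxc-0/agent.conf
sudo service jujud-machine-0-lxc-0 start
```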
Regarding the HA VIP address, juju status picks one of the addresses associated with the machine to display as the public address, and I can see in the logs that machine-0-lxc-0 had both 172.20.171.204 and 172.20.168.104 assigned to it. Should it not have both?
I'll be around in the morning my time tomorrow if you want to chat on IRC about this.
| Peter Sabaini (peter-sabaini) wrote : | #18 |
Re. the HA VIP address, I'm not really sure how the internals work here, but aiui the HA address is freely floating between instances and shouldn't be directly associated with a machine. I've never seen the HA address show up as the public-address of a container, fwiw. As an admin, it would be fairly surprising to me to get bounced to a different machine, just because the VIP has moved.
This is managed by the hacluster charm via the "ha" relation; I'm not sure if this is directly associated with a machine at all?
| no longer affects: | juju-core/1.22 |
| Cheryl Jennings (cherylj) wrote : | #19 |
From what I've read on the HA VIP with mysql, I would expect that the HA VIP would be assigned to one of the machines in the cluster at all times and would fail over to another machine should the first one become unresponsive. I'm just double checking that this is the expected behavior.
Once this HA VIP issue gets resolved, I'll close this bug out as a DUP of bug #1416928. We aren't able to backport that fix to 1.20, but there is a workaround of manually modifying the agent.conf file should you run into this again.
| Peter Sabaini (peter-sabaini) wrote : | #20 |
Yes, the HA VIP would be assigned to one of the cluster nodes at all times -- but IMO it shouldn't show up as the juju "public-address" of a container, because it simply isn't that. E.g. one of the upshots of the VIP leakage is that the HA VIP shows up as the node address in corosync.conf, which corosync fails on horribly, as a VIP can't be a node address at the same time.
Regarding the lxcbr0 address, this is probably #1416928. When watching /var/lib/
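To illustrate the corosync failure mode described above: the nodelist is expected to carry each node's own address, roughly as in the made-up fragment below, so a floating VIP leaking in as a node address breaks cluster formation.

```
# Illustrative corosync.conf fragment only -- the addresses are the mysql LXC
# addresses from this bug; the layout the hacluster charm renders may differ.
nodelist {
    node {
        ring0_addr: 172.20.171.204   # mysql/0 -- must be the node's own IP
        nodeid: 1
    }
    node {
        ring0_addr: 172.20.172.11    # mysql/1
        nodeid: 2
    }
    node {
        ring0_addr: 172.20.171.253   # mysql/2 -- never the floating VIP (172.20.168.104)
        nodeid: 3
    }
}
```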
| Peter Sabaini (peter-sabaini) wrote : | #21 |
FTR, I've now had some success with juju upgrades by stopping all HA resources prior to upgrading:
1) stop all hacluster resources
2) trigger upgrade
3) apply mongo hackery as per lp:1441478, restarting primary state server
4) reboot hanging lxc containers (check for hanging jujud-machine daemons)
5) start hacluster resources (check that resources started cleanly)
6) run 'juju resolved -r' on errored juju units
7) fix broken netns as necessary (cf. lp:1350947)
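A hedged command-level sketch of those steps; the resource and unit names below are illustrative for this deployment, and the mongo step is only referenced, not spelled out:

```
# Illustrative only -- adjust resource/unit names and versions to the deployment.
sudo crm resource stop res_mysql_vip          # 1) stop hacluster-managed resources
juju upgrade-juju --version 1.22.5            # 2) trigger the upgrade
# 3) apply the mongo fix from lp:1441478 and restart the primary state server
sudo reboot                                   # 4) on any LXC with a hung jujud-machine agent
sudo crm resource start res_mysql_vip         # 5) start hacluster resources again
juju resolved -r mysql-hacluster/0            # 6) retry errored units
# 7) repair broken network namespaces as needed (cf. lp:1350947)
```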
| Cheryl Jennings (cherylj) wrote : | #22 |
Apologies for the delay in getting back to you, but I didn't get notification of your updates.
At this point, I need to pull in someone who has more networking knowledge to help out. I'll email him this evening and get back with you tomorrow.
| Dimiter Naydenov (dimitern) wrote : | #23 |
Please provide logs showing the issues you're facing.
| James Troup (elmo) wrote : | #24 |
Dimiter, logs were provided earlier in the bug. If you need more, newer or different logs, that's fine, but please be more specific about what it is you need.
| Peter Sabaini (peter-sabaini) wrote : | #25 |
Update - the workaround in comment #21 turned out to be unreliable. We've had success on one upgrade, but on another we see VIPs leaking into public-addresses.
| James Tunnicliffe (dooferlad) wrote : | #26 |
It looks like 0-lxc-0 lost its connection to state during the upgrade to 1.22. It tried to download tools from a location that didn't work and didn't try anywhere else. 0-lxc-1 encountered the same problem, but then tried another tools location. I haven't worked out why there is a difference between the two containers, but that would be my next step.
<dooferlad> so, looking at machine-
<dooferlad> this is the same as before the upgrade
<chrome0> where local-cloud:
<dooferlad> yes
<dooferlad> is there basic network connectivity after the upgrade? Is it a case of "something picked the wrong IP address from that pair"?
<chrome0> basic network conn. is there, yes, if not using the VIP
<chrome0> and "something picked the wrong IP from that pair" sounds right as well :-)
<dooferlad> so the VIP isn't usable inside the cluster? Just outside the cluster it does work?
<chrome0> the VIP should be usable from the cluster, but i remember not being able to ssh' to it
<chrome0> "juju ssh
<chrome0> that is
<dooferlad> so, 172.20.168.104 doesn't show up in the unit-mysql-0.log after the upgrade, neither does 172.20.171.204.
<dooferlad> in fact it looks like 0-lxc-0 doesn't actually come back up with network connectivity
<dooferlad> so I would save "ip route -n" pre and post upgrade for machine 0 and machine 0-lxc-0
<dooferlad> and also "sudo iptables-save" (pre and post)
<chrome0> on staging we've had to reboot lxc's because of https:/
<dooferlad> but that should have been fixed after the upgrade
<chrome0> apparently it hit during/before the upgrade...
<chrome0> aiui
<dooferlad> OK, so from machine-
<dooferlad> machine-0, machine-1 and all other LXCs got the new jujud.
<dooferlad> so if the VIP was pointing at machine-0-lxc-0, then that is a big problem. The VIP needed to point to another machine when 0-lxc-0 didn't upgrade
<dooferlad> Both 0-lxc-0 and 0-lxc-1 tried to download from the same location at the same time, both failed, only 0-lxc-1 tried another location.
<dooferlad> it looks like 0-lxc-0 lost its connection to state and crapped out.
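A small sketch of the pre/post-upgrade capture suggested above, to be run on machine 0 and inside 0/lxc/0 (file names are arbitrary):

```
# Capture routing and firewall state so the pre- and post-upgrade outputs
# can be diffed; run with stage=pre before and stage=post after the upgrade.
stage=pre
ip addr show       > "net-${stage}-addr.txt"
ip route show      > "net-${stage}-route.txt"
sudo iptables-save > "net-${stage}-iptables.txt"
```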
| Billy Olsen (billy-olsen) wrote : | #27 |
Marking as affecting hacluster in the openstack charms, as there may be a change we can make to prevent using the VIP in this scenario, as a sanity check.
A workaround discussed with jillrouleau is to forcibly modify the hacluster template to include the IP addresses and node assignments.
| Jill Rouleau (jillrouleau) wrote : | #28 |
Regarding the HA VIP assignment. On a fresh deploy of 1.24 we had this happen with 2 openstack services. For example with glance: https:/
| Dimiter Naydenov (dimitern) wrote : | #29 |
How do you set the VIP address from the charms?
It seems to me you're trying to convince juju to take the VIP as a public address, but perhaps in some weird way, which interferes with the way juju selects private/public addresses for units.
| Dimiter Naydenov (dimitern) wrote : | #30 |
It's relatively easy to debug (at TRACE log level) what address gets selected for public or private. There's a series of:
...exactScope match: index=..
...fallbackScope match: index=...
etc.
All of them are in the juju.network logger (address.go).
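A sketch of turning that on in a 1.x environment; the logger name follows the comment above, and the grep pattern is only a guess at the exact message text:

```
# Enable TRACE output for the address-selection logic, then watch the
# machine log for the exactScope/fallbackScope match lines.
juju set-env logging-config="juju.network=TRACE"
juju ssh 0 'tail -f /var/log/juju/machine-0.log | grep -i "scope match"'
```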
| Peter Sabaini (peter-sabaini) wrote : | #31 |
Dimiter, we don't set the VIP from a charm directly; those are managed via corosync/pacemaker. Typically, the hacluster charm would configure an ocf:heartbeat:
primitive res_mysql_vip ocf:heartbeat:
params ip="172.20.168.104" cidr_netmask="24" nic="eth0"
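For completeness, a typical full definition would look roughly like the following; the IPaddr2 resource agent name and the monitor op are assumptions, since the original text is cut off above:

```
# Illustrative crm configuration -- IPaddr2 and the monitor interval are
# assumed, not copied from this deployment.
primitive res_mysql_vip ocf:heartbeat:IPaddr2 \
    params ip="172.20.168.104" cidr_netmask="24" nic="eth0" \
    op monitor interval="10s"
```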
| Peter Sabaini (peter-sabaini) wrote : | #32 |
I'm also starting to wonder about the semantics of "public-address" here.
I could be totally off, but from my reading of code/comments in state/unit.go and state/machine.go, public addresses of machines are derived from a) what the provider thinks the machine's address should be, and b) what the machine itself thinks its address should be.
The latter IMHO is a questionable source of truth for this purpose. In the face of multihoming, failover, virtual IP addresses, NATting, tunnels &c. a machine's idea of how it should be reached can be irrelevant and actually highly misleading. IMHO this shouldn't factor into the decision on how a machine should be addressed at all.
| Dimiter Naydenov (dimitern) wrote : | #33 |
Both private and public addresses are determined this way - from a merged list of what the provider tells us about the machine and what we can see on the machine itself, which is then sorted by value (e.g. 10.10.0.10 before 10.11.0.10) and scope (public, local-cloud, local-machine for public-address; local-cloud, local-machine, public for private-address). Provider addresses and instance status are polled frequently (every few seconds) until "running", then less frequently (every 15m). Machine addresses are read once on machine agent startup and updated (filtering lxcbr0 addresses first, if any). Changes to machine addresses are not detected unless the machine agent is restarted (or of course after a reboot). So the machine addresses shouldn't be a problem really.
Is it possible that setting the VIP address set by corosync/pacemaker somehow triggers either a restart of the machine agent or changes what juju sees in the 'ip_addresses' field of MAAS node details API?
I've seen somewhere in the logs that disable-
Let's sync up on IRC - when will you be available?
| Peter Sabaini (peter-sabaini) wrote : | #34 |
Even if polled less frequently I don't see how machine-provided addresses can do any good, and they potentially can do a lot of harm - as indeed they seem to have done here.
Just setting a VIP by corosync should never trigger a machine restart, that would counteract the very purpose of corosync for this usecase. Corosync is unaware of juju; I don't know of any interaction with MAAS.
We set disable-
I'm on CEST, ping me anytime on IRC.
| Dimiter Naydenov (dimitern) wrote : | #35 |
We had an IRC chat to get more info, but I'm unable to connect to the bootstack-infra jumphost despite trying multiple configs/
| Dimiter Naydenov (dimitern) wrote : | #36 |
Following Peter's instructions I managed to observe the issue. A few notes:
* It only happens after upgrading 1.20.x to 1.22.x (or later).
* It happens because the machine agent is restarted during the upgrade, which then triggers the machiner worker to discover all IP addresses it can see on the machine and save them with SetMachineAddresses API.
* The above *only* happens if the ceilometer-
As discussed, the short-term solution we can do for 1.24.2 is to introduce a new environment setting (flag) like "ignore-
* The machiner worker (and anything else that might call the SetMachineAddresses API) won't discover machine IP addresses on startup and won't save them in the list of all addresses
* Instead, the SetMachineAddresses API will be called with an empty list, which in combination with "ignore-
* Setting the machine addresses to empty will in turn trigger any uniters on those machines to fire config-changed hooks for their hosted units.
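Assuming the new flag ends up with a name along the lines of ignore-machine-addresses (the exact name is cut off above), usage before an upgrade would look roughly like this:

```
# Hypothetical usage sketch -- the flag name is truncated in the comment
# above; set it before triggering the upgrade.
juju set-env ignore-machine-addresses=true
juju upgrade-juju --version 1.24.2
```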
| Ian Booth (wallyworld) wrote : | #37 |
Fix proposed in http://
| Changed in juju-core: | |
| assignee: | nobody → Ian Booth (wallyworld) |
| status: | Triaged → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |
| JuanJo Ciarlante (jjo) wrote : | #38 |
@thumper: thanks for committing the fix for 1.22.x series.
Will the juju 1.22 upgraded agent properly use the value
previously set-env'd under 1.20?
Asking because 1.20.14 seems to save it
(albeit as an unknown option), but as a string rather than a bool:
$ juju version
1.20.14-
$ juju set-env ignore-
$ juju get-env |grep true
disable-
ignore-
proxy-ssh: true
ssl-hostname-
| Peter Sabaini (peter-sabaini) wrote : | #39 |
FTR, this bug still manifests when upgrading from 1.22.6 --> 1.24.6
| Tim Penhey (thumper) wrote : | #40 |
@jjo yes when storing in 1.20.14, juju does not know about the configuration option, but is happy to store it.
When the upgrade happens and the value is then read, it is converted to a bool (at least it appears to be when tested locally).
@peter-sabaini when trying from 1.22.6 -> 1.24.6 did you set the value in the environment before attempting the upgrade?
| Peter Sabaini (peter-sabaini) wrote : | #41 |
I've run both cases, with and without ignore-
| Changed in hacluster (Juju Charms Collection): | |
| status: | New → Invalid |
| Changed in charm-hacluster: | |
| status: | New → Invalid |

