keepalived brings the VIP up in multiple places on a controller reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Expired | High | Unassigned |
Bug Description
keepalived may be racing network reconfiguration on reboot. In combination with a default quorum setting this leads to a split brain situation.
One trace of this is the "Netlink: filter function error" report from keepalived in syslog. I think this is down to OVS reconfiguring the network asynchronously, which seems to confuse keepalived (version 1.2.13-1).
The _consequence_ is that occasionally, when a controller node that's hosting the VIP is rebooted, the VIP moves correctly, but then the original node comes back up and (for some reason) cannot see the remainder of the cluster. The VIP is then brought back up on that controller as well, so it is live in two places at once.
I think there are two problems here. The first is the keepalived race (if you SIGHUP keepalived on the rebooted host, it recovers correctly). The second is that keepalived should probably be configured with a quorum greater than the default of 1 for this situation, which would prevent a split-brain keepalived from hosing the working half of the cluster.
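To illustrate why a quorum above 1 would help, here is a minimal Python sketch of the majority rule. It is not keepalived's configuration syntax or its actual logic; `may_claim_vip` and `majority` are hypothetical names invented for this illustration, and it assumes a node counts itself among the reachable members.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of members that is more than half of the cluster."""
    return cluster_size // 2 + 1


def may_claim_vip(visible_peers: int, cluster_size: int, quorum: int = 1) -> bool:
    """Decide whether a node may hold the VIP (illustrative rule only).

    visible_peers is the number of *other* controllers this node can see;
    the node counts itself, so an isolated node has 1 reachable member.
    """
    if not 1 <= quorum <= cluster_size:
        raise ValueError("quorum must be between 1 and cluster_size")
    reachable = visible_peers + 1
    return reachable >= quorum


# Three controllers; the rebooted node sees nobody else.
print(may_claim_vip(0, 3))                      # quorum of 1: True
print(may_claim_vip(0, 3, quorum=majority(3)))  # quorum of 2: False
print(may_claim_vip(1, 3, quorum=majority(3)))  # surviving pair: True
```

With a quorum of 1 the isolated node is allowed to raise the VIP. With a majority quorum it is not, while the two nodes that can still see each other keep it.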
A reproducer (which only works some of the time, alas):
$ for ii in 107 109 104; do ssh heat-admin@
overcloud-
10.22.170.107 10.22.170.105
overcloud-
10.22.170.109
overcloud-
10.22.170.104
$ nova stop 3d37ae90-
$ for ii in 107 109 104; do ssh -i .ssh/seed_id_rsa heat-admin@
ssh: connect to host 10.22.170.107 port 22: No route to host
overcloud-
10.22.170.109
overcloud-
10.22.170.104
$ for ii in 109 104; do ssh -i .ssh/seed_id_rsa heat-admin@
overcloud-
10.22.170.109 10.22.170.105
overcloud-
10.22.170.104
$ nova start 3d37ae90-
$ for ii in 107 109 104; do ssh -i .ssh/seed_id_rsa heat-admin@
overcloud-
10.22.170.107 10.22.170.105
overcloud-
10.22.170.109 10.22.170.105
overcloud-
10.22.170.104
The rebooted controller0 shows the keepalived trace of "Netlink: filter function error" which is often harmless, but does at least show that OVS has been having a tinker under the hood.
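The failure in the last listing of the reproducer can be detected mechanically. The following is a rough Python sketch, assuming each output line is the set of addresses configured on that host. Hosts are keyed by their own address here because the hostnames are truncated in the report. `find_duplicated_addresses` is a hypothetical helper, not part of any TripleO tooling.

```python
from collections import defaultdict


def find_duplicated_addresses(listing):
    """Map each address seen on more than one host to the sorted hosts holding it."""
    holders = defaultdict(set)
    for host, addresses in listing.items():
        for address in addresses:
            holders[address].add(host)
    return {
        address: sorted(hosts)
        for address, hosts in holders.items()
        if len(hosts) > 1
    }


# Addresses copied from the final listing in the reproducer.
final_listing = {
    "10.22.170.107": ["10.22.170.107", "10.22.170.105"],
    "10.22.170.109": ["10.22.170.109", "10.22.170.105"],
    "10.22.170.104": ["10.22.170.104"],
}

duplicates = find_duplicated_addresses(final_listing)
print(duplicates)
# {'10.22.170.105': ['10.22.170.107', '10.22.170.109']}
```

An empty result means every address lives on exactly one host. Here the VIP 10.22.170.105 shows up on both the rebooted controller and the node that took over.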
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Changed in tripleo:
assignee: nobody → Julia Kreger (juliaashleykreger)
Changed in tripleo:
assignee: Julia Kreger (juliaashleykreger) → nobody
https://review.openstack.org/120443 is a really clunky workaround.