changing vip after deployment does not work

Bug #1952363 reported by Andre Ruiz
This bug affects 3 people
Affects                      Status        Importance  Assigned to    Milestone
Gnocchi Charm                New           Undecided   Unassigned
OpenStack Barbican Charm     New           Undecided   Unassigned
OpenStack Designate Charm    Triaged       High        Unassigned
OpenStack HA Cluster Charm   Invalid       High        Unassigned
OpenStack Magnum Charm       New           Undecided   Unassigned
OpenStack Octavia Charm      New           Undecided   Unassigned
charms.openstack             Fix Released  High        Felipe Reyes
vault-charm                  Fix Released  High        Felipe Reyes

Bug Description

Deploy 3 units of an application related to hacluster, with a vip set. Change the vip later. The application units, the hacluster units, and potentially the vault units will run their hooks for some time and finish without errors.

- A new vip resource will appear in crm status, but the old one will not be deleted
- The new vip resource will not be started, and the new IP will not be added to the interface
- The old IP will not be cleared and will still be present on the interface

In the end, except for a new vip resource appearing in crm, nothing will happen.

I would expect the old one to be stopped (the IP removed from the NIC) and deleted, and the new one started (the new IP added to the NIC).
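
For concreteness, the trigger is nothing more than a config change like the following (keystone is used here only as an example principal application; the actual commands used are in the comments below):

juju config keystone vip                 # show the current vip
juju config keystone vip="<new-vip>"     # change it

After this, "crm status" on a hacluster unit shows a second vip resource next to the old one, while the old IP stays on the interface.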

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1952363] [NEW] changing vip after deployment does not work

Hi Andre,

Thanks for the report. I have a couple of questions/requests:

1) What's the application related to hacluster?

2) Could you provide us with an anonymized version of the output of the
following commands?

In a hacluster unit run:
sudo crm configure show
sudo crm status

In the juju client run:

juju show-unit hacluster/N
juju show-unit $PRINCIPAL_CHARM/N

(note: replace N with the unit id)

3) Provide us with a juju crashdump or a copy of /var/log/juju from the
hacluster units.

Thanks,

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

I manually started the new resources, then stopped and deleted the old resources. The IPs are now correctly set on the NICs.

But the charm is now showing this state:

  hacluster-designate/1 waiting idle 100.126.0.154 Resource: res_designate_158f532_vip not yet configured
  hacluster-designate/2 waiting idle 100.126.0.161 Resource: res_designate_158f532_vip not yet configured
  hacluster-designate/0* waiting idle 100.126.0.135 Resource: res_designate_158f532_vip not yet configured

As if it's trying to create yet another resource (this ID is different from the other two).

Revision history for this message
Felipe Reyes (freyes) wrote :

I'm going to set the bug to "Incomplete"; please set it back to "New" once the data requested in comment #1 has been provided.

Changed in charm-hacluster:
status: New → Incomplete
Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

This is a generic issue and not specific to the deployment I'm doing now. I was able to reproduce it on LXD with a very simple bundle (just keystone + database + hacluster).

This is the bundle I used: https://pastebin.canonical.com/p/HTy3csYwjN/

This is before the change:

root@juju-88a011-3:~# ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
157: eth0@if158: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:d2:78:07 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 198.18.2.188/24 brd 198.18.2.255 scope global dynamic eth0
       valid_lft 2148sec preferred_lft 2148sec
    inet 198.18.2.49/24 brd 198.18.2.255 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fed2:7807/64 scope link
       valid_lft forever preferred_lft forever
159: eth1@if160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:65:6f:ab brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::216:3eff:fe65:6fab/64 scope link
       valid_lft forever preferred_lft forever

root@juju-88a011-3:~# crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-88a011-3 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Fri Nov 26 14:52:53 2021
  * Last change: Fri Nov 26 14:52:45 2021 by root via crm_node on juju-88a011-3
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ juju-88a011-3 juju-88a011-4 juju-88a011-5 ]

Full List of Resources:
  * Resource Group: grp_ks_vips:
    * res_ks_4f63c45_vip (ocf::heartbeat:IPaddr2): Started juju-88a011-3
  * Clone Set: cl_ks_haproxy [res_ks_haproxy]:
    * Started: [ juju-88a011-3 juju-88a011-4 juju-88a011-5 ]

root@juju-88a011-3:~# crm configure show
node 1000: juju-88a011-3
node 1001: juju-88a011-4
node 1002: juju-88a011-5
primitive res_ks_4f63c45_vip IPaddr2 \
 params ip=198.18.2.49 \
 op monitor timeout=20s interval=10s \
 op_params depth=0
primitive res_ks_haproxy lsb:haproxy \
 meta migration-threshold=INFINITY failure-timeout=5s \
 op monitor interval=5s
group grp_ks_vips res_ks_4f63c45_vip
clone cl_ks_haproxy res_ks_haproxy
property cib-bootstrap-options: \
 have-watchdog=false \
 dc-version=2.0.3-4b1f869f0f \
 cluster-infrastructure=corosync \
 cluster-name=debian \
 no-quorum-policy=stop \
 cluster-recheck-interval=60 \
 stonith-enabled=false \
 last-lrm-refresh=1637938243
rsc_defaults rsc-options: \
 resource-stickiness=100 \
 failure-timeout=180

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

This was the change:

>>> 11:57:15 andre@thinkpad ~
$ juju config keystone vip
198.18.2.49
>>> 11:57:21 andre@thinkpad ~
$ juju config keystone vip="198.18.2.48"
>>> 12:00:38 andre@thinkpad ~
$ juju config keystone vip
198.18.2.48

and this was AFTER the change:

root@juju-88a011-3:~# ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
157: eth0@if158: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:d2:78:07 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 198.18.2.188/24 brd 198.18.2.255 scope global dynamic eth0
       valid_lft 3252sec preferred_lft 3252sec
    inet 198.18.2.49/24 brd 198.18.2.255 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fed2:7807/64 scope link
       valid_lft forever preferred_lft forever
159: eth1@if160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:65:6f:ab brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::216:3eff:fe65:6fab/64 scope link
       valid_lft forever preferred_lft forever

root@juju-88a011-3:~# crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-88a011-3 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Fri Nov 26 15:04:24 2021
  * Last change: Fri Nov 26 14:58:16 2021 by hacluster via crmd on juju-88a011-4
  * 3 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ juju-88a011-3 juju-88a011-4 juju-88a011-5 ]

Full List of Resources:
  * Resource Group: grp_ks_vips:
    * res_ks_4f63c45_vip (ocf::heartbeat:IPaddr2): Started juju-88a011-3
  * Clone Set: cl_ks_haproxy [res_ks_haproxy]:
    * Started: [ juju-88a011-3 juju-88a011-4 juju-88a011-5 ]
  * res_ks_7c21ae6_vip (ocf::heartbeat:IPaddr2): Started juju-88a011-4

root@juju-88a011-3:~# crm configure show
node 1000: juju-88a011-3
node 1001: juju-88a011-4
node 1002: juju-88a011-5
primitive res_ks_4f63c45_vip IPaddr2 \
 params ip=198.18.2.49 \
 op monitor timeout=20s interval=10s \
 op_params depth=0
primitive res_ks_7c21ae6_vip IPaddr2 \
 params ip=198.18.2.48 \
 op monitor timeout=20s interval=10s \
 op_params depth=0
primitive res_ks_haproxy lsb:haproxy \
 meta migration-threshold=INFINITY failure-timeout=5s \
 op monitor interval=5s
group grp_ks_vips res_ks_4f63c45_vip
clone cl_ks_haproxy res_ks_haproxy
property cib-bootstrap-options: \
 have-watchdog=false \
 dc-version=2.0.3-4b1f869f0f \
 cluster-infrastructure=corosync \
 cluster-name=debian \
 no-quorum-policy=stop \
 cluster-recheck-interval=60 \
 stonith-enabled=false \
 last-lrm-refresh=1637938696
rsc_defaults rsc-options: \
 resource-stickiness=100 \
 failure-timeout=180

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Here are show units for keystone/0 and keystone-hacluster/0 before and after the change:

Before -> https://pastebin.canonical.com/p/Ryy785VcSX/
After -> https://pastebin.canonical.com/p/V5Z9nNNp2Q/

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

and this is a crashdump AFTER the change.

Changed in charm-hacluster:
status: Incomplete → New
Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

As can be seen, the new IP appears as a second vip resource. The first one is not removed. Also, the alias IP on the NIC is still the old one.

If you go ahead and manually do a "crm start" on the second vip, it works and the new IP appears on the NIC, together with the old one. You can also do a "crm stop" on the old one; that works as well.

But when you try to delete the old one, the charm changes its status to "resource <NEW-ID> not configured yet" and it's difficult to recover from that.
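
In other words, the manual workaround amounts to roughly the following on one of the hacluster units (resource names taken from the keystone crm output above; the exact commands are an illustration, not part of the original comment):

sudo crm resource start res_ks_7c21ae6_vip    # start the newly created vip resource
sudo crm resource stop res_ks_4f63c45_vip     # stop the old vip resource (removes the old IP from the NIC)
sudo crm configure delete res_ks_4f63c45_vip  # deleting the old resource is what leaves the charm in "not yet configured"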

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Just to be clear, I'm showing only one unit here because the other two did not have any VIPs (neither before, nor after). I checked that the new VIP did *not* start on a different unit.

Revision history for this message
Felipe Reyes (freyes) wrote :

I was able to reproduce it with these steps:

juju deploy ./my-vault.yaml # https://paste.ubuntu.com/p/P9QyKjtCRV/
juju wait

Initialize/unseal vault.

# change the vip from 10.5.250.250 to 10.5.250.251
juju config vault vip=10.5.250.251

juju show-unit hacluster/0
....
          json_groups: '{"grp_vault-ext_vips": "res_vault-ext_9ede165_vip res_vault-ext_c02fdce_vip"}'
          json_resource_params: '{"res_vault-ext_9ede165_vip": " params ip=\"10.5.250.251\" meta
            migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
            interval=\"10s\" depth=\"0\"", "res_vault-ext_c02fdce_vip": " params
            ip=\"10.5.250.250\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op
            monitor timeout=\"20s\" interval=\"10s\" depth=\"0\""}'
          json_resources: '{"res_vault-ext_9ede165_vip": "ocf:heartbeat:IPaddr2",
            "res_vault-ext_c02fdce_vip": "ocf:heartbeat:IPaddr2"}'
....

There should be a json_delete_resources key asking for the deletion of the old vip, and the old vip shouldn't be present in json_resource_params.
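
For illustration only, the expected relation data would then look roughly like this (the exact serialization of json_delete_resources is an assumption, not copied from a real unit):

          json_delete_resources: '["res_vault-ext_c02fdce_vip"]'
          json_groups: '{"grp_vault-ext_vips": "res_vault-ext_9ede165_vip"}'
          json_resource_params: '{"res_vault-ext_9ede165_vip": " params ip=\"10.5.250.251\" ..."}'
          json_resources: '{"res_vault-ext_9ede165_vip": "ocf:heartbeat:IPaddr2"}'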

Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → High
Felipe Reyes (freyes)
Changed in vault-charm:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

The same behavior can be seen in the designate charm.

This is the diff of "crm configure show" comparing a freshly deployed cluster using the ip address 10.5.100.0 with the same cluster after changing the vip to 10.5.100.1:

$ diff -u crm_configure_show.{1,2}
--- crm_configure_show.1 2021-11-29 19:47:26.482821006 -0300
+++ crm_configure_show.2 2021-11-29 20:00:11.349423541 -0300
@@ -1,6 +1,11 @@
 node 1000: juju-ea4e96-designate-2
 node 1001: juju-ea4e96-designate-1
 node 1002: juju-ea4e96-designate-3
+primitive res_designate_242d562_vip IPaddr2 \
+ params ip=10.5.100.1 \
+ meta migration-threshold=INFINITY failure-timeout=5s \
+ op monitor timeout=20s interval=10s \
+ op_params depth=0
 primitive res_designate_bf9661e_vip IPaddr2 \
  params ip=10.5.100.0 \
  meta migration-threshold=INFINITY failure-timeout=5s \
@@ -19,7 +24,7 @@
  no-quorum-policy=stop \
  cluster-recheck-interval=60 \
  stonith-enabled=false \
- last-lrm-refresh=1638215445
+ last-lrm-refresh=1638226172
 rsc_defaults rsc-options: \
  resource-stickiness=100 \
  failure-timeout=180

Changed in charm-designate:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

I actually first saw this problem in vault and then in designate, by pure coincidence. But I expected it to be generic and to affect every application clustered with the hacluster subordinate charm (that's why I tested with keystone next, as reported above).

Revision history for this message
Felipe Reyes (freyes) wrote :
Changed in vault-charm:
status: Triaged → In Progress
Revision history for this message
Felipe Reyes (freyes) wrote :

Adding a task for charms.openstack, since the HAOpenStackCharm class manages the vip(s) for OpenStack charms in its _add_ha_vips_config() method:

https://opendev.org/openstack/charms.openstack/src/branch/master/charms_openstack/charm/classes.py#L864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charms.openstack (master)
Changed in charms.openstack:
status: New → In Progress
Felipe Reyes (freyes)
Changed in charms.openstack:
assignee: nobody → Felipe Reyes (freyes)
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charms.openstack (master)

Reviewed: https://review.opendev.org/c/openstack/charms.openstack/+/819912
Committed: https://opendev.org/openstack/charms.openstack/commit/58fc4dadac4a40ab0d2e78897e0c77ffeeacb051
Submitter: "Zuul (22348)"
Branch: master

commit 58fc4dadac4a40ab0d2e78897e0c77ffeeacb051
Author: Felipe Reyes <email address hidden>
Date: Tue Nov 30 15:49:40 2021 -0300

    Register previously vip set for deletion.

    When the vip is changed the ones that are no longer present need to be
    registered for deletion from pacemaker's configuration. This change
    relies on hookenv.config.changed() to determine what vip(s) are no
    longer present in the configuration and ask hacluster to remove them.

    Closes-Bug: #1952363
    Change-Id: I1afe987ff26af0e10604dd507daef4ac282d9aab

Changed in charms.openstack:
status: In Progress → Fix Released
Changed in vault-charm:
status: In Progress → Fix Committed
milestone: none → 22.04
Revision history for this message
Felipe Reyes (freyes) wrote :

Marking the task for the hacluster charm as Invalid since the issue is on the principal charm side.

Changed in charm-hacluster:
status: Triaged → Invalid
Changed in vault-charm:
status: Fix Committed → Fix Released
Revision history for this message
Felipe Alencastro (falencastro) wrote :

We're adding a third vip to all HA applications, and found that some of them are affected by this (magnum, gnocchi, barbican) while others are not (aodh, heat, ceilometer, openstack-dashboard).

On the affected applications we have to remove the hacluster relation, which causes a service downtime, and then re-add it for the change to take effect.

Meanwhile, on the non-affected applications the third vip is added and works fine, but it is not included in the proper resource group (e.g. grp_gnocchi_vips). We add it to the group manually afterwards with "crm configure edit".
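
As an illustration of that last manual step, on one of the hacluster units (the group and resource names here are hypothetical):

sudo crm configure edit grp_gnocchi_vips
# in the editor, append the new vip resource id to the group definition, e.g.:
#   group grp_gnocchi_vips res_gnocchi_aaaaaaa_vip res_gnocchi_bbbbbbb_vip res_gnocchi_ccccccc_vip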
