infinite cluster-relation-changed loop when upgrading 21.04 charm-cinder from bionic-queens to rocky

Bug #1928383 reported by Drew Freiberger
This bug affects 2 people
Affects                 Status        Importance  Assigned to    Milestone
OpenStack Cinder Charm  Status tracked in Trunk
  2023.1                Fix Released  Undecided   Unassigned
  Train                 Triaged       Undecided   Unassigned
  Trunk                 Fix Released  High        Alex Kavanagh
  Ussuri                Triaged       Undecided   Unassigned
  Victoria              Triaged       Undecided   Unassigned
  Wallaby               Triaged       Undecided   Unassigned
  Xena                  Triaged       Undecided   Unassigned
  Yoga                  Triaged       Undecided   Unassigned
  Zed                   Fix Released  Undecided   Unassigned

Bug Description

I am performing a non-action-managed upgrade of the OpenStack API charms from the bionic distro origin (after a xenial-queens -> bionic series upgrade), on the 21.04 OpenStack charms.

I have a 3 unit cinder deployment with typical hacluster and cinder-ceph subordinates.

When I run 'juju config cinder openstack-origin=cloud:bionic-rocky action-managed-upgrade=false', the upgrade appears to update all of the packages properly and config-changed succeeds. Then I see a never-ending cascade of cluster-relation-changed hooks, which appears to be driven by relation data changes from each of the units:

While one cluster-relation-changed hook was running, I captured the 'juju show-unit' output for cinder/1 (one of the non-leaders); the relation data for "cinder-db-initialised-echo" is constantly changing.

Here are two iterations of the data that differed between successive 'juju show-unit' captures:
         cinder-db-initialised-echo: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
         cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774
         cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774

         cinder-db-initialised-echo: cinder/1-c1de47c8-dd95-43d6-8e6e-c5a11ef4d50f
         cinder-db-initialised-echo: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
         cinder-db-initialised-echo: cinder/1-c1de47c8-dd95-43d6-8e6e-c5a11ef4d50f

More context of this data from juju show-unit cinder/0:
    cinder/0:
      data:
        cinder-db-initialised: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774
        cinder-db-initialised-echo: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
    related-units:
      cinder/1:
        in-scope: true
        data:
          cinder-db-initialised: cinder/1-c1de47c8-dd95-43d6-8e6e-c5a11ef4d50f
          cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774
      cinder/2:
        in-scope: true
        data:
          cinder-db-initialised: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
          cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774

It looks like there was a race that happened with the database initialisation.
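
For reference, the flapping can be watched from a client machine by polling 'juju show-unit'; the following is a rough sketch in Python driving the juju CLI (the unit name and poll interval are only examples, not from the report):

# Rough sketch: poll 'juju show-unit' and print the echo key whenever it
# changes; the unit name and interval below are examples only.
import subprocess
import time

def echo_lines(unit='cinder/1'):
    out = subprocess.run(['juju', 'show-unit', unit],
                         capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines()
            if 'cinder-db-initialised-echo' in line]

last = None
while True:
    current = echo_lines()
    if current != last:
        print(time.strftime('%H:%M:%S'))
        for line in current:
            print(' ', line)
        last = current
    time.sleep(5)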

Revision history for this message
Drew Freiberger (afreiberger) wrote :

subscribing field-critical as this is actively causing a volume service outage on a production cloud.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Worked around this by hard-setting the relation data for the initialised keys on every unit to the value defined by the leader unit:

$ for i in 0 1 2; do juju run -u cinder/$i -- relation-set -r cluster:4 cinder-db-initialised=cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881 cinder-db-initialised-echo=cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881; done

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I'm about to do this tomorrow with ServerStack from queens to rocky. I'm going to do an action-managed-upgrade to see/verify whether it has the same behaviour. If not, then this might be the recommended workaround for cinder; I'll update here after my upgrade.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Since a workaround is available, dropping this to field-high, though this remains an important bug.

@afreiberger is it possible to get some sanitized logs for this?

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

If I'm reading this correctly, this upgrade has missed one of the key pre-upgrade requirements (https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/upgrade-series.html#pre-upgrade-requirements): "All currently deployed charms should be upgraded to the latest stable charm revision." Over the last two years, many improvements have been made to the charms with regard to how they handle upgrades, whether of the charms themselves, of OpenStack, or of Ubuntu.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I completely failed to read the 21.04 bit in the description, apologies!

James Troup (elmo)
tags: added: openstack-upgrade
Changed in charm-cinder:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-cinder:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Felipe Reyes (freyes)
Changed in charm-cinder:
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

Hi Drew,

I've been trying to reproduce this issue without success. I tried 2 different setups (with and without ceph).

This is how the different hooks were executed in my environment:

juju show-status-log cinder/0 http://paste.ubuntu.com/p/5B2SCCmDCx/
juju show-status-log cinder/1 http://paste.ubuntu.com/p/rWCWkZVZXq/
juju show-status-log cinder/2 http://paste.ubuntu.com/p/j7y7BZPgvv/

In case you still have the environment around it would be helpful if you could share the following information:

juju show-action-log <UNIT> # for each cinder unit.
/var/log/juju/{machine-lock.log,unit-cinder-*.log} from each cinder unit; you can grep by the date when the upgrade was run.

I will set the bug as incomplete, but please set it back to new once you have this dataset or any other information that could help us to reproduce the bug.

Best,

Changed in charm-cinder:
status: New → Incomplete
assignee: Felipe Reyes (freyes) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Cinder Charm because there has been no activity for 60 days.]

Changed in charm-cinder:
status: Incomplete → Expired
Revision history for this message
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

I have an environment facing the same issue after upgrading from Victoria => Wallaby => Xena

juju show-status-log => https://pastebin.canonical.com/p/nq9wsdRFvn/

I'll try the workaround suggested in comment #2

Note: after stopping one of the (LXD) units the environment is stable; in this case this was the workaround suggested in https://bugs.launchpad.net/charms/+source/cinder/+bug/1561927

Changed in charm-cinder:
status: Expired → New
Revision history for this message
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

Just a follow-up, after using the workaround from comment #2 the cinder units are stable.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I've done some more investigations (these are notes to help think it through). After an HA installation, the following settings are available on the 3 cinder units:

cinder/2 is the leader

root@juju-81f31a-zaza-ba0026754297-13:/var/lib/juju/agents/unit-cinder-2/charm# relation-get -r cluster:0 - cinder/0
cinder-db-initialised-echo: cinder/2-5acabdb0-dd12-46a7-a792-5a283b88fd48

root@juju-81f31a-zaza-ba0026754297-13:/var/lib/juju/agents/unit-cinder-2/charm# relation-get -r cluster:0 - cinder/1
admin-address: 172.20.1.70
cinder-db-initialised-echo: cinder/2-5acabdb0-dd12-46a7-a792-5a283b88fd48

root@juju-81f31a-zaza-ba0026754297-13:/var/lib/juju/agents/unit-cinder-2/charm# relation-get -r cluster:0 - cinder/2
admin-address: 172.20.1.157
cinder-db-initialised: cinder/2-5acabdb0-dd12-46a7-a792-5a283b88fd48

Notice that on cinder/2 (the leader) there is *no* -echo key set, whereas on the non-leader units the -echo key is set (which is expected). The only unit that sets cinder-db-initialised is the leader (again expected).
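
Roughly, the mechanism works like this (a sketch only; the relation key names match the data above, but the function names and signatures are illustrative, not taken from the charm source):

import uuid

# Relation keys as seen in the data above.
CINDER_DB_INIT_RKEY = 'cinder-db-initialised'
CINDER_DB_INIT_ECHO_RKEY = 'cinder-db-initialised-echo'

def publish_db_initialised(relation_set, local_unit):
    # Leader only: record that the DB migration ran, tagged with a fresh UUID
    # so peers can distinguish a new migration from an old one.
    marker = '{}-{}'.format(local_unit, uuid.uuid4())
    relation_set({CINDER_DB_INIT_RKEY: marker})

def echo_db_initialised(relation_get, relation_set, remote_unit):
    # Non-leader (peer): copy the marker it observes into its own -echo key to
    # acknowledge that it has seen the notification and restarted services.
    marker = relation_get(CINDER_DB_INIT_RKEY, remote_unit)
    if marker:
        relation_set({CINDER_DB_INIT_ECHO_RKEY: marker})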

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Okay, I've reproduced the issue. Essentially, I used a small deployment script:

Every 5.0s: timeout 4 juju status -m cinder --color                            Wed May 11 10:20:12 2022

Model   Controller            Cloud/Region             Version  SLA          Timestamp
cinder  tinwood2-serverstack  serverstack/serverstack  2.9.27   unsupported  10:20:12Z

App              Version  Status  Scale  Charm            Channel  Rev  Exposed  Message
cinder           19.0.0   active      3  cinder           stable   530  no       Unit is ready
keystone         17.0.1   active      1  keystone         stable   539  no       Application Ready
percona-cluster  5.7.20   active      1  percona-cluster  stable   302  no       Unit is ready
rabbitmq-server  3.8.2    active      1  rabbitmq-server  stable   123  no       Unit is ready

Unit                Workload  Agent      Machine  Public address  Ports     Message
cinder/0*           active    executing  3        10.5.3.43       8776/tcp  Unit is ready
cinder/1            active    executing  4        10.5.2.251      8776/tcp  Unit is ready
cinder/2            active    executing  5        10.5.2.67       8776/tcp  Unit is ready
keystone/0*         active    idle       0        10.5.1.134      5000/tcp  Unit is ready
percona-cluster/0*  active    idle       1        10.5.3.32       3306/tcp  Unit is ready
rabbitmq-server/0*  active    idle       2        10.5.3.182      5672/tcp  Unit is ready

- I started it at focal/distro for cinder and keystone.
- I then forced a leadership election to move the leader to a different unit (e.g. 0 -> 1).
- I then did an upgrade from distro (ussuri) -> victoria on cinder.
- Then I forced another leadership election from 1 -> 0.
- I did another upgrade (victoria -> wallaby) and it was okay.
- I then forced another leadership election to get it to 2.
- I then did an upgrade from wallaby -> xena and triggered the issue.

The show-unit output for the 3 units shows that each one has been the leader and 'done' the upgrade:

      cinder/0:
        ...
        cinder-db-initialised: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
        cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
      cinder/1:
        ...
          cinder-db-initialised: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
          cinder-db-initialised-echo: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
      cinder/2:
        ...
          cinder-db-initialised: cinder/2-71063595-9742-4950-bad6-6a1a8a5a8ab1
          cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2

i.e. cinder-db-initialised for each unit is that unit's own id with a UUID.

However, as Drew says in the comments, the cinder-db-initialised-echo value keeps bouncing around the units. In the above case, two agree (but this will change with the next hook).

The code in question is:

def check_local_db_actions_complete():
    """Check if we have received db init'd notification and restart services
    if we have not already.

    NOTE: this must only be called from peer relation context.
 ...
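
Roughly, the check compares the cinder-db-initialised marker seen from the remote unit with the marker this unit last echoed; if they differ, it restarts the cinder services and echoes the new value. The sketch below paraphrases that behaviour (it is not the verbatim charm code) and simulates why, with each unit advertising its own marker, the echo never settles:

def is_new_dbinit_notification(init_id, echoed_init_id):
    # Paraphrased sketch of the comparison described above.
    return bool(init_id) and init_id != echoed_init_id

# Tiny simulation: each unit advertises its *own* marker because each has been
# the leader at some point, so a peer alternately sees a "new" marker from
# each of its two neighbours and re-echoes (and restarts) on every hook run.
markers = {'cinder/1': 'cinder/1-9717e388', 'cinder/2': 'cinder/2-71063595'}
echoed = 'cinder/0-c19dc67e'   # what cinder/0 last wrote to the -echo key
for hook_run, peer in enumerate(['cinder/1', 'cinder/2'] * 3):
    if is_new_dbinit_notification(markers[peer], echoed):
        echoed = markers[peer]  # restart services + re-echo => peers fire again
        print('hook {}: saw new marker from {}, restarted, echo -> {}'.format(
            hook_run, peer, echoed))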


Changed in charm-cinder:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

After debugging the endless loop, the fix is to unset the "cinder-db-initialised" relation data for the non-leader units; this breaks the cycle and allows the units to settle on a single cinder-db-initialised value (the leader's).
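
As a sketch of the shape of that fix (the real change is in the review linked below; the helper names here are placeholders): a unit that is not the leader clears any cinder-db-initialised value it previously published, leaving only the leader's marker for peers to echo.

def clear_stale_dbinit_marker(is_leader, relation_set):
    # Placeholder sketch: non-leaders drop their old marker so peers converge
    # on the current leader's cinder-db-initialised value.
    if not is_leader():
        # None/empty here stands for 'unset'; how the key is actually removed
        # depends on the relation-set helper used.
        relation_set({'cinder-db-initialised': None})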

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder (master)
Changed in charm-cinder:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder (master)

Reviewed: https://review.opendev.org/c/openstack/charm-cinder/+/841410
Committed: https://opendev.org/openstack/charm-cinder/commit/a50281db835fc139912f09fedb6e00dba9f333a5
Submitter: "Zuul (22348)"
Branch: master

commit a50281db835fc139912f09fedb6e00dba9f333a5
Author: Alex Kavanagh <email address hidden>
Date: Wed May 11 12:48:41 2022 +0100

    Stop endless cycles on openstack upgrade

    If an HA cinder deployment is upgraded twice and the leadership has
    changed, there is a potential race that can be triggered because the
    CINDER_DB_INIT_RKEY is different on the units that have been the
    leader. The race is that when a unit tries to decide whether it has
    reflected the 'new' RKEY, it sees a different one from each unit. By
    clearing the rkey, only the relation-changed event from the leader
    actually causes a restart.

    Change-Id: I712502a666f23421cc58925f10b7643e3482d861
    Closes-Bug: #1928383

Changed in charm-cinder:
status: In Progress → Fix Committed