infinite cluster-relation-changed loop when upgrading 21.04 charm-cinder from bionic-queens to rocky

Bug #1928383 reported by Drew Freiberger
This bug affects 2 people
Affects                 Status        Importance  Assigned to    Milestone
OpenStack Cinder Charm  Status tracked in Trunk
  2023.1                Fix Released  Undecided   Unassigned
  Train                 Triaged       Undecided   Unassigned
  Trunk                 Fix Released  High        Alex Kavanagh
  Ussuri                Triaged       Undecided   Unassigned
  Victoria              Triaged       Undecided   Unassigned
  Wallaby               Triaged       Undecided   Unassigned
  Xena                  Triaged       Undecided   Unassigned
  Yoga                  Triaged       Undecided   Unassigned
  Zed                   Fix Released  Undecided   Unassigned

Bug Description

I am performing a non-action-managed upgrade of the OpenStack API charms from the bionic distro origin (after a xenial-queens -> bionic series upgrade), on the 21.04 OpenStack charms.

I have a 3 unit cinder deployment with typical hacluster and cinder-ceph subordinates.

When I run 'juju config cinder openstack-origin=cloud:bionic-rocky action-managed-upgrade=false', the upgrade appears to update all of the packages properly and config-changed succeeds. Then I see a never-ending cascade of cluster-relation-changed hooks, which appears to be driven by relation data changes from each of the units:

While one cluster-relation-changed hook was running, I captured the 'juju show-unit' output for cinder/1 (one of the non-leaders); the relation data for "cinder-db-initialised-echo" is constantly changing.

Here are two iterations of the data that differed between successive 'juju show-unit' captures:
         cinder-db-initialised-echo: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
         cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774
         cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774

         cinder-db-initialised-echo: cinder/1-c1de47c8-dd95-43d6-8e6e-c5a11ef4d50f
         cinder-db-initialised-echo: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
         cinder-db-initialised-echo: cinder/1-c1de47c8-dd95-43d6-8e6e-c5a11ef4d50f

More context of this data from juju show-unit cinder/0:
    cinder/0:
      data:
        cinder-db-initialised: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774
        cinder-db-initialised-echo: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
    related-units:
      cinder/1:
        in-scope: true
        data:
          cinder-db-initialised: cinder/1-c1de47c8-dd95-43d6-8e6e-c5a11ef4d50f
          cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774
      cinder/2:
        in-scope: true
        data:
          cinder-db-initialised: cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881
          cinder-db-initialised-echo: cinder/0-40bb8ddd-488c-4562-883b-10862fd62774

It looks like there was a race that happened with the database initialisation.
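
For reference, the flapping can be watched from a client machine by polling 'juju show-unit'; the following is a rough sketch in Python driving the juju CLI (the unit name and poll interval are only examples, not from the report):

# Rough sketch: poll 'juju show-unit' and print the echo key whenever it
# changes; the unit name and interval below are examples only.
import subprocess
import time

def echo_lines(unit='cinder/1'):
    out = subprocess.run(['juju', 'show-unit', unit],
                         capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines()
            if 'cinder-db-initialised-echo' in line]

last = None
while True:
    current = echo_lines()
    if current != last:
        print(time.strftime('%H:%M:%S'))
        for line in current:
            print(' ', line)
        last = current
    time.sleep(5)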

Revision history for this message
Drew Freiberger (afreiberger) wrote :

subscribing field-critical as this is actively causing a volume service outage on a production cloud.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Worked around this by hard-setting the relation data for the initialised keys on every unit to the value defined by the leader unit:

$ for i in 0 1 2; do juju run -u cinder/$i -- relation-set -r cluster:4 cinder-db-initialised=cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881 cinder-db-initialised-echo=cinder/2-0cb907fd-aa11-4486-a471-e000a1f4e881; done

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I'm about to do this tomorrow with ServerStack from queens to rocky. I'm going to do an action-managed-upgrade to see/verify whether it has the same behaviour. If not, then this might be the recommended workaround for cinder; I'll update here after my upgrade.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Since a workaround is available, dropping this to field-high, though this remains an important bug.

@afreiberger is it possible to get some sanitized logs for this?

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

If I'm reading this correctly, this upgrade has missed one of the key pre-upgrade requirements (https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/upgrade-series.html#pre-upgrade-requirements): "All currently deployed charms should be upgraded to the latest stable charm revision." Over the last two years, many improvements have been made to the charms with regard to how they handle upgrades, whether of the charms themselves, of OpenStack, or of Ubuntu.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I completely failed to read the 21.04 bit in the description, apologies!

James Troup (elmo)
tags: added: openstack-upgrade
Changed in charm-cinder:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-cinder:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Felipe Reyes (freyes)
Changed in charm-cinder:
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

Hi Drew,

I've been trying to reproduce this issue without success. I tried 2 different setups (with and without ceph).

This is how the different hooks were executed in my environment:

juju show-status-log cinder/0 http://paste.ubuntu.com/p/5B2SCCmDCx/
juju show-status-log cinder/1 http://paste.ubuntu.com/p/rWCWkZVZXq/
juju show-status-log cinder/2 http://paste.ubuntu.com/p/j7y7BZPgvv/

In case you still have the environment around it would be helpful if you could share the following information:

juju show-action-log <UNIT> # for each cinder unit.
/var/log/juju/{machine-lock.log,unit-cinder-*.log} from each cinder unit; you can grep by the date when the upgrade was run.

I will set the bug as incomplete, but please set it back to new once you have this dataset or any other information that could help us to reproduce the bug.

Best,

Changed in charm-cinder:
status: New → Incomplete
assignee: Felipe Reyes (freyes) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Cinder Charm because there has been no activity for 60 days.]

Changed in charm-cinder:
status: Incomplete → Expired
Revision history for this message
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

I have an environment facing the same issue after upgrading from Victoria => Wallaby => Xena

juju show-status-log => https://pastebin.canonical.com/p/nq9wsdRFvn/

I'll try the workaround suggested in comment #2

Note: after stopping one of the (LXD) units the environment is stable; in this case this was the workaround suggested in https://bugs.launchpad.net/charms/+source/cinder/+bug/1561927

Changed in charm-cinder:
status: Expired → New
Revision history for this message
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

Just a follow-up, after using the workaround from comment #2 the cinder units are stable.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I've done some more investigations (these are notes to help think it through). After an HA installation, the following settings are available on the 3 cinder units:

cinder/2 is the leader

root@juju-81f31a-zaza-ba0026754297-13:/var/lib/juju/agents/unit-cinder-2/charm# relation-get -r cluster:0 - cinder/0
cinder-db-initialised-echo: cinder/2-5acabdb0-dd12-46a7-a792-5a283b88fd48

root@juju-81f31a-zaza-ba0026754297-13:/var/lib/juju/agents/unit-cinder-2/charm# relation-get -r cluster:0 - cinder/1
admin-address: 172.20.1.70
cinder-db-initialised-echo: cinder/2-5acabdb0-dd12-46a7-a792-5a283b88fd48

root@juju-81f31a-zaza-ba0026754297-13:/var/lib/juju/agents/unit-cinder-2/charm# relation-get -r cluster:0 - cinder/2
admin-address: 172.20.1.157
cinder-db-initialised: cinder/2-5acabdb0-dd12-46a7-a792-5a283b88fd48

Notice that on cinder/2 (the leader) there is *no* -echo key set, whereas on the non-leader units the -echo key is set (which is expected). The only unit that sets cinder-db-initialised is the leader (again expected).
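
Roughly, the mechanism works like this (a sketch only; the relation key names match the data above, but the function names and signatures are illustrative, not taken from the charm source):

import uuid

# Relation keys as seen in the data above.
CINDER_DB_INIT_RKEY = 'cinder-db-initialised'
CINDER_DB_INIT_ECHO_RKEY = 'cinder-db-initialised-echo'

def publish_db_initialised(relation_set, local_unit):
    # Leader only: record that the DB migration ran, tagged with a fresh UUID
    # so peers can distinguish a new migration from an old one.
    marker = '{}-{}'.format(local_unit, uuid.uuid4())
    relation_set({CINDER_DB_INIT_RKEY: marker})

def echo_db_initialised(relation_get, relation_set, remote_unit):
    # Non-leader (peer): copy the marker it observes into its own -echo key to
    # acknowledge that it has seen the notification and restarted services.
    marker = relation_get(CINDER_DB_INIT_RKEY, remote_unit)
    if marker:
        relation_set({CINDER_DB_INIT_ECHO_RKEY: marker})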

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Okay, I've reproduced the issue. Essentially, I used a small deployment script:

Every 5.0s: timeout 4 juju status -m cinder --color                            Wed May 11 10:20:12 2022

Model   Controller            Cloud/Region             Version  SLA          Timestamp
cinder  tinwood2-serverstack  serverstack/serverstack  2.9.27   unsupported  10:20:12Z

App              Version  Status  Scale  Charm            Channel  Rev  Exposed  Message
cinder           19.0.0   active      3  cinder           stable   530  no       Unit is ready
keystone         17.0.1   active      1  keystone         stable   539  no       Application Ready
percona-cluster  5.7.20   active      1  percona-cluster  stable   302  no       Unit is ready
rabbitmq-server  3.8.2    active      1  rabbitmq-server  stable   123  no       Unit is ready

Unit                Workload  Agent      Machine  Public address  Ports     Message
cinder/0*           active    executing  3        10.5.3.43       8776/tcp  Unit is ready
cinder/1            active    executing  4        10.5.2.251      8776/tcp  Unit is ready
cinder/2            active    executing  5        10.5.2.67       8776/tcp  Unit is ready
keystone/0*         active    idle       0        10.5.1.134      5000/tcp  Unit is ready
percona-cluster/0*  active    idle       1        10.5.3.32       3306/tcp  Unit is ready
rabbitmq-server/0*  active    idle       2        10.5.3.182      5672/tcp  Unit is ready

- I started it at focal/distro for cinder and keystone.
- I then forced a leadership election to move the leader to a different unit (e.g. 0 -> 1).
- I then did an upgrade from distro (ussuri) -> victoria on cinder.
- Then I forced another leadership election from 1 -> 0.
- I did another upgrade (victoria -> wallaby) and it was okay.
- I then forced another leadership election to get it to 2.
- I then did an upgrade from wallaby -> xena and triggered the issue.

The show-unit output for the 3 units shows that each one has been the leader and 'done' the upgrade:

      cinder/0:
        ...
        cinder-db-initialised: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
        cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
      cinder/1:
        ...
          cinder-db-initialised: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
          cinder-db-initialised-echo: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
      cinder/2:
        ...
          cinder-db-initialised: cinder/2-71063595-9742-4950-bad6-6a1a8a5a8ab1
          cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2

i.e. cinder-db-initialised for each unit is that unit's own id with a UUID.

However, as Drew says in the comments, the cinder-db-initialised-echo value keeps bouncing around the units. In the above case, two agree (but this will change with the next hook).

The code in question is:

def check_local_db_actions_complete():
    """Check if we have received db init'd notification and restart services
    if we have not already.

    NOTE: this must only be called from peer relation context.
 ...
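
Roughly, the check compares the cinder-db-initialised marker seen from the remote unit with the marker this unit last echoed; if they differ, it restarts the cinder services and echoes the new value. The sketch below paraphrases that behaviour (it is not the verbatim charm code) and simulates why, with each unit advertising its own marker, the echo never settles:

def is_new_dbinit_notification(init_id, echoed_init_id):
    # Paraphrased sketch of the comparison described above.
    return bool(init_id) and init_id != echoed_init_id

# Tiny simulation: each unit advertises its *own* marker because each has been
# the leader at some point, so a peer alternately sees a "new" marker from
# each of its two neighbours and re-echoes (and restarts) on every hook run.
markers = {'cinder/1': 'cinder/1-9717e388', 'cinder/2': 'cinder/2-71063595'}
echoed = 'cinder/0-c19dc67e'   # what cinder/0 last wrote to the -echo key
for hook_run, peer in enumerate(['cinder/1', 'cinder/2'] * 3):
    if is_new_dbinit_notification(markers[peer], echoed):
        echoed = markers[peer]  # restart services + re-echo => peers fire again
        print('hook {}: saw new marker from {}, restarted, echo -> {}'.format(
            hook_run, peer, echoed))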


Changed in charm-cinder:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

After debugging the endless loop, the fix is to unset the "cinder-db-initialised" relation data for the non-leader units; this breaks the cycle and allows the units to settle on a single cinder-db-initialised value (the leader's).
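
As a sketch of the shape of that fix (the real change is in the review linked below; the helper names here are placeholders): a unit that is not the leader clears any cinder-db-initialised value it previously published, leaving only the leader's marker for peers to echo.

def clear_stale_dbinit_marker(is_leader, relation_set):
    # Placeholder sketch: non-leaders drop their old marker so peers converge
    # on the current leader's cinder-db-initialised value.
    if not is_leader():
        # None/empty here stands for 'unset'; how the key is actually removed
        # depends on the relation-set helper used.
        relation_set({'cinder-db-initialised': None})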

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder (master)
Changed in charm-cinder:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder (master)

Reviewed: https://review.opendev.org/c/openstack/charm-cinder/+/841410
Committed: https://opendev.org/openstack/charm-cinder/commit/a50281db835fc139912f09fedb6e00dba9f333a5
Submitter: "Zuul (22348)"
Branch: master

commit a50281db835fc139912f09fedb6e00dba9f333a5
Author: Alex Kavanagh <email address hidden>
Date: Wed May 11 12:48:41 2022 +0100

    Stop endless cycles on openstack upgrade

    If an HA cinder deployment is upgraded twice and the leadership has
    changed, there is a potential race that can be triggered because the
    CINDER_DB_INIT_RKEY is different on the units that have been the
    leader. The race is that when a unit tries to decide whether it has
    reflected the 'new' RKEY, it sees a different one from each unit. By
    clearing the rkey, only the relation-changed event from the leader
    actually causes a restart.

    Change-Id: I712502a666f23421cc58925f10b7643e3482d861
    Closes-Bug: #1928383

Changed in charm-cinder:
status: In Progress → Fix Committed