use "cluster" config instead of same "host" config value

Bug #1945239 reported by Rodrigo Barbieri
This bug affects 3 people
Affects: OpenStack Cinder Charm
Status: In Progress
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

For stateless backends and/or to enable active-active HA, it is ideal to group the cinder-volume services into a single clustered service.

This can be accomplished by either:

a) using the same "host = <name>" config value on all cinder-volume nodes

b) using the "cluster" config

However, approach (a) has drawbacks [1] and is not recommended by the upstream cinder community [0][1], while approach (b) was designed [1][2] to be used for this purpose.
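
For illustration, a minimal cinder.conf sketch of the two approaches (the values are placeholders, not what the charm renders):

    # approach (a): every cinder-volume node shares the same "host" value (not recommended)
    [DEFAULT]
    host = cinder

    # approach (b): each node keeps its own host (defaults to its hostname),
    # and all nodes share a "cluster" name
    [DEFAULT]
    cluster = cinder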

To test the impact of changing this in the cinder charm, I performed the following tests:

Before the change, with the current behavior of host = cinder (a hedged CLI sketch of these steps follows the list):

1) created an empty volume named vol1
2) created a snapshot of vol1 named snap1
3) created a volume from the cirros image named cirros_vol1
4) created a snapshot of cirros_vol1 named cirros_vol1_snap
5) created another volume from the cirros image named cirros_vol2
6) created a VM booting from cirros_vol2 named VM1
7) listed services (1 service: cinder@cinder-ceph)
8) verified VM1 is writeable
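
A hedged CLI sketch of the steps above (image, flavor and network names are assumptions, not values from the test environment):

    openstack volume create --size 1 vol1
    openstack volume snapshot create --volume vol1 snap1
    openstack volume create --size 1 --image cirros cirros_vol1
    openstack volume snapshot create --volume cirros_vol1 cirros_vol1_snap
    openstack volume create --size 1 --image cirros cirros_vol2
    openstack server create --volume cirros_vol2 --flavor m1.tiny --network private VM1
    openstack volume service list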

Then I edited cinder.conf on all 3 cinder-volume nodes, commenting out host = cinder and adding cluster = cinder. I also stopped the jujud processes on all 3 nodes so the file would not be overwritten, then restarted all 3 cinder-volume services at the same time (a command sketch follows).
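
A minimal sketch of what was run on each of the 3 nodes, assuming the usual Ubuntu package unit names (the jujud unit name varies per machine):

    # stop the juju agent so it does not rewrite cinder.conf
    sudo systemctl stop 'jujud-machine-*'
    # in /etc/cinder/cinder.conf: comment out "host = cinder" and add "cluster = cinder"
    sudo systemctl restart cinder-volume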

Tests after the change (a hedged CLI sketch of the operations not shown above follows the list):

1) created an empty volume vol2
- created successfully
- the host field is cinder@cinder-ceph#cinder-ceph, the same as vol1
2) created another snapshot of vol1
- success
3) created a VM booting from cirros_vol1 named VM2
- took a long time, but succeeded
4) verified VM2 is writeable
5) verified VM1 is still writeable
6) listed services (3 separate services with different names; cinder@cinder-ceph marked as down)
7) rebooted VM1
- success, data still there and still writeable
8) deleted VM1
- success, cirros_vol1 detached
9) created another snapshot of cirros_vol1 named cirros_vol1_snap2
- success
10) attached cirros_vol1 to VM2
- success, mounted the volume, data is there and writeable
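
Complementing the earlier sketch, the operations that appear only in this second round could look like this (flavor and network names remain assumptions):

    openstack server create --volume cirros_vol1 --flavor m1.tiny --network private VM2
    openstack server reboot VM1
    openstack server delete VM1
    openstack volume snapshot create --volume cirros_vol1 cirros_vol1_snap2
    openstack server add volume VM2 cirros_vol1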

I did not see any noticeable impact when transitioning from the host to the cluster config, other than the old service appearing as down (ideally it should be removed; a cleanup sketch follows).
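
For reference, a sketch of how that stale record could be cleaned up with cinder-manage on one of the nodes (the host value is taken from the service listing in step 7 of the first round):

    # remove the old, now-down service record left over from the host= grouping
    cinder-manage service remove cinder-volume cinder@cinder-ceph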

[0] https://meetings.opendev.org/irclogs/%23openstack-cinder/%23openstack-cinder.2021-09-07.log.html#t2021-09-07T14:40:02
[1] https://github.com/openstack/cinder-specs/blob/master/specs/newton/ha-aa-cleanup.rst
[2] https://github.com/openstack/cinder-specs/blob/master/specs/ocata/ha-aa-job-distribution.rst

Tags: sts
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder (master)
Changed in charm-cinder:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder (master)

Reviewed: https://review.opendev.org/c/openstack/charm-cinder/+/811472
Committed: https://opendev.org/openstack/charm-cinder/commit/2603a774b112dc6e25f9e7b2071124ea3d3d0012
Submitter: "Zuul (22348)"
Branch: master

commit 2603a774b112dc6e25f9e7b2071124ea3d3d0012
Author: Rodrigo Barbieri <email address hidden>
Date: Tue Sep 28 15:37:15 2021 -0300

    Group cinder-volume services using cluster config

    Usage of "host" config for grouping cinder-volume
    services for the purpose of stateless config and/or
    active-active HA is not recommended.

    The usage of "cluster" config is recommended for
    achieving that behavior. This change replaces the
    usage of "host" with "cluster".

    Change-Id: I249ce179cfdaf2394ee4e8481a8c9644d667ea99
    Closes-bug: #1945239

Changed in charm-cinder:
status: In Progress → Fix Committed
James Page (james-page)
Changed in charm-cinder:
milestone: none → 21.10
Revision history for this message
Nobuto Murata (nobuto) wrote :

This is actually too drastic a change from a charm ecosystem point of view and will break backward compatibility, although the patch itself is well intentioned and nothing is wrong from an upstream config point of view.

I'd really like to see the follow-up change get merged before this change gets into stable as part of the 21.10 release.
https://review.opendev.org/c/openstack/charm-cinder/+/814845

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Nobuto, for pure-storage they should report stateless=False for the releases where they don't support AA HA. See discussion at https://review.opendev.org/c/openstack/charm-cinder/+/814845

Revision history for this message
Nobuto Murata (nobuto) wrote :

> Hi Nobuto, for pure-storage they should report stateless=False for the releases where they don't support AA HA. See discussion at https://review.opendev.org/c/openstack/charm-cinder/+/814845

I get your point and indeed "stateless=True" was misused for some time. But the problems here are:

1. A storage backend set up by a subordinate charm (previously stateless=True) will stop working right after upgrading the cinder charm, with "Volume driver reported an error: Active-Active configuration is not currently supported by driver". That is why I called this a backward compatibility issue, even if the original value was misused.

This needs to be addressed in each charm in a coordinated way (we can't release cinder first without waiting for *all* of the supported charms to report their support status correctly), and the coordinated upgrade must be covered in the release notes, in my opinion.

2. Have you considered multi-backend scenarios with a mixture of stateless and stateful backends? Let's say focal-ussuri with Ceph (RBD) and Purestorage.

With the new behavior, if charm-purestorage reports stateless=False, what happens to the volume service status of Ceph? Previously host= was in cinder.conf, but with the new behavior neither host= nor cluster= will be in cinder.conf, while Ceph expects one of them.

I may be overly worried about the change, but I wanted to raise my concern here.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Nobuto,

For (2), backends with different clustering configs should be grouped under different cinder charm apps. For example (a juju sketch follows below):

charm: cinder, app: cinder-api
charm: cinder, app: cinder-volume-aa-drivers
charm: cinder-ceph, subordinate of cinder-volume-aa-drivers
charm: cinder, app: cinder-volume-ap-drivers
charm: cinder-purestorage, subordinate of cinder-volume-ap-drivers
charm: cinder-netapp, subordinate of cinder-volume-ap-drivers

"ap" above stands for "active-passive". "aa" for "active-active"

I have seen an environment configured as above without issues.
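
A hedged juju sketch of that topology (application names are placeholders; some relations may need explicit endpoints):

    juju deploy cinder cinder-api
    juju deploy cinder cinder-volume-aa-drivers
    juju deploy cinder-ceph
    juju add-relation cinder-ceph cinder-volume-aa-drivers
    juju deploy cinder cinder-volume-ap-drivers
    juju deploy cinder-purestorage
    juju add-relation cinder-purestorage cinder-volume-ap-drivers
    juju deploy cinder-netapp
    juju add-relation cinder-netapp cinder-volume-ap-drivers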

For (1), you are correct, and this was discussed when I was implementing the change [0]. However, we didn't consider backwards compatibility because the only cinder driver that came to our minds and would not be affected was the cinder-netapp one. We missed the cinder-purestorage one, which already existed. Let's enumerate the scenarios:

a) pure-storage deployment >= Victoria: this will be changed to use cluster instead of host but will remain in clustered mode without errors, as the driver supports AA HA from Victoria onwards

b) pure-storage deployment < Victoria: this will be changed to stateless=False and therefore the environment will be affected. Ideally, since the environment is running in an unsupported config, the operators should try to migrate to a non-clustered, active-passive setup. The backwards-compatibility workaround that I can see here is a charm config option (on the cinder-purestorage sub-charm) to override the stateless config to address this case (a command sketch follows).
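
From the operator side, that workaround could look something like the following; the option name is hypothetical and does not exist in the charm today:

    # hypothetical cinder-purestorage option that keeps the old host= grouping
    juju config cinder-purestorage force-host-grouping=true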

@Freyes is working on a patch as part of [1], and I believe that workaround could be included.

[0] https://review.opendev.org/c/openstack/charm-cinder/+/811472
[1] https://bugs.launchpad.net/charm-cinder-purestorage/+bug/1947702

Revision history for this message
Nobuto Murata (nobuto) wrote : Re: [Bug 1945239] Re: use "cluster" config instead of same "host" config value

Did we actually test purestorage, for example with stateless=False, including fail-over scenarios?

cinder-volume units associated with purestorage don't actually have any state, although the driver doesn't explicitly call out active-active support until Victoria. If we deploy it with focal-ussuri and create a volume with cinder-volume unit-a (which is chosen by the cinder scheduler), and then unit-a goes down out of the 3 units, can we still operate the volume?

I thought the specific host name of unit-a was recorded in the cinder DB if there is no common host= option.
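
For reference, a hedged way to check which backend host a given volume is pinned to (vol1 is a placeholder name):

    # prints something like <unit-host>@<backend>#<pool> when no common host= is set
    openstack volume show vol1 -c os-vol-host-attr:host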

Revision history for this message
Nobuto Murata (nobuto) wrote :

On Thu, Oct 21, 2021 at 23:35 Rodrigo Barbieri <email address hidden> wrote:

> Hi Nobuto,
>
> for (2), backends with different clustering config should be grouped in
> different cinder charm apps. For example:
>
> charm: cinder, app: cinder-api
> charm: cinder, app: cinder-volume-aa-drivers
> charm: cinder-ceph, subordinate of cinder-volume-aa-drivers
> charm: cinder, app: cinder-volume-ap-drivers
> charm: cinder-purestorage, subordinate of cinder-volume-ap-drivers
> charm: cinder-netapp, subordinate of cinder-volume-ap-drivers
>
> "ap" above stands for "active-passive". "aa" for "active-active"
>
> I have seen an environment configured as above without issues.

Is this requirement called out explicitly somewhere? It looks like another drawback compared to the previous behavior.

It might be a corner case, but an FC backend requires physical hosts, and if we want to have two backends, one with stateless=True and one with stateless=False, do we need at least 4 physical hosts just to run the cinder-volume services (2 for one backend for redundancy + 2 for the other backend)? That would probably break the design of existing deployments.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Nobuto,

We didn't test scenarios where the flag flips. We don't want this to happen. Deployments that are running purestorage with host=<same_value> shouldn't be running it this way, as per upstream recommendations. We have not investigated steps to migrate an env that was previously deployed with host=<same_value> to host=<diff_value>.

I'd assume the purestorage upstream Cinder CI would have tested active-passive failover, given that they didn't support AA HA until Victoria. For the charm, I believe we don't have functional tests covering any type of failure scenario, AA or AP.

The deployment recommendation of separate cinder charm apps for AA and AP hasn't been documented in our charm guides yet. But yes, your example is correct: you will need 4 nodes for that case because we cannot mix stateless=False and stateless=True in the same charm app.

Please let us know more details about the cases where this will impact existing envs so we can work that out.

Revision history for this message
Felipe Reyes (freyes) wrote :

On Thu, 2021-10-21 at 15:24 +0000, Rodrigo Barbieri wrote:
> Hi Nobuto,
>
> We didn't test scenarios where the flag flips. We don't want this to
> happen. Deployments that are running purestorage with
> host=<same_value>
> shouldn't be running it this way, as per upstream recommendations.

For OpenStack <=Ussuri, >=Pike we should be using the host=<value> for purestorage (the behavior until 21.04), because that driver didn't enable AA HA before Victoria; that's why it's getting into an error state.

Revision history for this message
Nobuto Murata (nobuto) wrote :

That's why I like your approach in https://review.opendev.org/c/openstack/charm-cinder/+/814845 of having a state where host= is enabled rather than cluster=, by adding a separate flag such as active_active_ha in addition to the existing "stateless" flag.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

@Freyes: I am not sure I understand your comment, because in my understanding it contradicts itself:

a) host=<value> (or not setting it at all) is for active-passive
b) (the behavior until 21.04) is host=<same_value>.

host=<same_value> is when we want to cluster the cinder-volume services in the way that is not supported/recommended by upstream.

If you are saying that we *should* be using host=<same_value> for purestorage <=Ussuri, >=Pike, this is incorrect in the case of new deployments. For existing deployments, sure, something should be done to be able to maintain the existing config (host=<same_value>) so the flag doesn't flip and cause issues (mentioned below).

@Nobuto: that patch is a good workaround to maintain the existing config for purestorage. I will develop this further in the other LP.

tags: added: sts
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-cinder (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-cinder/+/815087

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-cinder (master)

Reviewed: https://review.opendev.org/c/openstack/charm-cinder/+/815087
Committed: https://opendev.org/openstack/charm-cinder/commit/c5896a884a54d0171415d79a82a284e2ff6b2ca9
Submitter: "Zuul (22348)"
Branch: master

commit c5896a884a54d0171415d79a82a284e2ff6b2ca9
Author: Billy Olsen <email address hidden>
Date: Thu Oct 21 18:32:07 2021 -0700

    Revert "Group cinder-volume services using cluster config"

    This reverts commit 2603a774b112dc6e25f9e7b2071124ea3d3d0012, which
    converted cinder to use cluster configuration for stateless services.
    While well-intentioned, this causes problems due to the lack of a
    proper migration for those backends which are not able to support
    ACTIVE-ACTIVE configuration.

    Configurations that do not support ACTIVE_ACTIVE driver will have the
    volume service fail to start, which will cause a service outage.

    Change-Id: I7bf4baaf80e5bb58b5c1cf55a2065f3bc50dbced
    Related-Bug: #1945239

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-cinder (stable/21.10)

Related fix proposed to branch: stable/21.10
Review: https://review.opendev.org/c/openstack/charm-cinder/+/814939

Changed in charm-cinder:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-cinder (stable/21.10)

Reviewed: https://review.opendev.org/c/openstack/charm-cinder/+/814939
Committed: https://opendev.org/openstack/charm-cinder/commit/47d76283c5c7c121f39c80aa6527b39bd7e6a5dd
Submitter: "Zuul (22348)"
Branch: stable/21.10

commit 47d76283c5c7c121f39c80aa6527b39bd7e6a5dd
Author: Billy Olsen <email address hidden>
Date: Thu Oct 21 18:32:07 2021 -0700

    Revert "Group cinder-volume services using cluster config"

    This reverts commit 2603a774b112dc6e25f9e7b2071124ea3d3d0012, which
    converted cinder to use cluster configuration for stateless services.
    While well-intentioned, this causes problems due to the lack of a
    proper migration for those backends which are not able to support
    ACTIVE-ACTIVE configuration.

    Configurations that do not support ACTIVE_ACTIVE driver will have the
    volume service fail to start, which will cause a service outage.

    Change-Id: I7bf4baaf80e5bb58b5c1cf55a2065f3bc50dbced
    Related-Bug: #1945239
    (cherry picked from commit c5896a884a54d0171415d79a82a284e2ff6b2ca9)

Revision history for this message
Felipe Reyes (freyes) wrote :

Setting this back to 'In Progress' since we reverted the patch due to the regressions that were found.

Changed in charm-cinder:
status: Fix Released → In Progress
milestone: 21.10 → none