OpenStack Charms Deployment Guide

[series-upgrade] "Series upgrade OpenStack" is wrong with respect to which unit to upgrade first

Bug #1934764 reported by Alex Kavanagh on 2021-07-06

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Charms Deployment Guide	Fix Released	High	Peter Matulis

Bug Description

In HA, the guide indicates in the "Generalised OpenStack series upgrade" section:

The steps are as follows:

Set the default series for the principal application and ensure the same has been done to the model.

If hacluster is used, pause the hacluster units not associated with the principal leader machine.

Pause the principal non-leader units.

Perform a series upgrade on the principal leader machine.

If the operator does this then the service will be taken off-line.

In reality, the remaining machine (the one that is not paused) holds the VIP and continues to provide the service. The operator should upgrade the two paused machines first; once both are back on-line, one of them will claim the VIP. The principal unit on the third machine can then be paused and upgraded.

In this way, service can be maintained during an upgrade.
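
The ordering proposed above can be sketched as a small helper. This is only an illustration of the description in this report, not code from the guide or the charms: the function name `series_upgrade_order` and the unit names are hypothetical, and it assumes the operator already knows which unit currently holds the VIP.

```python
from typing import List


def series_upgrade_order(units: List[str], vip_holder: str) -> List[str]:
    """Return the units in the order proposed in this report.

    Units that do not hold the VIP are upgraded first; the VIP holder,
    which keeps serving requests in the meantime, goes last.
    """
    if vip_holder not in units:
        raise ValueError(f"{vip_holder!r} is not one of the units")
    others = [u for u in units if u != vip_holder]
    return others + [vip_holder]


# Hypothetical three-unit HA application; keystone/1 is assumed to hold the VIP.
order = series_upgrade_order(
    ["keystone/0", "keystone/1", "keystone/2"], vip_holder="keystone/1"
)
print(order)
```

The helper only decides the order; the pause, series-upgrade and resume steps for each machine are still carried out as described in the guide.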

Revision history for this message

Peter Matulis (petermatulis) wrote on 2021-07-14:

Alex, why do you say that after bringing the paused (and now upgraded) units back online one of them will claim the VIP? Wouldn't that happen only once:

1. the hacluster units associated with the now-upgraded principal units are resumed

and

2. the hacluster unit associated with the remaining (non-upgraded) unit is paused

Then the latter unit can be paused and upgraded?

Alex Kavanagh (ajkavanagh) wrote on 2021-07-14:

Peter, thanks for picking this up. So referring to your two questions:

> the hacluster units associated with the now-upgraded principal units are resumed

With series-upgrade, the post-series-upgrade hook of hacluster (and of the principal) automatically resumes the unit. When 2 units have been upgraded, one of them automatically claims the VIP.

> the hacluster unit associated with the remaining (non-upgraded) unit is paused

Nope, because hacluster automatically takes the cluster offline when any of the hacluster units runs a pre-series-upgrade hook. So at that point no unit is claiming the VIP; it just 'stays' where it is.

i.e. when the first unit is paused, the entire cluster is disabled and the VIP stays where it is.

So the order of operations (with hacluster) is:

a) Pause any unit + hacluster - if it had the VIP, it's handed off to one of the other units.
b) Pause another unit + hacluster - if it got the VIP, it is passed to the remaining unit.
c) pre-series-upgrade on a paused unit - hacluster is disabled for the application on ALL units. No VIP transfers will take place until two new upgraded units are available. Hope the 3rd unit stays up.

Note the remaining unit doesn't need to be the 'leader'. hacluster ensures that an unpaused unit gets the VIP.

The key information here is that the pre-series-upgrade hook on hacluster DISABLES hacluster for all units (regardless of whether they are paused or not). Therefore, for the duration of the series upgrade, the VIP can't move, and it requires two units to be upgraded before the VIP can move again.
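
The behaviour described above can be illustrated with a toy state model. This is a sketch of the description in this comment only, not hacluster or Pacemaker code: the class `ToyHaCluster`, its methods and the unit names are all hypothetical, and the rule for which upgraded unit claims the VIP (the first one in the list) is an arbitrary assumption.

```python
from typing import List, Optional, Set


class ToyHaCluster:
    """Toy model of the VIP behaviour described in this comment."""

    def __init__(self, units: List[str], vip_holder: str) -> None:
        self.units = list(units)
        self.paused: Set[str] = set()
        self.upgraded: Set[str] = set()
        self.vip: Optional[str] = vip_holder
        self.disabled = False  # True once any pre-series-upgrade hook has run

    def pause(self, unit: str) -> None:
        self.paused.add(unit)
        # While the cluster is enabled, a paused VIP holder hands the VIP
        # to the first unit that is still running.
        if not self.disabled and self.vip == unit:
            running = [u for u in self.units if u not in self.paused]
            if running:
                self.vip = running[0]

    def pre_series_upgrade(self, unit: str) -> None:
        # The hook on any one unit disables the cluster for ALL units,
        # so the VIP cannot move from this point on.
        self.disabled = True

    def post_series_upgrade(self, unit: str) -> None:
        # The hook resumes the unit automatically.
        self.paused.discard(unit)
        self.upgraded.add(unit)
        # The VIP can move again only once two units have been upgraded.
        if len(self.upgraded) >= 2:
            self.disabled = False
            if self.vip not in self.upgraded:
                self.vip = next(u for u in self.units if u in self.upgraded)


# Steps a) to c) for a hypothetical three-unit application.
cluster = ToyHaCluster(
    ["keystone/0", "keystone/1", "keystone/2"], vip_holder="keystone/0"
)
cluster.pause("keystone/0")  # a) VIP is handed off
cluster.pause("keystone/1")  # b) VIP ends up on the remaining unit
cluster.pre_series_upgrade("keystone/0")  # c) cluster disabled for all units
cluster.post_series_upgrade("keystone/0")
vip_after_first_upgrade = cluster.vip
cluster.pre_series_upgrade("keystone/1")
cluster.post_series_upgrade("keystone/1")
print(vip_after_first_upgrade, cluster.vip)
```

In this model the VIP stays on the third, non-upgraded unit until the second upgrade completes, which is why that unit has to stay up for the whole window.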

Also, there is no way to series upgrade a single unit and leave 2 units providing the service, as there's no (charm) method of moving the VIP between remaining units.

Hope that helps.

Peter Matulis (petermatulis) on 2021-11-02

Changed in charm-deployment-guide:
assignee:	nobody → Peter Matulis (petermatulis)
importance:	Undecided → High
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2021-11-02: Fix proposed to charm-deployment-guide (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-deployment-guide/+/816395

OpenStack Infra (hudson-openstack) wrote on 2021-11-04: Fix merged to charm-deployment-guide (master)

Reviewed: https://review.opendev.org/c/openstack/charm-deployment-guide/+/816395
Committed: https://opendev.org/openstack/charm-deployment-guide/commit/3f969eff1f4ef049ed8cd862000a12475b71cc6e
Submitter: "Zuul (22348)"
Branch: master

commit 3f969eff1f4ef049ed8cd862000a12475b71cc6e
Author: Peter Matulis <email address hidden>
Date: Tue Nov 2 15:14:45 2021 -0400

Fix lp1934764 - series upgrade order

Closes-Bug: #1934764
Change-Id: I06f72fc03c5a65f89a4b01f783beed4450e502d3

Changed in charm-deployment-guide:
status:	In Progress → Fix Released

Alex Kavanagh (ajkavanagh) wrote on 2021-12-13:

Somewhere along the line, this broke rather badly. It turns out to be a bit random whether the VIP keeps working on the remaining machine; when it does not, the API service goes down during the series upgrade. It would appear there is no way of guaranteeing API availability during a series upgrade, and planned downtime is required.

Billy Olsen (billy-olsen) wrote on 2021-12-13:

@ajkavanagh, is this largely due to the fact that the underlying pacemaker services undergo a major version upgrade, which does not support rolling cluster upgrades?

OpenStack Infra (hudson-openstack) wrote on 2022-01-17: Related fix merged to charm-deployment-guide (master)

Reviewed: https://review.opendev.org/c/openstack/charm-deployment-guide/+/818848
Committed: https://opendev.org/openstack/charm-deployment-guide/commit/a655378cca96f8f2e518eca6fa5a32cf88a4626b
Submitter: "Zuul (22348)"
Branch: master