Ceph Pacific -> Quincy Upgrade Results in Octopus Mons

Bug #2026651 reported by Boris Lukashev
Affects: Ceph Monitor Charm
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned

Bug Description

We use Juju-managed Ceph with Kolla-Ansible OpenStack: we deployed Pacific onto Focal using Juju + MAAS, threw a Xena Kolla deployment on top of it, QA'd, and then proceeded to upgrade Ceph to Quincy per the upgrade guide (charms -> OS -> Ceph).

After following the upgrade procedure (including the OS), all Ceph components now show as Quincy _except for the mons_, which somehow show 15.2.0 Octopus from `ceph -v` inside the mon LXDs despite the charm channel clearly showing quincy/stable. All charms were upgraded using the new syntax: `juju refresh --switch ch:ceph-mon --channel quincy/stable ceph-mon`.
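For completeness, the per-charm refresh step looked roughly like the sketch below; only ceph-mon is quoted above, and the ceph-osd application name is assumed to match the bundle:
```
# Refresh each Ceph charm to the quincy/stable channel (mons first, then OSDs).
# Application names other than ceph-mon are assumptions; adjust to the bundle.
juju refresh --switch ch:ceph-mon --channel quincy/stable ceph-mon
juju refresh --switch ch:ceph-osd --channel quincy/stable ceph-osd

# Confirm the channel/revision each application reports afterwards.
juju status ceph-mon ceph-osd
```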

The debugging problem here is that I don't know whether the original Juju bundle installed Octopus instead of Pacific for the mons (it has always deployed Pacific previously; maybe a new Juju version changed something?), despite the YAML of the original deployment being:
```
applications:
  ceph-mon:
    charm: ch:ceph-mon
    channel: pacific/stable
```

Ceph seems to be working according to the CLI, but this doesn't seem safe/proper/sane.
1. How do I get Ceph into a consistent state of all-Quincy? (Refreshing again tells me the charms are up to date.)
2. How in the world does this even happen when there was no Octopus deployment in play whatsoever?

Revision history for this message
Boris Lukashev (rageltman) wrote :

Digging into the logs of the upgrade (juju refresh --switch ch:ceph-mon --channel quincy/stable ceph-mon), it's clearly confused:
```
unit-ceph-mon-1: 17:35:09 INFO unit.ceph-mon/1.juju-log old_version: octopus
unit-ceph-mon-1: 17:35:09 INFO unit.ceph-mon/1.juju-log new_version: octopus
unit-ceph-mon-1: 17:35:09 ERROR unit.ceph-mon/1.juju-log Invalid upgrade path from octopus to octopus. Valid paths are: ['firefly -> hammer', 'hammer -> jewel', 'jewel -> luminous', 'luminous -> mimic', 'mimic -> nautilus', 'nautilus -> octopus', 'octopus -> pacific', 'pacific -> quincy']
```

Trying to force an upgrade to Pacific via `juju refresh --switch ch:ceph-mon --channel pacific/stable ceph-mon` STILL produces the same output:
```
unit-ceph-mon-2: 17:36:53 INFO unit.ceph-mon/2.juju-log old_version: octopus
unit-ceph-mon-2: 17:36:53 INFO unit.ceph-mon/2.juju-log new_version: octopus
unit-ceph-mon-2: 17:36:53 ERROR unit.ceph-mon/2.juju-log Invalid upgrade path from octopus to octopus. Valid paths are: ['firefly -> hammer', 'hammer -> jewel', 'jewel -> luminous', 'luminous -> mimic', 'mimic -> nautilus', 'nautilus -> octopus']
```

The host OSes are all now on Jammy (per the upgrade docs; MAAS originally deployed them on Focal), but these LXDs are stuck on Focal according to their internal /etc/lsb-release and are apparently using that distro's native packages internally:
```
ubuntu@juju-70c262-0-lxd-0:~$ dpkg -l|grep ceph
ii ceph 15.2.17-0ubuntu0.20.04.3 amd64 distributed storage and file system
ii ceph-base 15.2.17-0ubuntu0.20.04.3 amd64 common ceph daemon libraries and management tools
ii ceph-common 15.2.17-0ubuntu0.20.04.3 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-mds 15.2.17-0ubuntu0.20.04.3 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 15.2.17-0ubuntu0.20.04.3 amd64 manager for the ceph distributed file system
ii ceph-mgr-modules-core 15.2.17-0ubuntu0.20.04.3 all ceph manager modules which are always enabled
ii ceph-mon 15.2.17-0ubuntu0.20.04.3 amd64 monitor server for the ceph storage system
ii ceph-osd 15.2.17-0ubuntu0.20.04.3 amd64 OSD server for the ceph storage system
ii libcephfs2 15.2.17-0ubuntu0.20.04.3 amd64 Ceph distributed file system client library
ii python3-ceph-argparse 15.2.17-0ubuntu0.20.04.3 amd64 Python 3 utility libraries for Ceph CLI
ii python3-ceph-common 15.2.17-0ubuntu0.20.04.3 all Python 3 utility libraries for Ceph
ii python3-cephfs 15.2.17-0ubuntu0.20.04.3 amd64 Python 3 libraries for the Ceph libcephfs library
```
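For comparison, here is a quick way (a sketch; run from any mon unit with admin credentials) to cross-check what the cluster itself thinks it is running against what the LXD has installed:
```
# Per-daemon running versions as reported by the cluster itself.
sudo ceph versions

# Locally installed Ceph binaries/packages inside this LXD.
ceph -v
dpkg -l | grep ceph
```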

I'm starting to think that Juju 3.x might not play well with things that Juju 2.x executed perfectly well. :-\

Revision history for this message
Boris Lukashev (rageltman) wrote (last edit ):

Looks like the "source" config attribute is causing this: the YAML had it set to "distro" since the old charms were deprecated. Attempting to change the config breaks the hooks, and then the charm can't tell what version it had or which version it's going to.
```
unit-ceph-mon-1: 20:44:21 ERROR unit.ceph-mon/1.juju-log Invalid upgrade path from None to None. Valid paths are: ['firefly -> hammer', 'hammer -> jewel', 'jewel -> luminous', 'luminous -> mimic', 'mimic -> nautilus', 'nautilus -> octopus']
```
and then the config breaks entirely, and the cluster ends up in an even weirder state:
```
unit-ceph-mon-1: 21:26:42 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-ceph-mon-0: 21:26:42 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-ceph-mon-0: 21:26:42 INFO unit.ceph-mon/0.juju-log old_version: None
unit-ceph-mon-0: 21:26:42 INFO unit.ceph-mon/0.juju-log new_version: None
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed Traceback (most recent call last):
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/config-changed", line 1351, in <module>
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed hooks.execute(sys.argv)
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/charmhelpers/core/hookenv.py", line 962, in execute
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed self._hooks[hook_name]()
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/charmhelpers/contrib/hardening/harden.py", line 93, in _harden_inner2
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed return f(*args, **kwargs)
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/config-changed", line 248, in config_changed
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed check_for_upgrade()
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/config-changed", line 144, in check_for_upgrade
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed old_version_os < new_version_os):
unit-ceph-mon-0: 21:26:42 WARNING unit.ceph-mon/0.config-changed TypeError: '<' not supported between instances of 'NoneType' and 'str'
unit-ceph-mon-0: 21:26:43 ERROR juju.worker.uniter.operation hook "config-changed" (via explicit, bespoke hook script) failed: exit status 1
```
^^ This fine mess requires manually editing /var/lib/juju/agents/unit-ceph-mon-*/charm/.juju-persistent-config to restore the "source" attribute, or the hook scripts "pile up" until that's fixed.

It looks like the ceph-mon charm _cannot convert from "distro" to "cloud:focal-xena"_ without somewhat silently breaking.
This doesn't bode too well for the production deployment, which is using the old Juju charmers' charms that are now unsupported.
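For the record, the config change that triggers all of this is just flipping the "source" key; a sketch of what was run (the UCA pocket shown is the one this deployment targets):
```
# Switch the package source from the Ubuntu archive ("distro") to the Xena
# UCA pocket; this is the change that sends the config-changed hook into error.
juju config ceph-mon source=cloud:focal-xena
```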

Changed in charm-ceph-mon:
status: New → Confirmed
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Thanks Boris for reporting.

Were you able to get your ceph-mon units upgraded successfully by setting `source:quincy` in the end?

Just for the record, it _should_ be possible to get out of the error state without having to edit .juju-persistent-config by running `juju resolved --no-retry ceph-mon/x`

We're now defaulting to `source: quincy` instead of `distro` and have also made some improvements to version detection (cf. https://review.opendev.org/c/openstack/charm-ceph-mon/+/887733), which hopefully make the upgrade experience a little saner.
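In other words, something along these lines should recover an errored unit and retry the change (the unit number is just an example):
```
# Clear the failed config-changed hook without re-running it.
juju resolved --no-retry ceph-mon/1

# Once the unit settles, retry with the new default source value.
juju config ceph-mon source=quincy
```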

Revision history for this message
Boris Lukashev (rageltman) wrote :

Apparently it's possible to get out of this mess by manually upgrading the LXDs to 22.04, which somehow upgrades Ceph directly from Octopus to Quincy via the "distro" source. This "seems unsafe" given that Ceph is supposed to go through a Pacific stage on the mons before becoming Quincy. Quite confused about how this "distro" source is supposed to work. :-\
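Roughly what "manually" meant here, per mon LXD (a sketch assuming the stock release-upgrade path, done one mon at a time):
```
# Inside each ceph-mon LXD, one at a time:
sudo apt update && sudo apt full-upgrade -y
sudo do-release-upgrade      # Focal -> Jammy
ceph -v                      # afterwards reports 17.x (Quincy) from the Jammy archive
```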

Revision history for this message
Boris Lukashev (rageltman) wrote :

@Peter Sabaini: pardon, I didn't see the response when I posted my update.
I did not try "quincy" as the source, but the "distro" one apparently ran the upgrade, as the versions are now showing 17 across the board. It seems to have skipped Pacific altogether, which seems somewhat unnatural for a Ceph cluster, especially after the weird upgrade mess that produced Quincy OSDs on an Octopus mon set. I'll try to qualify the normal OpenStack storage actions, but having fought through messed-up clusters in other places, I'm somewhat loath to let one grow up in here with any production workload unless I can properly test and QA it.

Revision history for this message
Boris Lukashev (rageltman) wrote :

@Peter Sabaini: changing the source to "quincy" breaks package versioning: "performing upgrade to None" is the status message, and the units don't go into an error state. Same silent breakage I describe in the original issue.

Digging into how Juju manages "Ceph state" is a bit confounding: having a local file on the target system "hold state" does not look very safe. And with all the disparities between sources, channels, and Juju's own behavior over the last couple of years, there's not a lot of confidence in performing these upgrades on production targets that were deployed with sources that were live two years ago, have since been deprecated, and then apparently had the new source superseded, all without the ability to safely switch between the various sources with state awareness.

Revision history for this message
Wesley Hershberger (whershberger) wrote :

Hi,

It looks like the change to `source: quincy` breaks charm deployments where base=ubuntu@20.04. To reproduce:

`juju deploy --base=ubuntu@20.04 --channel=quincy/stable ceph-mon`

This shows up in juju debug-log:

unit-ceph-mon-0: 15:41:52 INFO unit.ceph-mon/0.juju-log Unknown source: 'quincy'

And `juju status` reports that Octopus was installed, not Quincy:

ceph-mon 15.2.17 blocked 1 ceph-mon quincy/stable 195 no Insufficient peer units to bootstrap cluster (require 3)

I've reproduced this in an OpenStack cloud and in an LXD container within that cloud. I had some (probably unrelated) issues reproducing it on a manually provisioned machine, but the charm got far enough to try installing Ceph from the Focal repos instead of the UCA.

Is it possible that source should be set to 'yoga' instead of 'quincy', as it is in ceph-osd [1]?

[1] https://charmhub.io/ceph-osd/configure#source

Revision history for this message
Luciano Lo Giudice (lmlogiudice) wrote :

Hello all,

Indeed, the OpenStack libs don't recognize Ceph releases when specifying the charm source. Instead, we have to use an OpenStack release (yoga in this case).

I'll prepare a patch for ceph-mon.
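In the meantime, the workaround is to give the charm an OpenStack/UCA value rather than a Ceph release name; a sketch (the pocket shown assumes a Focal base, so adjust it to the series in use):
```
# Point the charm at a UCA pocket (an OpenStack release) instead of "quincy".
juju config ceph-mon source=cloud:focal-yoga
```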

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/903768
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/03868b2c9f9070499d079a9e64b45ebd3c4fb189
Submitter: "Zuul (22348)"
Branch: master

commit 03868b2c9f9070499d079a9e64b45ebd3c4fb189
Author: Luciano Lo Giudice <email address hidden>
Date: Fri Dec 15 13:09:17 2023 -0300

    Revert default source to 'bobcat'

    The Openstack libs don't recognize Ceph releases when specifying
    the charm source. Instead, we have to use an Openstack release.
    Since it was set to quincy, reset it to bobcat.

    Closes-Bug: #2026651
    Change-Id: Ibac09d2bf77eeba69789434eaa6112c2028fbf64

Changed in charm-ceph-mon:
status: In Progress → Fix Committed