Migration from pacific (cloud:focal-xena) to quincy (cloud:focal-yoga) fails with invalid migration path

Bug #2007976 reported by Diko Parvanov
This bug affects 3 people
Affects: Ceph Monitor Charm (status tracked in Trunk)
  Quincy.2: Fix Released, Importance: Undecided, Assigned to: Unassigned
  Trunk: Fix Released, Importance: High, Assigned to: Unassigned

Bug Description

Running on focal with pacific, fully upgraded from octopus via the source config (cloud:focal-wallaby -> cloud:focal-xena), worked. Upgrading further from cloud:focal-xena to cloud:focal-yoga fails with:

unit-ceph-mon-2: 12:07:17 INFO unit.ceph-mon/2.juju-log old_version: octopus
unit-ceph-mon-2: 12:07:17 INFO unit.ceph-mon/2.juju-log new_version: quincy
unit-ceph-mon-2: 12:07:17 ERROR unit.ceph-mon/2.juju-log Invalid upgrade path from octopus to quincy. Valid paths are: ['firefly -> hammer', 'hammer -> jewel', 'jewel -> luminous', 'luminous -> mimic', 'mimic -> nautilus', 'nautilus -> octopus', 'octopus -> pacific', 'pacific -> quincy']

Using quincy/stable charms rev 149.

Ceph cluster (mons and osds) fully upgraded to pacific.

ceph -v
ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

sudo ceph osd dump | grep require_osd_release
require_osd_release pacific

Tags: bseng-1078
Diko Parvanov (dparv)
description: updated
Revision history for this message
Diko Parvanov (dparv) wrote :

Setting source back to cloud:focal-xena, we get in the logs:

unit-ceph-mon-0: 13:40:53 INFO unit.ceph-mon/0.juju-log old_version: octopus
unit-ceph-mon-0: 13:40:53 INFO unit.ceph-mon/0.juju-log new_version: pacific

but ceph is already on pacific, so it's getting the old_version from somewhere that's not up to date. I checked config keys, relation data, and leader settings; none of them have it.

Revision history for this message
Diko Parvanov (dparv) wrote :

Looking at the charm code, it looks like in check_for_upgrade() the c.previous('source') call in
old_version = ceph.resolve_ceph_version(c.previous('source') or 'distro')

does not return the previous value of cloud:focal-xena, so the lookup falls back to 'distro', which resolves to octopus on focal and fails the upgrade.
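
For context, a minimal sketch of that lookup (not the charm's exact code; the charms_ceph import path is an assumption), showing why a stale .juju-persistent-config file makes the 'distro' fallback kick in:

import charms_ceph.utils as ceph            # assumed import path for resolve_ceph_version
from charmhelpers.core import hookenv

def check_for_upgrade_sketch():
    # hookenv.config() returns a Config object backed by the
    # .juju-persistent-config file in the charm directory.
    c = hookenv.config()
    # If that file was never rewritten, previous('source') does not hold
    # cloud:focal-xena, so the lookup falls back to 'distro', which
    # resolves to octopus on focal.
    old_version = ceph.resolve_ceph_version(c.previous('source') or 'distro')
    new_version = ceph.resolve_ceph_version(hookenv.config('source') or 'distro')
    return old_version, new_version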

Revision history for this message
Diko Parvanov (dparv) wrote :

This ugly "patch" unblocked the migrations https://pastebin.canonical.com/p/Sw5dR4zk4g/ and then we reverted it back.

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Luciano Lo Giudice (lmlogiudice) wrote :

Hello Diko,

Can you tell me if the following message appears in the ceph-mon logs? "Found but was unable to parse previous config data"

The reason I ask is that this message shows up when the previous configuration couldn't be loaded, which would explain why fetching the previous series is failing.

Andrea Ieri (aieri)
tags: added: bseng-1078
Revision history for this message
Diko Parvanov (dparv) wrote :

I have not seen this log entry, and unfortunately this was partner cloud and re-deployed a couple of times already, so no logs are available.

Revision history for this message
JamesLin (jneo8) wrote (last edit ):

I think .juju-persistent-config is broken in some of the release channels (stable channel looks fine). I can easily reproduce it on quincy/stable:

https://pastebin.ubuntu.com/p/mvWdTprj7g/

The .juju-persistent-config file is not updated after a config change.

So the charm can still read the previous config, but it's the wrong value, which breaks the ceph upgrade process because we use the previous config to compare versions.

I still need to trace where it breaks; the stable channel is currently fine, so it looks like a bug in the charm's code.

Revision history for this message
Diko Parvanov (dparv) wrote :

IMO the charm shouldn't rely on previous config at all, but should instead parse the actually installed package version from ceph --version, e.g.:
ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)
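
A rough sketch of that suggestion (hypothetical helper, not the merged fix), deriving the installed release name from the ceph --version output quoted above:

import re
import subprocess

def installed_ceph_release():
    # Example output:
    #   ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)
    out = subprocess.check_output(['ceph', '--version'], text=True)
    match = re.search(r'ceph version \S+ \([0-9a-f]+\)\s+(\w+)', out)
    return match.group(1) if match else None  # e.g. 'pacific'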

Revision history for this message
JamesLin (jneo8) wrote (last edit ):

So here we have two issues:

1. The .juju-persistent-config file not being updated
2. The way we compare the old and new versions

Revision history for this message
JamesLin (jneo8) wrote (last edit ):

I think I found the reason why the persistent config is not updated.

If you deploy the charm from the stable/quincy channel and check the dispatch file in `/var/lib/juju/agents/unit-ceph-mon-0/charm/dispatch`, it shows that it executes `./src/charm.py`, which is an operator (ops framework) charm.

I'm not sure how this magic happens, but in the end it triggers the config-changed hook. However, the logic that saves the previous config to the persistent file is handled by charmhelpers.core.hookenv.Hooks after the hook finishes, see [0]. Obviously the Hooks.execute function is not triggered when the hook is run this way.

# Solution?

1. We can simply save the config at the end of the config-changed hook, which will update the persistent config (a rough sketch follows after this comment).
2. As Diko suggested, we can change how we determine the current ceph version.

---

[0]: https://git.launchpad.net/charm-helpers/tree/charmhelpers/core/hookenv.py#n957
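
A rough sketch of option 1 above (the handler name is hypothetical, and this is not the merged change): explicitly saving the charm-helpers Config at the end of config-changed writes the current values to .juju-persistent-config, so the next hook's previous() lookups see them.

from charmhelpers.core import hookenv

def on_config_changed(event):   # hypothetical ops event handler
    cfg = hookenv.config()      # charm-helpers Config, backed by .juju-persistent-config
    # ... existing config-changed logic ...
    cfg.save()                  # persist current values for later previous() calls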

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

If this is about using hookenv.Config from an ops framework charm, then yes, the charm code will need to call hookenv._run_atstart() and hookenv._run_atexit() manually, as the ops framework doesn't do it automatically.

In classic charms, the hookenv.Hooks() class makes the calls automatically, and in charms.reactive, the charms.reactive.__init__.py file calls hookenv._run_atexit() automatically as well.

Please also see this bug for more details: https://github.com/juju/charm-helpers/issues/772

So, @jneo8, good sleuthing! I'd say your suspicions are almost certainly correct.

The fix is probably to add a call to _run_atexit() in a 'pre_commit' or 'commit' handler in the ops framework.
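
A rough sketch of that idea (class and handler names are illustrative, not the merged change), wiring the charm-helpers atstart/atexit machinery into the ops framework's commit event:

from charmhelpers.core import hookenv
from ops.charm import CharmBase
from ops.main import main

class CephMonCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # hookenv.Hooks.execute() normally drives these; an ops charm has to
        # call them itself.
        hookenv._run_atstart()
        self.framework.observe(self.framework.on.commit, self._on_commit)

    def _on_commit(self, _event):
        # Runs the deferred charm-helpers callbacks, including the implicit
        # Config save, so .juju-persistent-config is updated after every hook.
        hookenv._run_atexit()

if __name__ == '__main__':
    main(CephMonCharm)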

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/881693

Revision history for this message
JamesLin (jneo8) wrote :

Hi @Alex, I just created a PR for the persistent config; the upgrade should be fine after this fix, but should we discuss the other part, "how we compare the ceph version"?

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

@jneo8; hi, I've added a comment/vote on the review. To recap: this type of work needs to be fixed on the main development branch first and then backported to stable branches. I'm going to defer to Chris/Peter about the changes to the ceph charms, as they are the main developers on them. Thanks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-mon (master)

Change abandoned by "JamesLin <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/881792
Reason: Wrong change-id

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/881793
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/e99c38ae4cc6c89aa7818d38990cc9b2dfb5060d
Submitter: "Zuul (22348)"
Branch: master

commit e99c38ae4cc6c89aa7818d38990cc9b2dfb5060d
Author: jneo8 <email address hidden>
Date: Thu Apr 27 14:31:39 2023 +0800

    Fix persistent config file not update bug

    When ceph is doing the version upgrade, it checks the previous ceph
    release from the `source` config variable, which is stored in the
    persistent file. But the persistent file update is broken: we use
    hookenv.Config from an ops framework charm, and hookenv._run_atexit,
    which saves the change to the file, is never called.

    Partial-Bug: #2007976
    Change-Id: Ibf12a2b87736cb1d32788672fb390e027f15b936
    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1047

Eric Chen (eric-chen)
Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-mon (stable/quincy.2)

Change abandoned by "JamesLin <email address hidden>" on branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/881693
Reason: wrong process for backport.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/885477

Revision history for this message
Boris Lukashev (rageltman) wrote :

I think this (https://bugs.launchpad.net/charm-ceph-mon/+bug/2026651) might be related - apparently the charms deploy Octopus when source is set to 'distro' and the channel is 'pacific/stable' while the OS is Focal. Following the upgrade docs creates a problem wherein the charms "refresh to latest", report "pacific", but show 15.x... subsequently everything goes off the rails, since the OS upgrade step permits the "distro" version to go up to Quincy, but only the OSDs can actually upgrade to 17.x (the mon and radosgw upgrades "succeed" but still show 15.x).

Attempting to change the channel to any of the cloud:os-stackver options breaks the version detection, showing "None" as the current version as well as "None" for the upgrade target.

Revision history for this message
Gabriel Cocenza (gabrielcocenza) wrote :

I also found a bug that looks like this one, described at https://bugs.launchpad.net/charm-ceph-mon/+bug/2007976
