in _all_ceph_versions_same: if len(versions_dict['osd']) < 1: KeyError: 'osd'

Bug #2058636 reported by Nobuto Murata

Affects             Status        Importance  Assigned to  Milestone
Ceph Monitor Charm  Fix Released  Undecided   Unassigned
Quincy.2            Fix Released  Undecided   Unassigned

Bug Description

ceph-mon quincy/stable 201

There seems to be a race condition during deployment. I saw it twice in a row today, and I'm not sure whether it's related to the recently completed SRU.

In any case, I believe I saw a similar (or the same?) failure before in CI: https://review.opendev.org/c/openstack/charm-ceph-mon/+/896951

unit-ceph-mon-2: 09:58:15 INFO unit.ceph-mon/2.juju-log mon:1: Executing post-ceph-osd upgrade commands.
unit-ceph-mon-2: 09:58:15 ERROR unit.ceph-mon/2.juju-log mon:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/./src/charm.py", line 317, in <module>
    main(CephMonCharm)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/./src/charm.py", line 132, in on_mon_relation
    if hooks.mon_relation():
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 558, in mon_relation
    notify_relations()
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 631, in notify_relations
    notify_osds()
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 652, in notify_osds
    osd_relation(relid=relid, unit=unit)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 897, in osd_relation
    execute_post_osd_upgrade_steps(ceph_osd_releases[0])
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/utils.py", line 321, in execute_post_osd_upgrade_steps
    if (_all_ceph_versions_same() and
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/utils.py", line 352, in _all_ceph_versions_same
    if len(versions_dict['osd']) < 1:
KeyError: 'osd'
unit-ceph-mon-2: 09:58:16 ERROR juju.worker.uniter.operation hook "mon-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
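
For context on the failure: the charm inspects the output of `ceph versions`, which maps each daemon type to a {version string: daemon count} dict. While the cluster is still bootstrapping, no OSDs have registered yet, so the 'osd' key can be missing entirely and indexing it directly raises the KeyError seen above. A minimal sketch of that shape (Python; the version strings and counts are illustrative):

    import json
    import subprocess

    def ceph_versions() -> dict:
        """Parse `ceph versions`: daemon type -> {version string: count}."""
        return json.loads(subprocess.check_output(['ceph', 'versions']))

    # Early in bootstrap the map may contain only mon/mgr entries, e.g.:
    versions_dict = {
        'mon': {'ceph version 17.2.7 (...) quincy (stable)': 3},
        'mgr': {'ceph version 17.2.7 (...) quincy (stable)': 1},
        'overall': {'ceph version 17.2.7 (...) quincy (stable)': 4},
    }

    # The check in utils.py indexes the key directly, reproducing the
    # traceback above:
    #     if len(versions_dict['osd']) < 1:   # KeyError: 'osd'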

$ juju show-status-log ceph-mon/2
Time Type Status Message
21 Mar 2024 09:57:29Z juju-unit executing running mon-relation-joined hook for ceph-mon/1
21 Mar 2024 09:57:30Z workload blocked Unit not clustered (no quorum)
21 Mar 2024 09:57:31Z juju-unit executing running mon-relation-changed hook for ceph-mon/1
21 Mar 2024 09:57:31Z workload maintenance Bootstrapping MON cluster
21 Mar 2024 09:57:54Z workload maintenance Bootstrapping Ceph MGR
21 Mar 2024 09:58:16Z juju-unit error hook failed: "mon-relation-changed"
21 Mar 2024 09:58:21Z juju-unit executing running mon-relation-changed hook for ceph-mon/1
21 Mar 2024 09:58:31Z juju-unit executing running osd-relation-joined hook for ceph-osd/1
21 Mar 2024 09:58:39Z juju-unit executing running osd-relation-changed hook for ceph-osd/1
21 Mar 2024 09:58:46Z workload waiting Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
21 Mar 2024 09:58:47Z juju-unit executing running osd-relation-joined hook for ceph-osd/2
21 Mar 2024 10:01:08Z juju-unit executing running osd-relation-changed hook for ceph-osd/2
21 Mar 2024 10:01:21Z juju-unit executing running osd-relation-changed hook for ceph-osd/0
21 Mar 2024 10:01:32Z juju-unit idle
21 Mar 2024 10:01:59Z juju-unit executing running mds-relation-joined hook for ceph-fs/0
21 Mar 2024 10:02:02Z juju-unit executing running mds-relation-changed hook for ceph-fs/0
21 Mar 2024 10:02:04Z juju-unit idle
21 Mar 2024 10:02:08Z juju-unit executing running mds-relation-changed hook for ceph-fs/0
21 Mar 2024 10:02:22Z juju-unit idle
21 Mar 2024 10:53:51Z workload active Unit is ready and clustered

Revision history for this message
Luciano Lo Giudice (lmlogiudice) wrote :

Nobuto, are you seeing this during deploy time or when a specific hook or event runs? We currently have some code in place that retries the part that failed because of race conditions, so it could be a matter of increasing the retries/backoff, or of using a different strategy altogether.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The error occurs at deployment time, specifically when bootstrapping the MON cluster.

21 Mar 2024 09:57:31Z workload maintenance Bootstrapping MON cluster
21 Mar 2024 09:57:54Z workload maintenance Bootstrapping Ceph MGR
21 Mar 2024 09:58:16Z juju-unit error hook failed: "mon-relation-changed"

I don't think https://review.opendev.org/c/openstack/charm-ceph-mon/+/896951 was backported to the stable branches. Doing so would mitigate the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/913941

Revision history for this message
Nobuto Murata (nobuto) wrote :

It's observed in the CI job too:

https://review.opendev.org/c/openstack/charm-ceph-mon/+/913814/1#message-0ca38a491712c9c6811160afa15c12337aca3ece

2024-03-22 00:14:23.622503 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: 2024-03-22 00:14:20 ERROR unit.ceph-mon/1.juju-log server.go:316 mon:0: Uncaught exception while in charm code:
2024-03-22 00:14:23.622591 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: Traceback (most recent call last):
2024-03-22 00:14:23.622706 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/./src/charm.py", line 317, in <module>
2024-03-22 00:14:23.622795 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: main(CephMonCharm)
2024-03-22 00:14:23.622909 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/main.py", line 436, in main
2024-03-22 00:14:23.623006 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: _emit_charm_event(charm, dispatcher.event_name)
2024-03-22 00:14:23.623125 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/main.py", line 144, in _emit_charm_event
2024-03-22 00:14:23.623214 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: event_to_emit.emit(*args, **kwargs)
2024-03-22 00:14:23.623309 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/framework.py", line 351, in emit
2024-03-22 00:14:23.623396 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: framework._emit(event)
2024-03-22 00:14:23.623483 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/framework.py", line 853, in _emit
2024-03-22 00:14:23.623570 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: self._reemit(event_path)
2024-03-22 00:14:23.623658 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/framework.py", line 942, in _reemit
2024-03-22 00:14:23.623778 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: custom_handler(event)
2024-03-22 00:14:23.623879 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/./src/charm.py", line 132, in on_mon_relation
2024-03-22 00:14:23.623993 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: if hooks.mon_relation():
2024-03-22 00:14:23.624085 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/src/ceph_hooks.py", line 558, in mon_relation
2024-03-22 00:14:23.624172 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: notify_relations()
2024-03-22 00:14:23.624260 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/src/ceph_hooks.py", line 631, in notify_relations
2024-03-22 00:14:23.624361 | fo...


Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high, since it has been blocking other fixes from landing on the stable branch for more than 7 days now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/913941
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/f3d290b55dce0a44b638ed600c8192472874f65f
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit f3d290b55dce0a44b638ed600c8192472874f65f
Author: Peter Sabaini <email address hidden>
Date: Fri Sep 29 15:30:00 2023 +0200

    Fix version retrieval

    During cluster deployment a situation can arise where there are
    already osd relations but osds are not yet fully added to the cluster.
    This can make version retrieval fail for osds. Retry version retrieval
    to give the cluster a chance to settle.

    Conflicts:
            src/utils.py
            tests/bundles/jammy-zed.yaml

    Closes-Bug: #2058636

    Change-Id: I12a1bcd32be2ed8a8e5ee0e304f716f5a190bd57
    (cherry picked from commit 55beb2504d3ea6d7f522d8d9a46bef7d741f1edc)
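
To illustrate the retry approach this commit describes, here is a sketch (not the charm's literal code) using the tenacity library: the version query is retried with backoff while the 'osd' key is still absent, giving the cluster a chance to settle.

    import json
    import subprocess

    import tenacity

    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(KeyError),
        wait=tenacity.wait_exponential(multiplier=2, max=32),
        stop=tenacity.stop_after_attempt(6),
        reraise=True,
    )
    def get_osd_versions() -> dict:
        """Return the 'osd' entry of `ceph versions`, retrying while absent."""
        versions = json.loads(subprocess.check_output(['ceph', 'versions']))
        return versions['osd']  # KeyError until OSDs register -> retry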

Nobuto Murata (nobuto)
Changed in charm-ceph-mon:
status: New → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914836

Nobuto Murata (nobuto)
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Revision history for this message
Nobuto Murata (nobuto) wrote :

FWIW, 17.2.7 is yet another big stable release of Quincy, with 1,446 commits including many backports from Reef.
https://github.com/ceph/ceph/compare/v17.2.6...v17.2.7

And Ubuntu introduced it roughly 10 days ago.
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2043336

That explains, at least in part, why we needed to backport multiple charm patches from Reef to Quincy, although I cannot easily pinpoint the responsible upstream commit out of the 1,000+.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914836
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/01bd228242a1fe437e9f0f544aa4b4e093c56e2d
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 01bd228242a1fe437e9f0f544aa4b4e093c56e2d
Author: Luciano Lo Giudice <email address hidden>
Date: Wed Jan 3 18:10:30 2024 -0300

    Retry setting rbd_stats_pools prometheus config

    Setting the 'mgr/prometheus/rbd_stats_pools' option can fail
    if we arrive too early, even if the cluster is bootstrapped. This is
    particularly seen in ceph-radosgw test runs. This patchset thus
    adds a retry decorator to work around this issue.

    Related-Bug: #2042405
    Related-Bug: #2058636

    Change-Id: Id9b7b903e67154e7d2bb6fecbeef7fac126804a8
    (cherry picked from commit d76939ef70bd5016a6e515558de1b9eabe9d0d55)
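
As an illustration of the workaround this patch describes, a sketch of a retry loop around setting the option (assuming it is applied via `ceph config set`; the function name and timings are hypothetical):

    import subprocess
    import time

    def set_rbd_stats_pools(pools: str, attempts: int = 5,
                            delay: float = 3.0) -> None:
        """Set mgr/prometheus/rbd_stats_pools, retrying while the mgr settles."""
        for attempt in range(1, attempts + 1):
            try:
                subprocess.check_call([
                    'ceph', 'config', 'set', 'mgr',
                    'mgr/prometheus/rbd_stats_pools', pools,
                ])
                return
            except subprocess.CalledProcessError:
                if attempt == attempts:
                    raise  # give up after the final attempt
                time.sleep(delay)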

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914835
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/a65d9e22fb8e710d4d61a7b104ca3b1fa3072af7
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit a65d9e22fb8e710d4d61a7b104ca3b1fa3072af7
Author: Peter Sabaini <email address hidden>
Date: Tue Jan 16 11:21:07 2024 +0100

    Don't error out on missing OSDs

    Ceph reef has a behaviour change where it doesn't always return
    version keys for all components. In
    I12a1bcd32be2ed8a8e5ee0e304f716f5a190bd57 an attempt was made to fix
    this by retrying, however this code path can also be hit when a
    component such as OSDs are absent. While a cluster without OSDs
    wouldn't be functional it still should not cause the charm to error.

    As a fix, just make the OSD component optional when querying for a
    version instead of retrying.

    Closes-Bug: #2058636

    Resolved Conflicts:
            src/utils.py

    Change-Id: I5524896c7ad944f6f22fb1498ab0069397b52418
    (cherry picked from commit 1c9f3b210d8bf8904143647443133cf35f48d8b7)
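
In other words, the final fix replaces the direct key access with an optional lookup rather than retrying. A hypothetical reconstruction of the check's new shape, based on the commit message above:

    def all_ceph_versions_same(versions_dict: dict) -> bool:
        """True when all daemons report a single version and OSDs are present.

        The 'osd' key is optional: `ceph versions` omits it while no OSDs
        are registered, and that must not crash the charm.
        """
        if len(versions_dict.get('overall', {})) > 1:
            return False  # mixed versions somewhere in the cluster
        if len(versions_dict.get('osd', {})) < 1:
            return False  # no OSDs reporting yet; skip post-upgrade steps
        return True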

Revision history for this message
Nobuto Murata (nobuto) wrote :

Should be good by now.

https://launchpad.net/~openstack-charmers/charm-ceph-mon/+charm/charm-ceph-mon.stable-quincy.2.quincy/+build/22328

$ juju info ceph-mon | grep quincy/stable
  quincy/stable: 204 2024-04-02 (204) 8MB amd64, arm64, ppc64el, s390x ubuntu@20.04, ubuntu@22.04, ubuntu@22.10, ubuntu@23.04
