in _all_ceph_versions_same: if len(versions_dict['osd']) < 1: KeyError: 'osd'

Bug #2058636 reported by Nobuto Murata

Affects             Status        Importance  Assigned to  Milestone
Ceph Monitor Charm  Fix Released  Undecided   Unassigned
Quincy.2            Fix Released  Undecided   Unassigned

Bug Description

ceph-mon quincy/stable 201

There seems to be a race condition during deployment. I saw it twice in a row today, and I'm not sure whether it's related to the recently completed SRU.

In any case, I believe I saw a similar (or the same?) failure before in CI: https://review.opendev.org/c/openstack/charm-ceph-mon/+/896951

unit-ceph-mon-2: 09:58:15 INFO unit.ceph-mon/2.juju-log mon:1: Executing post-ceph-osd upgrade commands.
unit-ceph-mon-2: 09:58:15 ERROR unit.ceph-mon/2.juju-log mon:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/./src/charm.py", line 317, in <module>
    main(CephMonCharm)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/./src/charm.py", line 132, in on_mon_relation
    if hooks.mon_relation():
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 558, in mon_relation
    notify_relations()
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 631, in notify_relations
    notify_osds()
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 652, in notify_osds
    osd_relation(relid=relid, unit=unit)
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/ceph_hooks.py", line 897, in osd_relation
    execute_post_osd_upgrade_steps(ceph_osd_releases[0])
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/utils.py", line 321, in execute_post_osd_upgrade_steps
    if (_all_ceph_versions_same() and
  File "/var/lib/juju/agents/unit-ceph-mon-2/charm/src/utils.py", line 352, in _all_ceph_versions_same
    if len(versions_dict['osd']) < 1:
KeyError: 'osd'
unit-ceph-mon-2: 09:58:16 ERROR juju.worker.uniter.operation hook "mon-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
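
For context on the failure: the charm inspects the output of `ceph versions`, which maps each daemon type to a {version string: daemon count} dict. While the cluster is still bootstrapping, no OSDs have registered yet, so the 'osd' key can be missing entirely and indexing it directly raises the KeyError seen above. A minimal sketch of that shape (Python; the version strings and counts are illustrative):

    import json
    import subprocess

    def ceph_versions() -> dict:
        """Parse `ceph versions`: daemon type -> {version string: count}."""
        return json.loads(subprocess.check_output(['ceph', 'versions']))

    # Early in bootstrap the map may contain only mon/mgr entries, e.g.:
    versions_dict = {
        'mon': {'ceph version 17.2.7 (...) quincy (stable)': 3},
        'mgr': {'ceph version 17.2.7 (...) quincy (stable)': 1},
        'overall': {'ceph version 17.2.7 (...) quincy (stable)': 4},
    }

    # The check in utils.py indexes the key directly, reproducing the
    # traceback above:
    #     if len(versions_dict['osd']) < 1:   # KeyError: 'osd'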

$ juju show-status-log ceph-mon/2
Time Type Status Message
21 Mar 2024 09:57:29Z juju-unit executing running mon-relation-joined hook for ceph-mon/1
21 Mar 2024 09:57:30Z workload blocked Unit not clustered (no quorum)
21 Mar 2024 09:57:31Z juju-unit executing running mon-relation-changed hook for ceph-mon/1
21 Mar 2024 09:57:31Z workload maintenance Bootstrapping MON cluster
21 Mar 2024 09:57:54Z workload maintenance Bootstrapping Ceph MGR
21 Mar 2024 09:58:16Z juju-unit error hook failed: "mon-relation-changed"
21 Mar 2024 09:58:21Z juju-unit executing running mon-relation-changed hook for ceph-mon/1
21 Mar 2024 09:58:31Z juju-unit executing running osd-relation-joined hook for ceph-osd/1
21 Mar 2024 09:58:39Z juju-unit executing running osd-relation-changed hook for ceph-osd/1
21 Mar 2024 09:58:46Z workload waiting Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
21 Mar 2024 09:58:47Z juju-unit executing running osd-relation-joined hook for ceph-osd/2
21 Mar 2024 10:01:08Z juju-unit executing running osd-relation-changed hook for ceph-osd/2
21 Mar 2024 10:01:21Z juju-unit executing running osd-relation-changed hook for ceph-osd/0
21 Mar 2024 10:01:32Z juju-unit idle
21 Mar 2024 10:01:59Z juju-unit executing running mds-relation-joined hook for ceph-fs/0
21 Mar 2024 10:02:02Z juju-unit executing running mds-relation-changed hook for ceph-fs/0
21 Mar 2024 10:02:04Z juju-unit idle
21 Mar 2024 10:02:08Z juju-unit executing running mds-relation-changed hook for ceph-fs/0
21 Mar 2024 10:02:22Z juju-unit idle
21 Mar 2024 10:53:51Z workload active Unit is ready and clustered

Revision history for this message
Luciano Lo Giudice (lmlogiudice) wrote :

Nobuto, are you seeing this during deploy time or when a specific hook or event runs? We currently have some code in place that retries the part that failed because of race conditions, so it could be a matter of increasing the retries/backoff, or of using a different strategy altogether.

Revision history for this message
Nobuto Murata (nobuto) wrote :

The error occurs at deployment time, specifically when bootstrapping the MON cluster.

21 Mar 2024 09:57:31Z workload maintenance Bootstrapping MON cluster
21 Mar 2024 09:57:54Z workload maintenance Bootstrapping Ceph MGR
21 Mar 2024 09:58:16Z juju-unit error hook failed: "mon-relation-changed"

I don't think https://review.opendev.org/c/openstack/charm-ceph-mon/+/896951 was backported to the stable branches. Doing so would mitigate the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/913941

Revision history for this message
Nobuto Murata (nobuto) wrote :

It's observed in the CI job too:

https://review.opendev.org/c/openstack/charm-ceph-mon/+/913814/1#message-0ca38a491712c9c6811160afa15c12337aca3ece

2024-03-22 00:14:23.622503 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: 2024-03-22 00:14:20 ERROR unit.ceph-mon/1.juju-log server.go:316 mon:0: Uncaught exception while in charm code:
2024-03-22 00:14:23.622591 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: Traceback (most recent call last):
2024-03-22 00:14:23.622706 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/./src/charm.py", line 317, in <module>
2024-03-22 00:14:23.622795 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: main(CephMonCharm)
2024-03-22 00:14:23.622909 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/main.py", line 436, in main
2024-03-22 00:14:23.623006 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: _emit_charm_event(charm, dispatcher.event_name)
2024-03-22 00:14:23.623125 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/main.py", line 144, in _emit_charm_event
2024-03-22 00:14:23.623214 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: event_to_emit.emit(*args, **kwargs)
2024-03-22 00:14:23.623309 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/framework.py", line 351, in emit
2024-03-22 00:14:23.623396 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: framework._emit(event)
2024-03-22 00:14:23.623483 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/framework.py", line 853, in _emit
2024-03-22 00:14:23.623570 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: self._reemit(event_path)
2024-03-22 00:14:23.623658 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/venv/ops/framework.py", line 942, in _reemit
2024-03-22 00:14:23.623778 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: custom_handler(event)
2024-03-22 00:14:23.623879 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/./src/charm.py", line 132, in on_mon_relation
2024-03-22 00:14:23.623993 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: if hooks.mon_relation():
2024-03-22 00:14:23.624085 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/src/ceph_hooks.py", line 558, in mon_relation
2024-03-22 00:14:23.624172 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: notify_relations()
2024-03-22 00:14:23.624260 | focal-medium | 2024-03-22 00:14:23 [ERROR] unit-ceph-mon-1.log: File "/var/lib/juju/agents/unit-ceph-mon-1/charm/src/ceph_hooks.py", line 631, in notify_relations
2024-03-22 00:14:23.624361 | fo...


Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high, since it has been blocking other fixes from landing on the stable branch for more than 7 days now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/913941
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/f3d290b55dce0a44b638ed600c8192472874f65f
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit f3d290b55dce0a44b638ed600c8192472874f65f
Author: Peter Sabaini <email address hidden>
Date: Fri Sep 29 15:30:00 2023 +0200

    Fix version retrieval

    During cluster deployment a situation can arise where there are
    already osd relations but osds are not yet fully added to the cluster.
    This can make version retrieval fail for osds. Retry version retrieval
    to give the cluster a chance to settle.

    Conflicts:
            src/utils.py
            tests/bundles/jammy-zed.yaml

    Closes-Bug: #2058636

    Change-Id: I12a1bcd32be2ed8a8e5ee0e304f716f5a190bd57
    (cherry picked from commit 55beb2504d3ea6d7f522d8d9a46bef7d741f1edc)
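
To illustrate the retry approach this commit describes, here is a sketch (not the charm's literal code) using the tenacity library: the version query is retried with backoff while the 'osd' key is still absent, giving the cluster a chance to settle.

    import json
    import subprocess

    import tenacity

    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(KeyError),
        wait=tenacity.wait_exponential(multiplier=2, max=32),
        stop=tenacity.stop_after_attempt(6),
        reraise=True,
    )
    def get_osd_versions() -> dict:
        """Return the 'osd' entry of `ceph versions`, retrying while absent."""
        versions = json.loads(subprocess.check_output(['ceph', 'versions']))
        return versions['osd']  # KeyError until OSDs register -> retry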

Nobuto Murata (nobuto)
Changed in charm-ceph-mon:
status: New → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914836

Nobuto Murata (nobuto)
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Revision history for this message
Nobuto Murata (nobuto) wrote :

FWIW, 17.2.7 is yet another big stable release of Quincy, with 1,446 commits including many backports from Reef.
https://github.com/ceph/ceph/compare/v17.2.6...v17.2.7

And Ubuntu introduced it roughly 10 days ago.
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2043336

That explains, at least in part, why we needed to backport multiple charm patches from Reef to Quincy, although I cannot easily pinpoint the responsible upstream commit out of the 1,000+.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914836
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/01bd228242a1fe437e9f0f544aa4b4e093c56e2d
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 01bd228242a1fe437e9f0f544aa4b4e093c56e2d
Author: Luciano Lo Giudice <email address hidden>
Date: Wed Jan 3 18:10:30 2024 -0300

    Retry setting rbd_stats_pools prometheus config

    Setting the 'mgr/prometheus/rbd_stats_pools' option can fail
    if we arrive too early, even if the cluster is bootstrapped. This is
    particularly seen in ceph-radosgw test runs. This patchset thus
    adds a retry decorator to work around this issue.

    Related-Bug: #2042405
    Related-Bug: #2058636

    Change-Id: Id9b7b903e67154e7d2bb6fecbeef7fac126804a8
    (cherry picked from commit d76939ef70bd5016a6e515558de1b9eabe9d0d55)
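
As an illustration of the workaround this patch describes, a sketch of a retry loop around setting the option (assuming it is applied via `ceph config set`; the function name and timings are hypothetical):

    import subprocess
    import time

    def set_rbd_stats_pools(pools: str, attempts: int = 5,
                            delay: float = 3.0) -> None:
        """Set mgr/prometheus/rbd_stats_pools, retrying while the mgr settles."""
        for attempt in range(1, attempts + 1):
            try:
                subprocess.check_call([
                    'ceph', 'config', 'set', 'mgr',
                    'mgr/prometheus/rbd_stats_pools', pools,
                ])
                return
            except subprocess.CalledProcessError:
                if attempt == attempts:
                    raise  # give up after the final attempt
                time.sleep(delay)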

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914835
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/a65d9e22fb8e710d4d61a7b104ca3b1fa3072af7
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit a65d9e22fb8e710d4d61a7b104ca3b1fa3072af7
Author: Peter Sabaini <email address hidden>
Date: Tue Jan 16 11:21:07 2024 +0100

    Don't error out on missing OSDs

    Ceph reef has a behaviour change where it doesn't always return
    version keys for all components. In
    I12a1bcd32be2ed8a8e5ee0e304f716f5a190bd57 an attempt was made to fix
    this by retrying, however this code path can also be hit when a
    component such as OSDs are absent. While a cluster without OSDs
    wouldn't be functional it still should not cause the charm to error.

    As a fix, just make the OSD component optional when querying for a
    version instead of retrying.

    Closes-Bug: #2058636

    Resolved Conflicts:
            src/utils.py

    Change-Id: I5524896c7ad944f6f22fb1498ab0069397b52418
    (cherry picked from commit 1c9f3b210d8bf8904143647443133cf35f48d8b7)
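
In other words, the final fix replaces the direct key access with an optional lookup rather than retrying. A hypothetical reconstruction of the check's new shape, based on the commit message above:

    def all_ceph_versions_same(versions_dict: dict) -> bool:
        """True when all daemons report a single version and OSDs are present.

        The 'osd' key is optional: `ceph versions` omits it while no OSDs
        are registered, and that must not crash the charm.
        """
        if len(versions_dict.get('overall', {})) > 1:
            return False  # mixed versions somewhere in the cluster
        if len(versions_dict.get('osd', {})) < 1:
            return False  # no OSDs reporting yet; skip post-upgrade steps
        return True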

Revision history for this message
Nobuto Murata (nobuto) wrote :

Should be good by now.

https://launchpad.net/~openstack-charmers/charm-ceph-mon/+charm/charm-ceph-mon.stable-quincy.2.quincy/+build/22328

$ juju info ceph-mon | grep quincy/stable
  quincy/stable: 204 2024-04-02 (204) 8MB amd64, arm64, ppc64el, s390x ubuntu@20.04, ubuntu@22.04, ubuntu@22.10, ubuntu@23.04
