in _all_ceph_versions_same: if len(versions_dict['osd']) < 1: KeyError: 'osd'
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ceph Monitor Charm | Fix Released | Undecided | Unassigned | |
| Quincy.2 | Fix Released | Undecided | Unassigned | |
Bug Description
ceph-mon quincy/stable 201
There seems to be a race condition in a deployment. I saw it twice in a row today, and I'm not sure whether it's related to the recently completed SRU or not.
In any case, I believe I saw a similar (or the same?) failure before in CI: https:/
unit-ceph-mon-2: 09:58:15 INFO unit.ceph-
unit-ceph-mon-2: 09:58:15 ERROR unit.ceph-
Traceback (most recent call last):
File "/var/lib/
main(
File "/var/lib/
_emit_
File "/var/lib/
event_
File "/var/lib/
framework.
File "/var/lib/
self.
File "/var/lib/
custom_
File "/var/lib/
if hooks.mon_
File "/var/lib/
notify_
File "/var/lib/
notify_osds()
File "/var/lib/
osd_
File "/var/lib/
execute_
File "/var/lib/
if (_all_ceph_
File "/var/lib/
if len(versions_dict['osd']) < 1:
KeyError: 'osd'
unit-ceph-mon-2: 09:58:16 ERROR juju.worker.
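For context on why the lookup can fail: `ceph versions` only lists daemon types that have actually reported in, so early in a deployment the JSON has no 'osd' key at all until the first OSD boots. Below is a minimal sketch of a defensive version of the check; it is not the charm's actual code, just an illustration of guarding the lookup with `.get()`, and the version strings in the sample output are illustrative.

```python
import json
import subprocess


def all_osd_versions_present(versions_dict=None) -> bool:
    """Sketch: guard the 'osd' lookup instead of assuming the key exists.

    `ceph versions` omits a daemon type entirely (e.g. 'osd') until at
    least one daemon of that type has booted and reported its version.
    """
    if versions_dict is None:
        out = subprocess.check_output(['ceph', 'versions'])
        versions_dict = json.loads(out.decode())

    # Early in a deployment this returns an empty dict instead of raising
    # KeyError the way versions_dict['osd'] does.
    osd_versions = versions_dict.get('osd', {})
    return len(osd_versions) >= 1


# Example: shape of the output before any OSD has joined the cluster.
early_output = {
    "mon": {"ceph version 17.2.x (...) quincy (stable)": 3},
    "overall": {"ceph version 17.2.x (...) quincy (stable)": 3},
}
assert all_osd_versions_present(early_output) is False
```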
$ juju show-status-log ceph-mon/2
Time Type Status Message
21 Mar 2024 09:57:29Z juju-unit executing running mon-relation-joined hook for ceph-mon/1
21 Mar 2024 09:57:30Z workload blocked Unit not clustered (no quorum)
21 Mar 2024 09:57:31Z juju-unit executing running mon-relation-
21 Mar 2024 09:57:31Z workload maintenance Bootstrapping MON cluster
21 Mar 2024 09:57:54Z workload maintenance Bootstrapping Ceph MGR
21 Mar 2024 09:58:16Z juju-unit error hook failed: "mon-relation-
21 Mar 2024 09:58:21Z juju-unit executing running mon-relation-
21 Mar 2024 09:58:31Z juju-unit executing running osd-relation-joined hook for ceph-osd/1
21 Mar 2024 09:58:39Z juju-unit executing running osd-relation-
21 Mar 2024 09:58:46Z workload waiting Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
21 Mar 2024 09:58:47Z juju-unit executing running osd-relation-joined hook for ceph-osd/2
21 Mar 2024 10:01:08Z juju-unit executing running osd-relation-
21 Mar 2024 10:01:21Z juju-unit executing running osd-relation-
21 Mar 2024 10:01:32Z juju-unit idle
21 Mar 2024 10:01:59Z juju-unit executing running mds-relation-joined hook for ceph-fs/0
21 Mar 2024 10:02:02Z juju-unit executing running mds-relation-
21 Mar 2024 10:02:04Z juju-unit idle
21 Mar 2024 10:02:08Z juju-unit executing running mds-relation-
21 Mar 2024 10:02:22Z juju-unit idle
21 Mar 2024 10:53:51Z workload active Unit is ready and clustered
Changed in charm-ceph-mon:
status: New → Fix Committed
Nobuto, are you seeing this at deploy time or when a specific hook or event runs? We currently have some code in place that retries the part that failed because of race conditions, so it could be a matter of increasing the retries/backoff, or of using a different strategy altogether.
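For reference, here is roughly what a retry/backoff wrapper around that check could look like, sketched with the tenacity library; the function name and the retry parameters are illustrative, not the charm's actual code.

```python
import json
import subprocess

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(KeyError),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    stop=stop_after_attempt(5),
    reraise=True,
)
def get_osd_versions():
    """Re-read `ceph versions` until the 'osd' key appears, backing off between tries."""
    versions = json.loads(subprocess.check_output(['ceph', 'versions']).decode())
    # While no OSD has reported yet, the 'osd' key is missing and this
    # raises KeyError, which triggers the next retry.
    return versions['osd']
```

Increasing the attempt count or the wait bounds would be the "more retries/backoff" option; guarding the lookup with `.get()` as sketched earlier would be the "different strategy".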