Race in OSD setup

Bug #2037072 reported by Peter Sabaini
This bug affects 1 person
Affects: Ceph OSD Charm | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

We sometimes get these tracebacks in CI (edited for brevity):

mon-relation-changed logger.go:60 Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9
mon-relation-changed logger.go:60 Running command: /usr/bin/systemctl enable ceph-volume@lvm-9-a48e34ad-193e-40dd-ac22-17b2a0920877
mon-relation-changed logger.go:60 stderr: Created symlink /<email address hidden> → /lib/systemd/system/ceph-volume@.service.
mon-relation-changed logger.go:60 Running command: /usr/bin/systemctl enable --runtime ceph-osd@9
mon-relation-changed logger.go:60 stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@9.service → /lib/systemd/system/ceph-osd@.service.
mon-relation-changed logger.go:60 Running command: /usr/bin/systemctl start ceph-osd@9
mon-relation-changed logger.go:60 --> ceph-volume lvm activate successful for osd ID: 9
mon-relation-changed logger.go:60 --> ceph-volume lvm create successful for: ceph-a48e34ad-193e-40dd-ac22-17b2a0920877/osd-block-a48e34ad-193e-40dd-ac22-17b2a0920877
mon-relation-changed logger.go:60 Can't get admin socket path: unable to get conf option admin_socket for osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid types are: auth, mon, osd, mds, mgr, client\n"
mon-relation-changed logger.go:60 Traceback (most recent call last):
mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-osd-0/charm/hooks/mon-relation-changed", line 908, in <module>
mon-relation-changed logger.go:60 hooks.execute(sys.argv)
mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-osd-0/charm/hooks/charmhelpers/core/hookenv.py", line 963, in execute
mon-relation-changed logger.go:60 self._hooks[hook_name]()
mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-osd-0/charm/hooks/mon-relation-changed", line 668, in mon_relation
mon-relation-changed logger.go:60 ceph.apply_osd_settings(settings)
mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 3425, in apply_osd_settings
mon-relation-changed logger.go:60 subprocess.check_output(cmd.split()).decode('UTF-8'))
mon-relation-changed logger.go:60 File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
mon-relation-changed logger.go:60 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
mon-relation-changed logger.go:60 File "/usr/lib/python3.10/subprocess.py", line 526, in run
mon-relation-changed logger.go:60 raise CalledProcessError(retcode, process.args,
mon-relation-changed logger.go:60 subprocess.CalledProcessError: Command '['ceph', 'daemon', 'osd.9', 'config', '--format=json', 'get', 'osd_heartbeat_grace']' returned non-zero exit status 22.

I believe this is due to a race in the mon-relation-changed hook: shortly after `prepare_disks_and_activate()` returns, we call `ceph.apply_osd_settings(settings)`.

The prepare call also starts the OSDs, but it does so asynchronously: it returns before the OSD service is fully up. Applying OSD settings, on the other hand, requires the service to be listening on its admin socket. The two steps can therefore race, which I believe results in the error above.
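
One possible way to close the race is to poll the admin socket until it answers before applying settings. The following is only a sketch under assumptions, not the charm's actual fix: the helper name `wait_for_osd_admin_socket` and its `timeout`, `interval`, `run`, `sleep` and `clock` parameters are hypothetical, and the probe command is simply the one from the traceback above.

```python
import subprocess
import time


def wait_for_osd_admin_socket(osd_id, timeout=60.0, interval=2.0,
                              run=subprocess.check_output,
                              sleep=time.sleep, clock=time.monotonic):
    """Poll `ceph daemon osd.<id> config get` until it succeeds.

    Hypothetical helper: retries while the OSD's admin socket is not
    yet answering and re-raises the last error once `timeout` seconds
    have elapsed. Returns the decoded command output on success.
    """
    # Same command that failed with exit status 22 in the traceback.
    cmd = ['ceph', 'daemon', 'osd.{}'.format(osd_id), 'config',
           '--format=json', 'get', 'osd_heartbeat_grace']
    deadline = clock() + timeout
    while True:
        try:
            return run(cmd).decode('UTF-8')
        except (subprocess.CalledProcessError, OSError):
            # Socket not ready (or ceph binary not runnable) yet:
            # give up only once the deadline has passed.
            if clock() >= deadline:
                raise
            sleep(interval)
```

In the hook, such a helper would be called for each newly created OSD ID between `prepare_disks_and_activate()` and `ceph.apply_osd_settings(settings)`. The `run`, `sleep` and `clock` arguments exist only so the retry loop can be exercised without a running cluster.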
