Charm should have option to tune how long vaultlocker and ceph-volume wait for vault unseal
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | Triaged | Medium | Unassigned |
Bug Description
This issue is related to lp#1804261, https://bugs.launchpad.net/charm-ceph-osd/+bug/1804261
On a bionic-ussuri deployment, I found that the timeout for ceph-volume is set to 2 hours via the systemd service file, and that the environment variables for retries and interval discussed in lp#1804261 are not set by either the packages or the charm.
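For reference, ceph-volume's systemd activation helper reads `CEPH_VOLUME_SYSTEMD_TRIES` and `CEPH_VOLUME_SYSTEMD_INTERVAL` to control its retry loop. A minimal sketch of a systemd drop-in that raises both the retry budget and the unit timeout might look like the following; the 24-hour figure is an arbitrary example, not a recommended value, and the drop-in path assumes the stock `ceph-volume@.service` unit name:

```
# /etc/systemd/system/ceph-volume@.service.d/override.conf
# Sketch only: keep ceph-volume retrying while Vault remains sealed.
[Service]
# Defaults are roughly 30 tries at 5 s intervals (~2.5 minutes).
# 2880 tries x 30 s is approximately 24 hours.
Environment=CEPH_VOLUME_SYSTEMD_TRIES=2880
Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=30
# Let the unit itself run that long before systemd gives up on it.
TimeoutStartSec=24h
```

After writing the drop-in, `systemctl daemon-reload` would be needed for it to take effect. Ideally the charm would render something equivalent from a config option rather than requiring operators to manage drop-ins by hand.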
I expect the ceph-osd charm, or the services it configures on the host, to be responsible for monitoring the vault's seal status and to allow either a configurable or an infinite timeout for the vault to come online to decrypt and start ceph-volumes and ceph-osds, so that operators do not have to ssh to cloud nodes after unsealing the vault in a full-datacenter power failure scenario.
In a test power-outage situation, it took longer than the 2-hour timeout to operationally stabilize the mysql cluster after a full power-down before I could unseal the vault, and by then ceph-osd could not start the OSD processes because the vaultlocker-decrypt and ceph-volume services were no longer in an infinite (or long enough) retry state.
This timeout should either be charm-configurable, or there should be an action that brings ceph-osds online after the vault is unsealed, regardless of how long the host has been powered on.
My assumption in a charm-deployed environment is that I should not have to log in to ceph-osd units to manually start services, no matter how long it takes me to recover the health of all of the nodes that run mysql and vault.
Unfortunately, power cycling the osd nodes to work around this isn't favorable in scenarios where ceph-osd is co-located on hosts that run either mysql or vault.
A workaround is mentioned in https://bugs.launchpad.net/charm-ceph-osd/+bug/1804261/comments/48