Charm should have option to tune how long vaultlocker and ceph-volume wait for vault unseal

Bug #1897777 reported by Drew Freiberger
This bug affects 2 people
Affects: Ceph OSD Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

This issue is related to lp#1804261, https://bugs.launchpad.net/charm-ceph-osd/+bug/1804261

In a bionic-ussuri release, I find that the timeout for ceph-volume is set to 2 hours via the systemd service file, but the environment variables for retries and interval discussed in lp#1804261 are not set by either the packages or the charm.
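For context on the mechanism, one host-side sketch of a tunable retry is a systemd drop-in for the ceph-volume template unit. The unit name, variable names, and values here are assumptions based on lp#1804261 and should be checked against the deployed packages (e.g. with `systemctl cat`):

```ini
# /etc/systemd/system/ceph-volume@.service.d/override.conf
# Hypothetical drop-in: keep retrying OSD activation while waiting
# for the vault to be unsealed, instead of giving up after 2 hours.
[Service]
Environment=CEPH_VOLUME_SYSTEMD_TRIES=10000
Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=15
TimeoutStartSec=infinity
```

After adding a drop-in like this, a `systemctl daemon-reload` is needed for it to take effect.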

I expect that the ceph-osd charm, or the services it configures on the host, should be responsible for monitoring the vault's sealed status and should allow either a configurable or an infinite timeout for the vault to come online to decrypt and start ceph-volumes and ceph-osds, so that operators do not have to ssh to cloud nodes after unsealing the vault in full datacenter power-failure scenarios.

In a test power-outage situation, it took longer than the 2-hour timeout to operationally stabilize the mysql cluster after the full power-down before I could unseal the vault, and ceph-osd was then unable to start the OSD processes because the vaultlocker-decrypt and ceph-volume services do not sit in an infinite (or long enough) retry state.

This timeout should either be charm-configurable, or there should be an action that brings ceph-osds online after the vault is unsealed, regardless of how long the host has been powered on.

My assumption in a charm-deployed environment is that I should not have to log in to ceph-osd units to manually start services, no matter how long it takes me to recover the health of all of the nodes that run mysql and vault.

Unfortunately, power cycling the osd nodes to work around this isn't favorable in scenarios where ceph-osd is co-located on hosts that run either mysql or vault.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think CEPH_VOLUME_SYSTEMD_TRIES can be set in /etc/default/ceph, and the systemd units can then pick the setting up from there. Assuming the upstream and package defaults stay as they are (do the defaults make sense?), it does seem this should be configurable via charm config so that logging into a unit isn't required.
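A sketch of what that could look like; the variable names match those discussed above, but the values are illustrative placeholders, not tested defaults:

```shell
# /etc/default/ceph (fragment)
# Hypothetical retry tuning to be picked up by the ceph-volume systemd
# units: retry activation up to 10000 times, 15 seconds apart.
CEPH_VOLUME_SYSTEMD_TRIES=10000
CEPH_VOLUME_SYSTEMD_INTERVAL=15
```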

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Triaging as medium for now as we have a work-around.

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → Medium