Charm should have option to tune how long vaultlocker and ceph-volume wait for vault unseal
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | Triaged | Medium | Unassigned |
Bug Description
This issue is related to lp#1804261, https://bugs.launchpad.net/charm-ceph-osd/+bug/1804261
On a bionic-ussuri deployment, I found that the timeout for ceph-volume is set to 2 hours via the systemd service file, and that the environment variables for retries and interval discussed in lp#1804261 are not set by either the packages or the charm.
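For reference, ceph-volume's systemd activation helper reads `CEPH_VOLUME_SYSTEMD_TRIES` and `CEPH_VOLUME_SYSTEMD_INTERVAL` to control its retry loop. A minimal sketch of a systemd drop-in that raises both the retry budget and the unit timeout might look like the following; the 24-hour figure is an arbitrary example, not a recommended value, and the drop-in path assumes the stock `ceph-volume@.service` unit name:

```
# /etc/systemd/system/ceph-volume@.service.d/override.conf
# Sketch only: keep ceph-volume retrying while Vault remains sealed.
[Service]
# Defaults are roughly 30 tries at 5 s intervals (~2.5 minutes).
# 2880 tries x 30 s is approximately 24 hours.
Environment=CEPH_VOLUME_SYSTEMD_TRIES=2880
Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=30
# Let the unit itself run that long before systemd gives up on it.
TimeoutStartSec=24h
```

After writing the drop-in, `systemctl daemon-reload` would be needed for it to take effect. Ideally the charm would render something equivalent from a config option rather than requiring operators to manage drop-ins by hand.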
I expect the ceph-osd charm, or the services it configures on the host, to be responsible for monitoring the vault's seal status and to allow either a configurable or an infinite timeout for the vault to come online to decrypt and start ceph-volumes and ceph-osds, so that operators do not have to ssh to cloud nodes after unsealing the vault in a full-datacenter power failure scenario.
In a test power-outage situation, it took longer than the 2-hour timeout to operationally stabilize the mysql cluster after a full power-down before I could unseal the vault, and by then ceph-osd could not start the OSD processes because the vaultlocker-decrypt and ceph-volume services were no longer in an infinite (or long enough) retry state.
This timeout should either be charm-configurable, or there should be an action that brings ceph-osds online after the vault is unsealed, regardless of how long the host has been powered on.
My assumption in a charm-deployed environment is that I should not have to log in to ceph-osd units to manually start services, no matter how long it takes me to recover the health of all of the nodes that run mysql and vault.
Unfortunately, power cycling the osd nodes to work around this isn't favorable in scenarios where ceph-osd is co-located on hosts that run either mysql or vault.
A workaround is mentioned in https://bugs.launchpad.net/charm-ceph-osd/+bug/1804261/comments/48