Ceph OSD units require a reboot if they boot before vault (and vault is not unsealed within 150s)

Bug #1804261 reported by Chris MacNaughton
This bug affects 5 people
Affects                 Status        Importance  Assigned to    Milestone
Ceph OSD Charm          Invalid       Undecided   Unassigned
Ubuntu Cloud Archive    Fix Released  Undecided   Unassigned
  Queens                Fix Released  Undecided   Unassigned
  Rocky                 Fix Released  Undecided   Unassigned
  Stein                 Fix Released  Undecided   Unassigned
  Train                 Fix Released  Undecided   Unassigned
  Ussuri                Fix Released  Undecided   Unassigned
ceph (Ubuntu)           Fix Released  High        dongdong tao
  Bionic                Fix Released  Undecided   Unassigned
  Disco                 Won't Fix     Undecided   Unassigned
  Eoan                  Fix Released  Undecided   Unassigned
  Focal                 Fix Released  High        dongdong tao

Bug Description

[Impact]
Configuration values read from environment variables (CEPH_VOLUME_SYSTEMD_TRIES and CEPH_VOLUME_SYSTEMD_INTERVAL) are used as strings rather than being converted to ints, which means that for certain deployment use cases the retry window for starting the ceph-volume systemd units cannot be increased to accommodate dependencies (such as Vault unsealing) starting first.

[Test Case]
1. Deploy ceph with vault for key management.
2. Set a systemd override for ceph-volume@:
   Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000
3. Seal the vault units (by restarting the vault service).
4. Reboot the ceph-osd machines - the Environment override is ignored as it is not correctly parsed.

[Regression Potential]
Low - this fix has been accepted upstream in later releases.

[Original Bug Report]
In a deployment where Ceph is encrypted and uses Vault as the key manager, if vault and ceph are both stopped, any OSDs on the affected unit(s) will require a further reboot if they try to start before vault is unsealed.

Revision history for this message
James Page (james-page) wrote :

That's definitely a bug - the systemd unit should spin until vault is unsealed, and then retrieve the keys and unlock the disks.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I suspect that we're running into systemd's backoff for a failed start

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I let a deployment sit after reaching this state (ceph-osd came up before vault was unsealed) for hours, then unsealed vault and waited more than 12 hours before giving up - the OSDs seem to reach a final timeout at some point and give up for good.

Revision history for this message
James Page (james-page) wrote :

Do the block devices unlock? If they do, it may be a timeout on the actual ceph-volume systemd units rather than in the vaultlocker-decrypt systemd unit which retrieves the key and chats with vault.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I'll do another run on it soon to confirm

Revision history for this message
James Page (james-page) wrote :

OK so I reproduced - AFAICT the vaultlocker systemd unit does execute in the end, but the ceph-volume units trigger and then fail as the LVs are not yet visible.

I thought they waited for some time but it would appear not.

Changed in charm-ceph-osd:
status: New → Confirmed
importance: Undecided → High
status: Confirmed → Triaged
milestone: none → 19.04
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I believe that they retry some number of times, but if it's too long of a wait, maybe they give up?

James Page (james-page)
Changed in charm-ceph-osd:
assignee: nobody → James Page (james-page)
status: Triaged → In Progress
tags: added: canonical-bootstack
Revision history for this message
James Page (james-page) wrote :

On my test deployment, a full cloud bounce resulted in the vaultlocker and ceph-volume tasks spinning and blocking boot; the design intent was that they should not do this and should instead spin in the background rather than block.

Revision history for this message
James Page (james-page) wrote :

The vaultlocker systemd units are of Type=oneshot, which means that systemd won't consider the unit started until the subprocess completes.

Switching to Type=simple means they won't block (more fire-and-forget), but that would break anything relying on fstab-style systemd ordering dependencies on these units.

Revision history for this message
James Page (james-page) wrote :

Comments #8->#10 relate to the previous linked bug; I realised these are two different issues.

Revision history for this message
James Page (james-page) wrote :

ceph-volume systemd tasks seem to fail quicker than the 10000 second timeout configured.

Revision history for this message
James Page (james-page) wrote :

GROKing the code:

    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)

30 x 5 = 150 seconds until ceph-volume gives up.
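
(For illustration, a simplified sketch of the retry pattern being described - this is not the actual ceph-volume source; wait_for_device and device_ready are hypothetical stand-ins for what the unit does on each attempt:)

    import os
    import time

    def wait_for_device(device_ready):
        # With nothing exported, the integer defaults apply: 30 tries x 5 s = 150 s.
        tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
        interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)
        for _ in range(tries):
            if device_ready():
                return True
            time.sleep(interval)
        return False

Note that when the environment variables are set (for example via a systemd Environment= override), os.environ.get returns strings rather than ints, which is the parsing problem picked up later in this thread.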

Revision history for this message
James Page (james-page) wrote :

vs 10000 for vaultlocker.

Revision history for this message
James Page (james-page) wrote :

Once vault has been unsealed, it should be possible to do:

   juju run --application ceph-osd 'sudo systemctl restart ceph-volume@*'

This will re-trigger the initialisation of the OSDs.

Revision history for this message
James Page (james-page) wrote :

Scrub #15 - apparently wildcarding on that unit does not work.

Revision history for this message
James Page (james-page) wrote :

I'd suggest we just increase the number of re-tries to allow operators more time to unseal vault in the event of a full site outage.

summary: - Ceph OSD unit requires reboot if it boots before vault
+ Ceph OSD units requires reboot if they boot before vault (and if not
+ unsealed with 150s)
James Page (james-page)
Changed in charm-ceph-osd:
status: In Progress → Triaged
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
Revision history for this message
Janghoon-Paul Sim (janghoon) wrote :

Could you please confirm that the 19.10 release includes a fix for this bug?

tags: added: sts
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

@janghoon: the 19.10 release will come out in October. I believe that some work may have been done to help resolve this in vaultlocker already but I'll see if I can replicate again

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

To confirm that this is still an issue as of the 19.07 charm release, I followed the below series of steps, reproduced with their results:

$ juju run --all "sudo reboot"

# everything goes down:

$ juju status
Model Controller Cloud/Region Version SLA Timestamp
ceph-encryption-test icey-serverstack serverstack/serverstack 2.6.5 unsupported 14:32:04Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.12 active 0/3 ceph-mon jujucharms 42 ubuntu
ceph-osd 12.2.12 active 0/3 ceph-osd jujucharms 291 ubuntu
percona-cluster 5.7.20-29.24 active 0/1 percona-cluster jujucharms 279 ubuntu
vault 1.1.1 active 0/1 vault jujucharms 29 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* unknown lost 3 10.5.0.15 agent lost, see 'juju show-status-log ceph-mon/0'
ceph-mon/1 unknown lost 4 10.5.0.6 agent lost, see 'juju show-status-log ceph-mon/1'
ceph-mon/2 unknown lost 5 10.5.0.11 agent lost, see 'juju show-status-log ceph-mon/2'
ceph-osd/0* unknown lost 0 10.5.0.30 agent lost, see 'juju show-status-log ceph-osd/0'
ceph-osd/1 unknown lost 1 10.5.0.26 agent lost, see 'juju show-status-log ceph-osd/1'
ceph-osd/2 unknown lost 2 10.5.0.16 agent lost, see 'juju show-status-log ceph-osd/2'
percona-cluster/0* unknown lost 7 10.5.0.5 3306/tcp agent lost, see 'juju show-status-log percona-cluster/0'
vault/0* unknown lost 6 10.5.0.25 8200/tcp agent lost, see 'juju show-status-log vault/0'

Machine State DNS Inst id Series AZ Message
0 down 10.5.0.30 cd988b6e-ca77-4e1c-8bc4-8816154a7775 bionic nova ACTIVE
1 down 10.5.0.26 4eb88c74-4c80-4715-b826-baf6e3cdf77c bionic nova ACTIVE
2 down 10.5.0.16 2eb713e9-8d2c-4fe9-bc85-70aceb20ca03 bionic nova ACTIVE
3 down 10.5.0.15 7f6521cf-f412-47d8-95d3-50edce13f21d bionic nova ACTIVE
4 down 10.5.0.6 0d55d065-a063-4124-8fa5-46ae48b65fb1 bionic nova ACTIVE
5 down 10.5.0.11 bfc25bb4-e208-49e0-87af-594100f2a2f0 bionic nova ACTIVE
6 down 10.5.0.25 f7c609a8-67b7-46dd-bf1a-fe921f23ea26 bionic nova ACTIVE
7 down 10.5.0.5 d1122afd-32b1-421b-b8e2-4147100ec217 bionic nova ACTIVE

# Wait a while, services evolve into:

Model Controller Cloud/Region Version SLA Timestamp
ceph-encryption-test icey-serverstack serverstack/serverstack 2.6.5 unsupported 14:48:00Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.12 active 3 ceph-mon jujucharms 42 ubuntu
ceph-osd 12.2.12 blocked 3 ceph-osd jujucharms 291 ubuntu...


Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

and, because of how badly launchpad formats the plain text, https://pastebin.ubuntu.com/p/rrwkWjGwdt/ is a pastebin link, and attached is the plaintext version

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
tags: added: cold-start
James Page (james-page)
Changed in charm-ceph-osd:
assignee: James Page (james-page) → nobody
Revision history for this message
James Page (james-page) wrote :

Keying into:

    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)

might work - however the return type of .get is a str, and the subsequent code then assumes tries is an int, which will just explode...

so some fixes are probably needed in ceph itself for this timeout to be increased effectively.
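
(For illustration, a minimal sketch of the failure mode and of the int conversion the eventual fix applies; the real patch is the one referenced in the changelogs below, this is not the actual ceph-volume source:)

    import os

    os.environ['CEPH_VOLUME_SYSTEMD_TRIES'] = '2000'  # e.g. set via a systemd Environment= override

    # Problem: the fallback default is an int, but an exported value arrives as a str,
    # so later code that does range(tries) or arithmetic on it blows up or misbehaves.
    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    print(type(tries))       # <class 'str'>

    # Fix: coerce both the override and the default to int.
    tries = int(os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30))
    interval = int(os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5))
    print(tries * interval)  # 2000 * 5 = 10000 seconds of retrying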

dongdong tao (taodd)
Changed in charm-ceph-osd:
assignee: nobody → dongdong tao (taodd)
Revision history for this message
dongdong tao (taodd) wrote :

I was able to reproduce this bug, and what I've found exactly matches James's finding.

When the vault node is restarted, vault comes back up sealed by default. If we then reboot the osd node, vaultlocker's systemd unit is not able to decrypt the osd block device (because it cannot get the key from a sealed vault server), which makes the ceph-volume systemd unit fail to discover the expected osd logical volume, so ceph-osd is never started by ceph-volume.
From the vaultlocker and ceph-volume logs I can see that vaultlocker retries for 10000 seconds, but ceph-volume gives up far too soon, after about 150 seconds. We need to give the operator a longer time to unseal vault, so ceph-volume needs to keep trying for longer.

To verify if this will work, I've done an experiment.
1. Changed the default CEPH_VOLUME_SYSTEMD_TRIES to 2000 on one osd unit and left the other 2 osds unchanged.

2. Rebooted all the vault and osd units.

3. Unsealed the vault after 1 hour.

The result is that only the osd with the override came back up after unsealing the vault; the other osd nodes all had to be restarted to bring them back into the cluster.

I think we should change the default value of CEPH_VOLUME_SYSTEMD_TRIES to 2000 to match the timeout value.
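
(For reference: 2000 tries x the default 5-second interval = 10000 seconds, matching the window vaultlocker already retries for.)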

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Confirmed issue still exists with latest vaultlocker (1.0.3-0ubuntu1.18.10.1~ubuntu18.04.1) and 19.10 charms.

Revision history for this message
dongdong tao (taodd) wrote :

I forgot to paste the comment here.
Two ceph tracker issues were opened and the corresponding PR sent:
https://tracker.ceph.com/issues/43187
https://tracker.ceph.com/issues/43186

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Side note, I was initially unable to manually recover because I was restarting the wrong ceph-volume service:

root@cephtest:~# systemctl -a| grep ceph-volume
  <email address hidden>
          loaded activating start start Ceph Volume activation: bbfc0235-f8fd-458b-9c3d-21803b72f4bc
  <email address hidden>
          loaded inactive dead Ceph Volume activation: lvm-2-bbfc0235-f8fd-458b-9c3d-21803b72f4bc

i.e. there are two, and it is the lvm-* one that needs restarting (I tried to restart the other, which didn't work).

Changed in charm-ceph-osd:
assignee: dongdong tao (taodd) → nobody
status: Triaged → Invalid
importance: High → Undecided
Changed in ceph (Ubuntu):
importance: Undecided → High
assignee: nobody → dongdong tao (taodd)
James Page (james-page)
Changed in ceph (Ubuntu Focal):
status: New → Fix Released
Changed in ceph (Ubuntu Disco):
status: New → Won't Fix
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

What is the test case for verification of this bug as part of the SRU? Since this bug is called out as being fixed by the new upload, please include the basic SRU information such as Test Case and Regression Potential. Thank you!

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Chris, or anyone else affected,

Accepted ceph into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
James Page (james-page)
description: updated
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Is this bug also present in eoan? If yes, it would be good to have a fix scheduled there too.

Changed in ceph (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
Revision history for this message
James Page (james-page) wrote :

@sil2100 - yes this bug was present in the eoan version of ceph - the fix is included in the update to 14.2.8 covered under bug 1861789 so I elected not to cover it specifically under this bug.

Changed in ceph (Ubuntu Eoan):
status: New → Fix Committed
Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 13.2.8-0ubuntu0.18.10.1~cloud0
---------------

 ceph (13.2.8-0ubuntu0.18.10.1~cloud0) bionic; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive (LP: #1864514).
   * d/p/bug1804261.patch: Cherry pick fix to ensure that ceph-volume
     tries and interval environment variables are converted to int
     (LP: #1804261).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 13.2.8-0ubuntu0.19.04.1~cloud1
---------------

 ceph (13.2.8-0ubuntu0.19.04.1~cloud1) bionic; urgency=medium
 .
   * d/p/bug1804261.patch: Cherry pick fix to ensure that ceph-volume
     tries and interval environment variables are converted to int
     (LP: #1804261).
   * New upstream release (LP: #1864514).

James Page (james-page)
Changed in ceph (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Bionic/Queens is currently blocked on a potential regression in bug 1871820

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Chris, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
dongdong tao (taodd) wrote :

Hi All,

I can confirm this release fixes the bug. I used the steps below to test:

1. Deployed a ceph cluster with vault
2. Upgraded all the ceph packages to 12.2.13 from bionic-proposed
3. Added "Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000" in /lib/systemd/system/ceph-volume@.service on some osd nodes
4. Rebooted the vault node, then rebooted the osd nodes
5. Waited for half an hour
6. Unsealed vault
7. All the osd nodes with the "Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000" setting came online automatically, while the osds without it did not, which is expected.

-Cheers

tags: added: verification-done-bionic verification-rocky-done verification-stein-done
removed: verification-needed-bionic verification-rocky-needed verification-stein-needed
Revision history for this message
James Page (james-page) wrote :

This fix needs re-verification in bionic-proposed, as a further change was added and the binaries have been rebuilt.

Revision history for this message
dongdong tao (taodd) wrote :

I have verified the fix in bionic-proposed and confirm that it fixes this issue.
The test steps I performed:
1. Deployed a ceph cluster with vault
2. Upgraded some of the osds to 12.2.13
3. Added "Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000" in /lib/systemd/system/ceph-volume@.service for all osds
4. Rebooted vault first, then rebooted all osds
5. Waited for about 1.5 hours
6. All osds on 12.2.13 came up, while the other osds on 12.2.12 remained blocked

Cheers!

tags: added: verification-done verification-queens-done
removed: verification-needed verification-queens-needed
Revision history for this message
dongdong tao (taodd) wrote :

Just to clarify and avoid confusion: in the above comment, at step 5, I meant wait for about 1.5 hours and then unseal the vault.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 12.2.13-0ubuntu0.18.04.2

---------------
ceph (12.2.13-0ubuntu0.18.04.2) bionic; urgency=medium

  * d/p/bug1871820.patch: Revert change in default concurrency for
    rocksdb background compactions to avoid potential data loss
    (LP: #1871820).

ceph (12.2.13-0ubuntu0.18.04.1) bionic; urgency=medium

  * New upstream point release (LP: #1861793).
  * d/p/bug1847544.patch,ceph-volume-wait-for-lvs.patch,dont-validate-fs-
    caps-on-authorize.patch,issue37490.patch,issue38454.patch,rgw-gc-use-
    aio.patch: Drop, all included in upstream release.
  * d/p/*: Refresh as needed.
  * d/p/bug1804261.patch: Cherry pick fix to ensure that ceph-volume
    tries and interval environment variables are converted to int
    (LP: #1804261).

 -- James Page <email address hidden> Tue, 19 May 2020 08:40:13 +0100

Changed in ceph (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Drew Freiberger (afreiberger) wrote :

This is still an issue with bionic-ussuri ceph 15.2.3-0ubuntu0.20.04.2~cloud0

Revision history for this message
Drew Freiberger (afreiberger) wrote :

The bionic-ussuri package has the retries set to 10000. The time from machine start to vault unseal was about 18 hours in my case. We should have this set so machines can heal for up to 5 days after machine start.
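(If that 10000 is the number of tries, at the default 5-second interval it gives roughly 50000 seconds, about 14 hours; if it is seconds, even less. Either way the window is shorter than an 18-hour gap, which would explain the OSDs giving up.)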

I'm almost wondering if vaultlocker-decrypt also needs the retries increased as well.

Here's a workaround I've found for anyone experiencing this operationally:

After unsealing the vault, loop over the ceph-osd units with the following two loops to decrypt the block devices and start the LVM activation units so that the ceph-osd services can start up:

for i in $(ls /etc/systemd/system/multi-user.target.wants/vaultlocker-decrypt@*|cut -d/ -f6); do sudo systemctl start $i; done
for i in $(ls /etc/systemd/system/multi-user.target.wants/ceph-volume@*|cut -d/ -f6); do sudo systemctl start $i; done

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think CEPH_VOLUME_SYSTEMD_TRIES can be set in /etc/default/ceph and then systemd units can pick the setting up from there. A charm change is being tracked at LP: #1897777. I think we can move the discussion to that bug unless there are thoughts that the upstream or package defaults should be different.
