Ceph OSD units require a reboot if they boot before vault (and vault is not unsealed within 150s)

Bug #1804261 reported by Chris MacNaughton
This bug affects 5 people
Affects                 Status        Importance  Assigned to    Milestone
Ceph OSD Charm          Invalid       Undecided   Unassigned
Ubuntu Cloud Archive    Fix Released  Undecided   Unassigned
  Queens                Fix Released  Undecided   Unassigned
  Rocky                 Fix Released  Undecided   Unassigned
  Stein                 Fix Released  Undecided   Unassigned
  Train                 Fix Released  Undecided   Unassigned
  Ussuri                Fix Released  Undecided   Unassigned
ceph (Ubuntu)           Fix Released  High        dongdong tao
  Bionic                Fix Released  Undecided   Unassigned
  Disco                 Won't Fix     Undecided   Unassigned
  Eoan                  Fix Released  Undecided   Unassigned
  Focal                 Fix Released  High        dongdong tao

Bug Description

[Impact]
Configuration values read from environment variables (CEPH_VOLUME_SYSTEMD_TRIES and CEPH_VOLUME_SYSTEMD_INTERVAL) are used as strings rather than being converted to ints, which means that for certain deployment use cases the retry window for starting the ceph-volume systemd units cannot be increased to accommodate dependencies (such as Vault unsealing) starting first.

[Test Case]
1. Deploy ceph with vault for key management.
2. Set a systemd override for ceph-volume@:
   Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000
3. Seal the vault units (by restarting the vault service).
4. Reboot the ceph-osd machines - the Environment override is ignored as it is not correctly parsed.

[Regression Potential]
Low - this fix has been accepted upstream in later releases.

[Original Bug Report]
In a deployment where Ceph is encrypted and uses Vault as the key manager, if vault and ceph are both stopped, any OSDs on the affected unit(s) will require a further reboot if they try to start before vault is unsealed.

Revision history for this message
James Page (james-page) wrote :

That's definitely a bug - the systemd unit should spin until vault is unsealed, and then retrieve the keys and unlock the disks.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I suspect that we're running into systemd's backoff for a failed start

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I let a deployment sit after reaching this state (ceph-osd came up before vault was unsealed) for hours, then unsealed vault and waited more than 12 hours before giving up - the OSDs seem to reach a final timeout at some point and give up for good.

Revision history for this message
James Page (james-page) wrote :

Do the block devices unlock? If they do, it may be a timeout on the actual ceph-volume systemd units rather than in the vaultlocker-decrypt systemd unit which retrieves the key and chats with vault.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I'll do another run on it soon to confirm

Revision history for this message
James Page (james-page) wrote :

OK so I reproduced - AFAICT the vaultlocker systemd unit does execute in the end, but the ceph-volume units trigger and then fail as the LVs are not yet visible.

I thought they waited for some time but it would appear not.

Changed in charm-ceph-osd:
status: New → Confirmed
importance: Undecided → High
status: Confirmed → Triaged
milestone: none → 19.04
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I believe that they retry some number of times, but if it's too long of a wait, maybe they give up?

James Page (james-page)
Changed in charm-ceph-osd:
assignee: nobody → James Page (james-page)
status: Triaged → In Progress
tags: added: canonical-bootstack
Revision history for this message
James Page (james-page) wrote :

On my test deployment, a full cloud bounce resulted in the vaultlocker and ceph-volume tasks spinning and blocking boot; the design intent was that they should not do this and should instead spin in the background rather than block.

Revision history for this message
James Page (james-page) wrote :

The vaultlocker systemd units are of Type=oneshot, which means that systemd won't consider the unit started until the subprocess completes.

Switching to Type=simple means they won't block (more fire-and-forget), but that would break anything relying on fstab-style systemd ordering dependencies on these units.

Revision history for this message
James Page (james-page) wrote :

Comments #8->#10 relate to the previous linked bug; I realised these are two different issues.

Revision history for this message
James Page (james-page) wrote :

ceph-volume systemd tasks seem to fail quicker than the 10000 second timeout configured.

Revision history for this message
James Page (james-page) wrote :

GROKing the code:

    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)

30 x 5 = 150 seconds until ceph-volume gives up.
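
(For illustration, a simplified sketch of the retry pattern being described - this is not the actual ceph-volume source; wait_for_device and device_ready are hypothetical stand-ins for what the unit does on each attempt:)

    import os
    import time

    def wait_for_device(device_ready):
        # With nothing exported, the integer defaults apply: 30 tries x 5 s = 150 s.
        tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
        interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)
        for _ in range(tries):
            if device_ready():
                return True
            time.sleep(interval)
        return False

Note that when the environment variables are set (for example via a systemd Environment= override), os.environ.get returns strings rather than ints, which is the parsing problem picked up later in this thread.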

Revision history for this message
James Page (james-page) wrote :

vs 10000 for vaultlocker.

Revision history for this message
James Page (james-page) wrote :

Once vault has been unsealed, it should be possible to do:

   juju run --application ceph-osd 'sudo systemctl restart ceph-volume@*'

This will re-trigger the initialisation of the OSDs.

Revision history for this message
James Page (james-page) wrote :

Scrub #15 - apparently wildcarding on that unit does not work.

Revision history for this message
James Page (james-page) wrote :

I'd suggest we just increase the number of re-tries to allow operators more time to unseal vault in the event of a full site outage.

summary: - Ceph OSD unit requires reboot if it boots before vault
+ Ceph OSD units requires reboot if they boot before vault (and if not
+ unsealed with 150s)
James Page (james-page)
Changed in charm-ceph-osd:
status: In Progress → Triaged
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
Revision history for this message
Janghoon-Paul Sim (janghoon) wrote :

Could you please confirm that the 19.10 release includes a fix for this bug?

tags: added: sts
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

@janghoon: the 19.10 release will come out in October. I believe that some work may have been done to help resolve this in vaultlocker already but I'll see if I can replicate again

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

To confirm that this is still an issue as of the 19.07 charm release, I followed the below series of steps, reproduced with their results:

$ juju run --all "sudo reboot"

# everything goes down:

$ juju status
Model Controller Cloud/Region Version SLA Timestamp
ceph-encryption-test icey-serverstack serverstack/serverstack 2.6.5 unsupported 14:32:04Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.12 active 0/3 ceph-mon jujucharms 42 ubuntu
ceph-osd 12.2.12 active 0/3 ceph-osd jujucharms 291 ubuntu
percona-cluster 5.7.20-29.24 active 0/1 percona-cluster jujucharms 279 ubuntu
vault 1.1.1 active 0/1 vault jujucharms 29 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* unknown lost 3 10.5.0.15 agent lost, see 'juju show-status-log ceph-mon/0'
ceph-mon/1 unknown lost 4 10.5.0.6 agent lost, see 'juju show-status-log ceph-mon/1'
ceph-mon/2 unknown lost 5 10.5.0.11 agent lost, see 'juju show-status-log ceph-mon/2'
ceph-osd/0* unknown lost 0 10.5.0.30 agent lost, see 'juju show-status-log ceph-osd/0'
ceph-osd/1 unknown lost 1 10.5.0.26 agent lost, see 'juju show-status-log ceph-osd/1'
ceph-osd/2 unknown lost 2 10.5.0.16 agent lost, see 'juju show-status-log ceph-osd/2'
percona-cluster/0* unknown lost 7 10.5.0.5 3306/tcp agent lost, see 'juju show-status-log percona-cluster/0'
vault/0* unknown lost 6 10.5.0.25 8200/tcp agent lost, see 'juju show-status-log vault/0'

Machine State DNS Inst id Series AZ Message
0 down 10.5.0.30 cd988b6e-ca77-4e1c-8bc4-8816154a7775 bionic nova ACTIVE
1 down 10.5.0.26 4eb88c74-4c80-4715-b826-baf6e3cdf77c bionic nova ACTIVE
2 down 10.5.0.16 2eb713e9-8d2c-4fe9-bc85-70aceb20ca03 bionic nova ACTIVE
3 down 10.5.0.15 7f6521cf-f412-47d8-95d3-50edce13f21d bionic nova ACTIVE
4 down 10.5.0.6 0d55d065-a063-4124-8fa5-46ae48b65fb1 bionic nova ACTIVE
5 down 10.5.0.11 bfc25bb4-e208-49e0-87af-594100f2a2f0 bionic nova ACTIVE
6 down 10.5.0.25 f7c609a8-67b7-46dd-bf1a-fe921f23ea26 bionic nova ACTIVE
7 down 10.5.0.5 d1122afd-32b1-421b-b8e2-4147100ec217 bionic nova ACTIVE

# Wait a while, services evolve into:

Model Controller Cloud/Region Version SLA Timestamp
ceph-encryption-test icey-serverstack serverstack/serverstack 2.6.5 unsupported 14:48:00Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.12 active 3 ceph-mon jujucharms 42 ubuntu
ceph-osd 12.2.12 blocked 3 ceph-osd jujucharms 291 ubuntu...


Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

and, because of how badly launchpad formats the plain text, https://pastebin.ubuntu.com/p/rrwkWjGwdt/ is a pastebin link, and attached is the plaintext version

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
tags: added: cold-start
James Page (james-page)
Changed in charm-ceph-osd:
assignee: James Page (james-page) → nobody
Revision history for this message
James Page (james-page) wrote :

Keying into:

    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)

might work - however the return type of .get is a str, and the subsequent code then assumes tries is an int, which will just explode...

so some fixes are probably needed in ceph itself for this timeout to be increased effectively.
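
(For illustration, a minimal sketch of the failure mode and of the int conversion the eventual fix applies; the real patch is the one referenced in the changelogs below, this is not the actual ceph-volume source:)

    import os

    os.environ['CEPH_VOLUME_SYSTEMD_TRIES'] = '2000'  # e.g. set via a systemd Environment= override

    # Problem: the fallback default is an int, but an exported value arrives as a str,
    # so later code that does range(tries) or arithmetic on it blows up or misbehaves.
    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    print(type(tries))       # <class 'str'>

    # Fix: coerce both the override and the default to int.
    tries = int(os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30))
    interval = int(os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5))
    print(tries * interval)  # 2000 * 5 = 10000 seconds of retrying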

dongdong tao (taodd)
Changed in charm-ceph-osd:
assignee: nobody → dongdong tao (taodd)
Revision history for this message
dongdong tao (taodd) wrote :

I was able to reproduce this bug, and what I've found exactly matches James's finding.

When the vault node is restarted, vault comes back up sealed by default. If we then reboot the osd node, vaultlocker's systemd unit is not able to decrypt the osd block device (because it cannot get the key from a sealed vault server), which makes the ceph-volume systemd unit fail to discover the expected osd logical volume, so ceph-osd is never started by ceph-volume.
From the vaultlocker and ceph-volume logs I can see that vaultlocker retries for 10000 seconds, but ceph-volume gives up far too soon, after about 150 seconds. We need to give the operator a longer time to unseal vault, so ceph-volume needs to keep trying for longer.

To verify if this will work, I've done an experiment.
1. Changed the default CEPH_VOLUME_SYSTEMD_TRIES to 2000 on one osd unit and left the other 2 osds unchanged.

2. Rebooted all the vault and osd units.

3. Unsealed the vault after 1 hour.

The result is that only the osd with the override came back up after unsealing the vault; the other osd nodes all had to be restarted to bring them back into the cluster.

I think we should change the default value of CEPH_VOLUME_SYSTEMD_TRIES to 2000 to match the timeout value.
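
(For reference: 2000 tries x the default 5-second interval = 10000 seconds, matching the window vaultlocker already retries for.)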

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Confirmed issue still exists with latest vaultlocker (1.0.3-0ubuntu1.18.10.1~ubuntu18.04.1) and 19.10 charms.

Revision history for this message
dongdong tao (taodd) wrote :

I forgot to paste the comment here.
Two ceph tracker issues were opened and the corresponding PR sent:
https://tracker.ceph.com/issues/43187
https://tracker.ceph.com/issues/43186

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Side note, I was initially unable to manually recover because I was restarting the wrong ceph-volume service:

root@cephtest:~# systemctl -a| grep ceph-volume
  <email address hidden>
          loaded activating start start Ceph Volume activation: bbfc0235-f8fd-458b-9c3d-21803b72f4bc
  <email address hidden>
          loaded inactive dead Ceph Volume activation: lvm-2-bbfc0235-f8fd-458b-9c3d-21803b72f4bc

i.e. there are two, and it is the lvm-* one that needs restarting (I tried to restart the other, which didn't work).

Changed in charm-ceph-osd:
assignee: dongdong tao (taodd) → nobody
status: Triaged → Invalid
importance: High → Undecided
Changed in ceph (Ubuntu):
importance: Undecided → High
assignee: nobody → dongdong tao (taodd)
James Page (james-page)
Changed in ceph (Ubuntu Focal):
status: New → Fix Released
Changed in ceph (Ubuntu Disco):
status: New → Won't Fix
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

What is the test case for verification of this bug as part of the SRU? Since this bug is called out as being fixed by the new upload, please include the basic SRU information such as Test Case and Regression Potential. Thank you!

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Chris, or anyone else affected,

Accepted ceph into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
James Page (james-page)
description: updated
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Is this bug also present in eoan? If yes, it would be good to have a fix scheduled there too.

Changed in ceph (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
Revision history for this message
James Page (james-page) wrote :

@sil2100 - yes this bug was present in the eoan version of ceph - the fix is included in the update to 14.2.8 covered under bug 1861789 so I elected not to cover it specifically under this bug.

Changed in ceph (Ubuntu Eoan):
status: New → Fix Committed
Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
James Page (james-page) wrote :

Hello Chris, or anyone else affected,

Accepted ceph into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 13.2.8-0ubuntu0.18.10.1~cloud0
---------------

 ceph (13.2.8-0ubuntu0.18.10.1~cloud0) bionic; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive (LP: #1864514).
   * d/p/bug1804261.patch: Cherry pick fix to ensure that ceph-volume
     tries and interval environment variables are converted to int
     (LP: #1804261).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 13.2.8-0ubuntu0.19.04.1~cloud1
---------------

 ceph (13.2.8-0ubuntu0.19.04.1~cloud1) bionic; urgency=medium
 .
   * d/p/bug1804261.patch: Cherry pick fix to ensure that ceph-volume
     tries and interval environment variables are converted to int
     (LP: #1804261).
   * New upstream release (LP: #1864514).

James Page (james-page)
Changed in ceph (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Bionic/Queens is currently blocked on a potential regression in bug 1871820

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Chris, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
dongdong tao (taodd) wrote :

Hi All,

I can confirm this release fixes the bug. I used the steps below to test:

1. Deployed a ceph cluster with vault
2. Upgraded all the ceph packages to 12.2.13 from bionic-proposed
3. Added "Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000" in /lib/systemd/system/ceph-volume@.service on some osd nodes
4. Rebooted the vault node, then rebooted the osd nodes
5. Waited for half an hour
6. Unsealed vault
7. All the osd nodes with the "Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000" setting came online automatically, while the osds without it did not, which is expected.

-Cheers

tags: added: verification-done-bionic verification-rocky-done verification-stein-done
removed: verification-needed-bionic verification-rocky-needed verification-stein-needed
Revision history for this message
James Page (james-page) wrote :

This fix needs re-verification in bionic-proposed, as a further change was added and the binaries have been rebuilt.

Revision history for this message
dongdong tao (taodd) wrote :

I have verified the fix in bionic-proposed and confirm that it fixes this issue.
The test steps I performed:
1. Deployed a ceph cluster with vault
2. Upgraded some of the osds to 12.2.13
3. Added "Environment=CEPH_VOLUME_SYSTEMD_TRIES=2000" in /lib/systemd/system/ceph-volume@.service for all osds
4. Rebooted vault first, then rebooted all osds
5. Waited for about 1.5 hours
6. All osds on 12.2.13 came up, while the other osds on 12.2.12 remained blocked

Cheers!

tags: added: verification-done verification-queens-done
removed: verification-needed verification-queens-needed
Revision history for this message
dongdong tao (taodd) wrote :

Just to clarify and avoid confusion: in the above comment, at step 5, I meant wait for about 1.5 hours and then unseal the vault.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 12.2.13-0ubuntu0.18.04.2

---------------
ceph (12.2.13-0ubuntu0.18.04.2) bionic; urgency=medium

  * d/p/bug1871820.patch: Revert change in default concurrency for
    rocksdb background compactions to avoid potential data loss
    (LP: #1871820).

ceph (12.2.13-0ubuntu0.18.04.1) bionic; urgency=medium

  * New upstream point release (LP: #1861793).
  * d/p/bug1847544.patch,ceph-volume-wait-for-lvs.patch,dont-validate-fs-
    caps-on-authorize.patch,issue37490.patch,issue38454.patch,rgw-gc-use-
    aio.patch: Drop, all included in upstream release.
  * d/p/*: Refresh as needed.
  * d/p/bug1804261.patch: Cherry pick fix to ensure that ceph-volume
    tries and interval environment variables are converted to int
    (LP: #1804261).

 -- James Page <email address hidden> Tue, 19 May 2020 08:40:13 +0100

Changed in ceph (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Drew Freiberger (afreiberger) wrote :

This is still an issue with bionic-ussuri ceph 15.2.3-0ubuntu0.20.04.2~cloud0

Revision history for this message
Drew Freiberger (afreiberger) wrote :

The bionic-ussuri package has the retries set to 10000. The time from machine start to vault unseal was about 18 hours in my case. We should have this set so machines can heal for up to 5 days after machine start.
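(If that 10000 is the number of tries, at the default 5-second interval it gives roughly 50000 seconds, about 14 hours; if it is seconds, even less. Either way the window is shorter than an 18-hour gap, which would explain the OSDs giving up.)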

I'm almost wondering if vaultlocker-decrypt also needs the retries increased as well.

Here's a workaround I've found for anyone experiencing this operationally:

After unsealing the vault, loop over the ceph-osd units with the following two loops to decrypt the block devices and start the LVM activation units so that the ceph-osd services can start up:

for i in $(ls /etc/systemd/system/multi-user.target.wants/vaultlocker-decrypt@*|cut -d/ -f6); do sudo systemctl start $i; done
for i in $(ls /etc/systemd/system/multi-user.target.wants/ceph-volume@*|cut -d/ -f6); do sudo systemctl start $i; done

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think CEPH_VOLUME_SYSTEMD_TRIES can be set in /etc/default/ceph and then systemd units can pick the setting up from there. A charm change is being tracked at LP: #1897777. I think we can move the discussion to that bug unless there are thoughts that the upstream or package defaults should be different.
