config_changed->check_for_upgrade incorrectly assumes upgrade in progress if OSD not yet mounted / ceph-disk@ has not started

Bug #1779828 reported by Trent Lloyd
This bug affects 7 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: High
Assigned to: Trent Lloyd

Bug Description

The charm sometimes gets stuck waiting for another host to upgrade, even when an upgrade is not actually in progress. You will observe "Waiting on <other_host> to finish upgrading" in the juju unit status, and the config-changed hook stays executing for a very long time.

The hook stays running indefinitely (days or weeks) and blocks the charm from any further actions. The code appears to have a 10 minute timeout, but this was observed not to apply in practice. (Perhaps the timeout was added after the version in use on some sites?)

 = Analysis of the cause =
On startup of a busy server with a large number of OSDs, the ceph-disk processes sometimes fail to start in a timely manner, or at all.

There are a number of causes for this, which will be filed in a separate bug, but it comes down to the following observed situations:
 (a) They are starting but have not yet finished before config_changed fires - particularly likely on hyper-converged nodes (this is a general race condition where the charm assumes OSDs are always mounted without actually checking)
 (b) They time out while starting due to the aggressive 120 second timeout, which is sometimes not enough (this is a bug)
 (c) They fail while starting due to "ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid be6f1d22-6abb-11e8-a411-00163e11b3c4" (this is a bug)

The root cause is mostly irrelevant, however: if for any reason the Ceph OSD filesystem is not mounted when config_changed is called, the charm incorrectly assumes that an upgrade to Jewel is in progress but previously failed, and then attempts to resume it. It then gets stuck waiting on another unit to signal that its upgrade has completed - but that never happens, since the other unit is not in an upgrading state.

The cause is that check_for_upgrade calls ceph.dirs_need_ownership_update, which checks whether any directory in /var/lib/ceph/osd/ is owned by root:root. If so, it reports that an upgrade needs to be resumed (resuming_upgrade) and then wants to chown the directories from root to the ceph user.

However, this condition is also true if the OSD simply isn't mounted yet, as the underlying mountpoint directory exists and is owned by root:root (the visible ownership would only change to that of the Ceph filesystem once it is mounted).
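
For illustration, here is a minimal Python sketch of that kind of ownership check (this is not the charm's actual implementation; the path and function name are assumptions for the example). It shows why a bare, unmounted mountpoint is indistinguishable from an OSD directory whose chown to ceph never completed:

import os
import pwd
import grp

OSD_BASE = '/var/lib/ceph/osd'

def osd_dirs_owned_by_root():
    """Simplified illustration of a dirs_need_ownership_update-style check:
    root:root ownership is taken to mean the chown to ceph:ceph from the
    pre-Jewel layout never completed, so an upgrade must be resumed."""
    suspect = []
    for entry in os.listdir(OSD_BASE):
        path = os.path.join(OSD_BASE, entry)
        if not os.path.isdir(path):
            continue
        st = os.stat(path)
        owner = pwd.getpwuid(st.st_uid).pw_name
        group = grp.getgrgid(st.st_gid).gr_name
        # False positive: if ceph-disk@ has not mounted this OSD yet, 'path'
        # is just the bare mountpoint on the root filesystem, which is also
        # root:root, so it is flagged even though no upgrade was interrupted.
        if (owner, group) == ('root', 'root'):
            suspect.append(path)
    return suspect

Any non-empty result is what sends the charm down the upgrade-resume path described next.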

When this condition occurs, ceph.roll_osd_cluster is called, which then calls into wait_on_previous_node and waits (and you will observe a message like "Waiting on <other_host> to finish upgrading" in the juju status).

The workaround is to restart any failed ceph-disk@xxxx units, ensure the relevant ceph-osd@NN service then starts, and then kill the config-changed hook. The next time it runs, it will work as expected.

In addition to this issue, it was observed in production that even after a longer period of time (20-30 minutes) the hook was still executing and stuck. It is not clear why the 10 minute timeout did not take effect.

This occurred on a production cloud, and I also triggered it once in my test bench where I have 4 VMs using an openstack-base setup. Unfortunately I do not currently have logs from either case, but the above describes the scenario fairly accurately.

Lastly, I tried to create an easy reproducer for this scenario using the latest charm; however, when unmounting the filesystem you now seem to trigger "Non-pristine devices detected" instead, due to the recent changes related to that. I did nonetheless manage to trigger this issue with versions of the charm both with and without that code, so the underlying issue should still be fixed - although this change may mask it in most cases.

Additionally, one of the items I raised at the start of the bug is the general race where the charm assumes all OSDs are mounted when a hook runs, which may not be the case. More general changes are needed to fix logic like that, but this chown case is still valid, and udevadm settle won't cover the case where the ceph-disk service times out (as opposed to not yet having finished executing).

Revision history for this message
Trent Lloyd (lathiat) wrote :

The reason for the timeout not working is possibly Bug #1664435?

Revision history for this message
Trent Lloyd (lathiat) wrote :

May also need to target bug to charm-ceph (the helpers)

tags: added: sts
Revision history for this message
Hua Zhang (zhhuabj) wrote :

The following is a log from a case where the OSD is not yet mounted due to the 120 second timeout of the ceph-disk@ service:

May 22 06:05:19 cephosd06 systemd[1]: Starting Ceph disk activation: /dev/sdh1...
May 22 06:05:30 cephosd06 sh[3926]: main_trigger: main_activate: path = /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: get_dm_uuid: get_dm_uuid /dev/sdh1 uuid path is /sys/dev/block/8:113/dm/uuid
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: mount: Mounting /dev/sdh1 on /var/lib/ceph/tmp/mnt.0xG2_W with options noatime,inode64
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/tmp/mnt.0xG2_W
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/osd/ceph-45
May 22 06:05:30 cephosd06 ceph-osd[8052]: starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdh1', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f574b3fc7d0>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
May 22 06:05:31 cephosd06 sh[6944]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: trigger /dev/sdh1 parttype 4fbd7e29-9d25-41b8-afd0-062c0ceff05d uuid ff8f7341-1c1e-4912-b680-41fd6999fcc8
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /usr/sbin/ceph-disk --verbose activate /dev/sdh1
May 22 06:07:20 cephosd06 systemd[1]: <email address hidden>: Main process exited, code=exited, status=124/n/a
May 22 06:07:20 cephosd06 systemd[1]: Failed to start Ceph disk activation: /dev/sdh1.

Revision history for this message
Trent Lloyd (lathiat) wrote :

Some clarity for triage:
This makes it difficult to reboot ceph-osd nodes cleanly, and should probably be addressed with reasonable priority.

Some extra information I've learned: this seems to trigger more often in production deployments where the root disk is an HDD (i.e. not an SSD) with no bcache, and the node is hyper-converged and running multiple containers. The load on the root disk causes many things at startup to run slowly and makes bugs like this more likely to trigger.

Changed in charm-ceph-osd:
importance: Undecided → High
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Related bug that also contributes to startup woes: bug 1783113

Changed in charm-ceph-osd:
milestone: none → 18.11
Revision history for this message
Trent Lloyd (lathiat) wrote :

A workaround for this issue is to change the owner of the underlying directory, using a bind mount to see through to it:

root@maas-node06:~# mount --bind /var/lib/ceph /mnt
root@maas-node06:~# chown ceph:ceph /mnt/osd/ceph-*
root@maas-node06:~# umount /mnt

Revision history for this message
Takashi Sogabe (sogabe) wrote :

Hello Trent,
I also encountered this issue in my cluster. A simple modification of check_for_upgrade() to bypass the problematic logic worked well for my servers.

I wonder why the config_changed->check_for_upgrade charm logic prevents servers from invoking ceph-disk@.
My understanding is that in a non-Juju environment, the corresponding udev rules handle disk-add events and invoke ceph-disk during the boot process.

Can anyone please summarize the entire boot sequence within the charm?

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 18.11 → 19.04
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: New → Confirmed
Trent Lloyd (lathiat)
description: updated
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
assignee: nobody → Trent Lloyd (lathiat)
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: Confirmed → In Progress
Revision history for this message
Trent Lloyd (lathiat) wrote :

Filed Bug #1823822 to handle improving the check_for_upgrade resume logic.

In this bug, I will fix the check so that an empty OSD directory is not treated as needing an ownership change, which solves this problem 99% of the time.
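
As a rough illustration of the direction of that fix (a sketch only, under the assumption that an unmounted mountpoint shows up as an empty directory; the function name is illustrative, not the actual patch):

import os

def dir_needs_ownership_update(path):
    """Only treat a non-empty OSD directory owned by root:root as evidence of
    an interrupted ownership migration. An unmounted mountpoint is an empty
    directory, so it should never trigger the upgrade-resume logic."""
    if not os.path.isdir(path) or not os.listdir(path):
        # Nothing mounted or populated here yet; ignore it.
        return False
    st = os.stat(path)
    return st.st_uid == 0 and st.st_gid == 0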

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651105

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
tags: added: series-upgrade
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This bug seems to be in limbo; the fix for it is held at -1; Trent, please could you take a look at the review and Chris' (icey) comments re: being overtaken by events, and see if it is still current and that this bug still exists? Thanks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-osd (master)

Change abandoned by Trent Lloyd (<email address hidden>) on branch: master
Review: https://review.opendev.org/651105
Reason: This fix was already committed to the charm in https://review.opendev.org/#/c/685879/

Changed in charm-ceph-osd:
milestone: 20.01 → 19.10
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: In Progress → Fix Released