config_changed->check_for_upgrade incorrectly assumes upgrade in progress if OSD not yet mounted / ceph-disk@ has not started

Bug #1779828 reported by Trent Lloyd
This bug affects 7 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: High
Assigned to: Trent Lloyd

Bug Description

The charm sometimes gets stuck waiting for another host to upgrade, even when an upgrade is not actually in progress. You will observe "Waiting on <other_host> to finish upgrading" in the juju unit status, and the config-changed hook stays executing for a very long time.

The hook stays running indefinitely (days or weeks) and blocks the charm from any further actions. The code appears to have a 10 minute timeout, but this was observed not to apply in practice. (Perhaps the timeout was added after the version in use on some sites?)

 = Analysis of the cause =
On startup of a busy server with a large number of OSDs, the ceph-disk processes sometimes fail to start in a timely manner, or at all.

There are a number of causes for this, which will be filed in a separate bug, but it comes down to the following observed situations:
 (a) They are starting but have not yet finished before config_changed fires - particularly likely on hyper-converged nodes (this is a general race condition where the charm assumes OSDs are always mounted without actually checking)
 (b) They time out while starting due to the aggressive 120 second timeout, which is sometimes not enough (this is a bug)
 (c) They fail while starting due to "ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid be6f1d22-6abb-11e8-a411-00163e11b3c4" (this is a bug)

The root cause is mostly irrelevant, however: if for any reason the Ceph OSD filesystem is not mounted when config_changed is called, the charm incorrectly assumes that an upgrade to Jewel is in progress but previously failed, and then attempts to resume it. It then gets stuck waiting on another unit to signal that its upgrade has completed - but that never happens, since the other unit is not in an upgrading state.

The cause is that check_for_upgrade calls ceph.dirs_need_ownership_update, which checks whether any directory in /var/lib/ceph/osd/ is owned by root:root. If so, it reports that an upgrade needs to be resumed (resuming_upgrade) and then wants to chown the directories from root to the ceph user.

However, this condition is also true if the OSD simply isn't mounted yet, as the underlying mountpoint directory exists and is owned by root:root (the visible ownership would only change to that of the Ceph filesystem once it is mounted).
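
For illustration, here is a minimal Python sketch of that kind of ownership check (this is not the charm's actual implementation; the path and function name are assumptions for the example). It shows why a bare, unmounted mountpoint is indistinguishable from an OSD directory whose chown to ceph never completed:

import os
import pwd
import grp

OSD_BASE = '/var/lib/ceph/osd'

def osd_dirs_owned_by_root():
    """Simplified illustration of a dirs_need_ownership_update-style check:
    root:root ownership is taken to mean the chown to ceph:ceph from the
    pre-Jewel layout never completed, so an upgrade must be resumed."""
    suspect = []
    for entry in os.listdir(OSD_BASE):
        path = os.path.join(OSD_BASE, entry)
        if not os.path.isdir(path):
            continue
        st = os.stat(path)
        owner = pwd.getpwuid(st.st_uid).pw_name
        group = grp.getgrgid(st.st_gid).gr_name
        # False positive: if ceph-disk@ has not mounted this OSD yet, 'path'
        # is just the bare mountpoint on the root filesystem, which is also
        # root:root, so it is flagged even though no upgrade was interrupted.
        if (owner, group) == ('root', 'root'):
            suspect.append(path)
    return suspect

Any non-empty result is what sends the charm down the upgrade-resume path described next.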

When this condition occurs, ceph.roll_osd_cluster is called, which then calls into wait_on_previous_node and waits (and you will observe a message like "Waiting on <other_host> to finish upgrading" in the juju status).

The workaround is to restart any failed ceph-disk@xxxx units, ensure the relevant ceph-osd@NN service then starts, and then kill the config-changed hook. The next time it runs, it will work as expected.

In addition to this issue, it was observed in production that even after a longer period of time (20-30 minutes) the hook was still executing and stuck. It is not clear why the 10 minute timeout did not take effect.

This occurred on a production cloud, and I also triggered it once in my test bench where I have 4 VMs using an openstack-base setup. Unfortunately I do not currently have logs from either case, but the above describes the scenario fairly accurately.

Lastly, I tried to create an easy reproducer for this scenario using the latest charm; however, when unmounting the filesystem you now seem to trigger "Non-pristine devices detected" instead, due to the recent changes related to that. I did nonetheless manage to trigger this issue with versions of the charm both with and without that code, so the underlying issue should still be fixed - although this change may mask it in most cases.

Additionally, one of the items I raised at the start of the bug is the general race where the charm assumes all OSDs are mounted when a hook runs, which may not be the case. More general changes are needed to fix logic like that, but this chown case is still valid, and udevadm settle won't cover the case where the ceph-disk service times out (as opposed to not yet having finished executing).

Revision history for this message
Trent Lloyd (lathiat) wrote :

The reason for the timeout not working is possibly Bug #1664435?

Revision history for this message
Trent Lloyd (lathiat) wrote :

May also need to target bug to charm-ceph (the helpers)

tags: added: sts
Revision history for this message
Hua Zhang (zhhuabj) wrote :

The following is a log from a case where the OSD is not yet mounted due to the 120 second timeout of the ceph-disk@ service:

May 22 06:05:19 cephosd06 systemd[1]: Starting Ceph disk activation: /dev/sdh1...
May 22 06:05:30 cephosd06 sh[3926]: main_trigger: main_activate: path = /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: get_dm_uuid: get_dm_uuid /dev/sdh1 uuid path is /sys/dev/block/8:113/dm/uuid
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: mount: Mounting /dev/sdh1 on /var/lib/ceph/tmp/mnt.0xG2_W with options noatime,inode64
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/tmp/mnt.0xG2_W
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/osd/ceph-45
May 22 06:05:30 cephosd06 ceph-osd[8052]: starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdh1', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f574b3fc7d0>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
May 22 06:05:31 cephosd06 sh[6944]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: trigger /dev/sdh1 parttype 4fbd7e29-9d25-41b8-afd0-062c0ceff05d uuid ff8f7341-1c1e-4912-b680-41fd6999fcc8
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /usr/sbin/ceph-disk --verbose activate /dev/sdh1
May 22 06:07:20 cephosd06 systemd[1]: <email address hidden>: Main process exited, code=exited, status=124/n/a
May 22 06:07:20 cephosd06 systemd[1]: Failed to start Ceph disk activation: /dev/sdh1.

Revision history for this message
Trent Lloyd (lathiat) wrote :

Some clarity for triage:
This makes it difficult to reboot ceph-osd nodes cleanly, and should probably be addressed with reasonable priority.

Some extra information I've learned: this seems to trigger more often in production deployments where the root disk is an HDD (i.e. not an SSD) with no bcache, and the node is hyper-converged and running multiple containers. The load on the root disk causes many things at startup to run slowly and makes bugs like this more likely to trigger.

Changed in charm-ceph-osd:
importance: Undecided → High
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Related bug that also contributes to startup woes: bug 1783113

Changed in charm-ceph-osd:
milestone: none → 18.11
Revision history for this message
Trent Lloyd (lathiat) wrote :

A workaround for this issue is to change the owner of the underlying directory, using a bind mount to see through to it:

root@maas-node06:~# mount --bind /var/lib/ceph /mnt
root@maas-node06:~# chown ceph:ceph /mnt/osd/ceph-*
root@maas-node06:~# umount /mnt

Revision history for this message
Takashi Sogabe (sogabe) wrote :

Hello Trent,
I also encountered this issue in my cluster. A simple modification of check_for_upgrade() to bypass the problematic logic worked well for my servers.

I wonder why the config_changed->check_for_upgrade charm logic prevents servers from invoking ceph-disk@.
My understanding is that in a non-Juju environment, the corresponding udev rules handle disk-add events and invoke ceph-disk during the boot process.

Can anyone please summarize the entire boot sequence within the charm?

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 18.11 → 19.04
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: New → Confirmed
Trent Lloyd (lathiat)
description: updated
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
assignee: nobody → Trent Lloyd (lathiat)
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: Confirmed → In Progress
Revision history for this message
Trent Lloyd (lathiat) wrote :

Filed Bug #1823822 to handle improving the check_for_upgrade resume logic.

In this bug, I will fix the check so that an empty OSD directory is not treated as needing an ownership change, which solves this problem 99% of the time.
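
As a rough illustration of the direction of that fix (a sketch only, under the assumption that an unmounted mountpoint shows up as an empty directory; the function name is illustrative, not the actual patch):

import os

def dir_needs_ownership_update(path):
    """Only treat a non-empty OSD directory owned by root:root as evidence of
    an interrupted ownership migration. An unmounted mountpoint is an empty
    directory, so it should never trigger the upgrade-resume logic."""
    if not os.path.isdir(path) or not os.listdir(path):
        # Nothing mounted or populated here yet; ignore it.
        return False
    st = os.stat(path)
    return st.st_uid == 0 and st.st_gid == 0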

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651105

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
tags: added: series-upgrade
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This bug seems to be in limbo; the fix for it is held at -1; Trent, please could you take a look at the review and Chris' (icey) comments re: being overtaken by events, and see if it is still current and that this bug still exists? Thanks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-osd (master)

Change abandoned by Trent Lloyd (<email address hidden>) on branch: master
Review: https://review.opendev.org/651105
Reason: This fix was already committed to the charm in https://review.opendev.org/#/c/685879/

Changed in charm-ceph-osd:
milestone: 20.01 → 19.10
Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: In Progress → Fix Released